In the last post, we saw that simple deflate compression makes, on average, netCDF-4 files approximately 1.32 and 2.24 times larger than GRIB-1 and GRIB-2 files respectively. In this post we look more closely at these results for GRIB-1 files, and in the next post, for GRIB-2 files.
We have sixteen NCEP model runs in GRIB-1 in our sample. NCEP has been converting their model output to GRIB-2 over the years, so all these datasets will be replaced with GRIB-2 or discontinued at some point in the future. All are available on the Unidata TDS server, if you are interested in getting sample data:
GFS_Alaska_191km_20100913_0000.grib1
GFS_CONUS_191km_20100519_1800.grib1
GFS_CONUS_80km_20100513_0600.grib1
GFS_CONUS_95km_20100506_0600.grib1
GFS_Hawaii_160km_20100428_0000.grib1
GFS_N_Hemisphere_381km_20100516_0600.grib1
GFS_Puerto_Rico_191km_20100515_0000.grib1
NAM_Alaska_22km_20100504_0000.grib1
NAM_Alaska_45km_noaaport_20100525_1200.grib1
NAM_Alaska_95km_20100502_0000.grib1
NAM_CONUS_20km_noaaport_20100602_0000.grib1
NAM_CONUS_80km_20100508_1200.grib1
RUC2_CONUS_40km_20100515_0200.grib1
RUC2_CONUS_40km_20100914_1200.grib1
RUC_CONUS_80km_20100430_0000.grib1
In examining these files in detail, I discovered that they all use GRIB-1 "simple packing"; that is, none of them use "complex" (second-order) packing. So, for each GRIB-1 record, a number of bits N is chosen, and each floating point value Y is converted to an N-bit integer X using the formula:
Y × 10^D = R + X × 2^E    (1)

where:
Y = original floating point value
X = stored N-bit integer
R = reference value
D = decimal scale factor
E = binary scale factor
then those N bits are packed into a byte array and written to the GRIB file. For these datasets, on average, the bit-packing makes the files 31% of the size of unpacked 4-byte floating point data. This ratio depends only on which values of N are chosen, so on average N = 0.31 × 32 ≈ 9.92 bits are used to represent the data.
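The unpacking direction of equation (1) can be sketched in a few lines of Java. The values of R, D, E, and X below are made up for illustration; real values come from the record header of each GRIB-1 message.

```java
// Minimal sketch of GRIB-1 simple unpacking, equation (1).
// Assumes the reference value R and scale factors D, E were
// already read from the record header; all values here are invented.
public class Grib1SimpleUnpack {
    static float unpack(int x, float r, int d, int e) {
        // Y * 10^D = R + X * 2^E   =>   Y = (R + X * 2^E) / 10^D
        return (float) ((r + x * Math.pow(2, e)) / Math.pow(10, d));
    }

    public static void main(String[] args) {
        // Example: R = 250.0, D = 1, E = -2, X = 1200
        // Y = (250.0 + 1200 * 0.25) / 10 = 55.0
        System.out.println(unpack(1200, 250.0f, 1, -2));
    }
}
```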
We want to know how netCDF-4 file size would compare to GRIB file size. Since netCDF-4 uses deflate compression, we compare deflate to bit packed data. We use deflate level 3 for all these results.
The first question I wanted to answer was: does deflate compression efficiency depend on N? To find out, I read the data into float arrays, ran the deflate algorithm on each float array, and recorded the size of the resulting "deflated" array. For ease of computing, I used the java.util.zip.Deflater class, which implements the deflate algorithm, rather than the deflater in the netCDF-4 library, which uses the HDF5 library. I don't know for sure whether these are identical, so that is something I still need to check.
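The measurement described above can be sketched like this; the sample data and compression level 3 match the experiment, but everything else (class and method names) is illustrative.

```java
import java.nio.ByteBuffer;
import java.util.zip.Deflater;

// Sketch of the experiment: serialize a float array to bytes,
// run java.util.zip.Deflater at level 3, and measure the output size.
public class DeflateFloats {
    static int deflatedSize(float[] data, int level) {
        ByteBuffer bb = ByteBuffer.allocate(data.length * 4);
        for (float f : data) bb.putFloat(f);

        Deflater deflater = new Deflater(level);
        deflater.setInput(bb.array());
        deflater.finish();

        byte[] out = new byte[data.length * 4 + 64]; // headroom for incompressible input
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(out); // count compressed bytes produced
        }
        deflater.end();
        return total;
    }

    public static void main(String[] args) {
        float[] sample = new float[10000];
        for (int i = 0; i < sample.length; i++) sample[i] = i % 100;
        System.out.println("deflated bytes: " + deflatedSize(sample, 3));
    }
}
```

The ratio of `deflatedSize(...)` to `data.length * 4` is the compression ratio plotted in the figures below.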
I ran this experiment on all the GRIB-1 datasets individually. Since each dataset has a single horizontal domain, the size of each float array is constant over the dataset. Here are detailed results for two datasets, both with around 1000 records, where the records use various values of N. Plotted below is the ratio of the deflated length to the bit-packed length, against the value of N; a ratio greater than one means deflate did worse than simple bit packing, and a ratio less than one means deflate did better.
As you can see, there's no clear pattern that depends on the number of bits, but for smaller numbers of bits there seems to be a wider spread.
Here's a dataset with a larger grid size:
Again, there's no obvious dependency on bit size, but we have a significant improvement over the previous dataset, likely due to the size of the record. Presumably, a dictionary algorithm like deflate needs a large number of points to amortize the cost of the dictionary.
It occurred to me to try running the deflate algorithm on the raw GRIB data, i.e. the bit-packed values themselves rather than the expanded floating point values.
The result is around 25% better when running deflate on the bit-packed data than on the floating point data, despite the fact that the same information is present in both. Clearly the floating point expansion introduces some noise that deflate can't squeeze out. Again, there's no clear dependency of the compressibility of the data on the number of bits. To be clear, at this time netCDF-4 can only store deflated floating point, not bit-packed data.
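The comparison can be sketched as follows: deflate the same values once in packed integer form and once expanded to floats via equation (1). The data, reference value, and scale factors here are all invented; the point is only the mechanics of the comparison, not the 25% figure.

```java
import java.nio.ByteBuffer;
import java.util.zip.Deflater;

// Illustrative comparison: deflate the same synthetic values once as
// packed integers (stored in shorts here for simplicity) and once
// expanded to floats with made-up R = 250, D = 1, E = -2.
public class PackedVsFloat {
    static int deflatedSize(byte[] input) {
        Deflater d = new Deflater(3);
        d.setInput(input);
        d.finish();
        byte[] out = new byte[input.length + 64];
        int total = 0;
        while (!d.finished()) total += d.deflate(out);
        d.end();
        return total;
    }

    public static void main(String[] args) {
        int n = 50000;
        ByteBuffer packed = ByteBuffer.allocate(n * 2);
        ByteBuffer floats = ByteBuffer.allocate(n * 4);
        for (int i = 0; i < n; i++) {
            int x = (i * 37) % 1024;                 // pretend 10-bit packed value
            packed.putShort((short) x);
            floats.putFloat((float) ((250.0 + x * 0.25) / 10.0));
        }
        System.out.println("packed bytes deflated: " + deflatedSize(packed.array()));
        System.out.println("float bytes deflated:  " + deflatedSize(floats.array()));
    }
}
```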
Here's the overall result on all 16 datasets, showing average ratio of deflation of both floating point and bit packed data, as a function of the number of points in the data array:
It appears that deflate compression gets better as the number of points increases, as would be expected. There may be a jump in compression efficiency between 20K and 60K data points, but more runs are needed to confirm this. Expanding the bit-packed data into floating point makes deflated sizes about 35% larger. Deflating NCEP GRIB-1 simple packed data could reduce GRIB data record sizes by 10-30%.
The values of N, D, and E in equation (1) above vary among variables, and among different records of the same variable. I assume they are chosen to guarantee a specified level of precision, though I'm still unclear on the details. In the unforgettable words of the GRIB-1 spec:
92.6.2. Data shall be coded using the minimum number of bits necessary to provide for the accuracy required by international agreement. This required accuracy/precision shall be achieved by scaling the data by multiplication by an appropriate power of 10 (which may be 0) prior to forming the non-negative differences, and then using the binary scaling to select the precision of the transmitted value.
Make it so.
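My reading of 92.6.2 (with binary scaling E = 0) can be sketched as: scale by 10^D, subtract the minimum scaled value (the reference), and use just enough bits to cover the largest non-negative difference. This is my interpretation, not code from the spec, and the sample values are invented.

```java
// Sketch of the bit-count selection described in 92.6.2, assuming E = 0:
// scale by 10^D, form non-negative differences from the minimum,
// and count the bits needed for the largest difference.
public class ChooseBits {
    static int bitsNeeded(float[] data, int d) {
        double scale = Math.pow(10, d);
        long min = Long.MAX_VALUE, max = Long.MIN_VALUE;
        for (float y : data) {
            long v = Math.round(y * scale);
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        long range = max - min;                       // largest non-negative difference
        return 64 - Long.numberOfLeadingZeros(range); // bits to represent 0..range
    }

    public static void main(String[] args) {
        float[] temps = {271.3f, 272.8f, 275.1f, 269.9f}; // made-up values
        System.out.println(bitsNeeded(temps, 1)); // precision of 0.1 -> prints 6
    }
}
```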
Stay tuned for the more interesting case of GRIB-2 datasets.