Finally I'm starting to look at the possibilities of rewriting GRIB files into NetCDF/CF format. I am going to use the NCEP model data from the IDD as the target datasets, and explore the issues that arise in converting them to netCDF-4.
One of the strengths of GRIB is its ability to store data compactly. The NCEP data mostly uses bit packing and JPEG-2000 wavelet compression. For each 2D slice to store, the range (min, max) of the data, along with the number of bits to store (N), determines a scale and offset, which are used to convert each floating-point value to an N-bit integer. I would guess that N is chosen for each variable based on the known precision of the variable, although in principle N can be different for each 2D slice of data. Because the scale/offset can be different for each 2D slice, this algorithm works well for atmospheric data, where the magnitude of a variable can differ greatly between vertical levels. The scale/offset used by the netCDF classic model, in contrast, must be applied to the entire variable, and is therefore less accurate. Effectively, GRIB gives us relative precision ("4 significant digits of accuracy") while netCDF scale/offset gives us absolute precision ("accurate to .01 meters").
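To make the per-slice packing concrete, here is a minimal Python sketch of the scale/offset idea. The function names and toy data are mine, not from any GRIB library; real GRIB encoders of course work on binary slices, not Python lists.

```python
def pack_slice(values, nbits):
    """Quantize one slice of floats to N-bit unsigned integers using a
    per-slice scale/offset, GRIB-style."""
    vmin, vmax = min(values), max(values)
    nlevels = (1 << nbits) - 1                  # largest N-bit unsigned int
    scale = (vmax - vmin) / nlevels if vmax > vmin else 1.0
    packed = [round((v - vmin) / scale) for v in values]
    return packed, scale, vmin                  # offset = slice minimum

def unpack_slice(packed, scale, offset):
    return [p * scale + offset for p in packed]

# Two "vertical levels" with very different magnitudes: each slice gets
# its own scale/offset, so both retain close to N bits of precision.
for slab in ([1013.2, 1012.8, 1013.0, 1012.5],   # surface pressure-like
             [102.1, 101.9, 102.0, 101.8]):      # much smaller values aloft
    packed, scale, offset = pack_slice(slab, 12)
    restored = unpack_slice(packed, scale, offset)
    worst = max(abs(a - b) for a, b in zip(slab, restored))
    print(worst <= scale / 2)   # quantization error is at most half a step
```

A single scale/offset over both slabs would have to span the full 101.8 to 1013.2 range, so each slab would get far fewer effective quantization levels; that is the netCDF classic limitation described above.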
So the question is, if we rewrite GRIB into netCDF-4, what size are the netCDF files relative to GRIB? The standard compression algorithm for netCDF-4 / HDF5 is the deflate algorithm, also used by gzip, zip, png, HTTP, etc. Deflate is a lossless dictionary encoder, which looks for patterns in the data and stores common patterns efficiently. A completely random set of numbers won't have patterns, and so will not compress. Floating point data can be difficult to compress because the lower bits of the mantissa are essentially random. However, the GRIB data already has those random bits removed when it's converted from floating point to N-bit integers. So there's hope that deflate might do a good job on GRIB data.
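You can see this effect with nothing but the Python standard library, since zlib is the same deflate implementation. The smooth sine field below is just a stand-in for real model output:

```python
import math
import struct
import zlib

# A smooth synthetic field, standing in for one model variable.
field = [math.sin(i / 50.0) * 100.0 for i in range(10_000)]

# Raw IEEE doubles: the low-order mantissa bits are essentially random,
# so deflate finds few repeated patterns to exploit.
raw = struct.pack(f"{len(field)}d", *field)

# GRIB-style quantization to 16-bit integers throws those bits away.
vmin, vmax = min(field), max(field)
scale = (vmax - vmin) / 65535
quantized = struct.pack(f"{len(field)}H",
                        *(round((v - vmin) / scale) for v in field))

print(len(raw), "->", len(zlib.compress(raw, 4)))              # 80000 doubles bytes
print(len(quantized), "->", len(zlib.compress(quantized, 4)))  # 20000 int bytes
```

On data like this the quantized bytes start out 4x smaller and also deflate better, byte for byte, than the raw doubles, which is exactly why GRIB-packed data gives deflate a fighting chance.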
To start I will use netCDF-4 "out of the box", and see what compression ratios can be achieved, and then investigate what, if anything, can be done to make netCDF files as small as GRIB.
When using deflate, you can choose deflate levels of 0-9, with 0 meaning no compression and 9 the highest compression, with a corresponding trade-off in the time it takes to compress. HDF5 also offers a "shuffle" option, which rearranges the bytes of the data to make them compress better. My first experiment was to rewrite a GRIB file into netCDF with and without shuffle, and at different deflate levels:
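For reference, shuffle is just a byte transposition: instead of storing elements byte-by-byte (a0 a1 a2 a3, b0 b1 b2 b3, ...), it stores all the first bytes, then all the second bytes, and so on, so that slowly-varying high-order bytes land next to each other before deflate sees them. A minimal Python sketch of the idea (my own, not the HDF5 source):

```python
def shuffle(data: bytes, elem_size: int) -> bytes:
    """HDF5-style shuffle: byte 0 of every element, then byte 1, etc."""
    n = len(data) // elem_size
    return bytes(data[i * elem_size + b]
                 for b in range(elem_size) for i in range(n))

def unshuffle(data: bytes, elem_size: int) -> bytes:
    """Invert shuffle(), restoring the original element order."""
    n = len(data) // elem_size
    out = bytearray(len(data))
    pos = 0
    for b in range(elem_size):
        for i in range(n):
            out[i * elem_size + b] = data[pos]
            pos += 1
    return bytes(out)

# Two 4-byte elements: shuffle interleaves their bytes column-wise.
print(shuffle(bytes([1, 2, 3, 4, 5, 6, 7, 8]), 4))
# b'\x01\x05\x02\x06\x03\x07\x04\x08'
```

Whether this actually helps depends on the data: for already-quantized GRIB values the byte columns may not be any more compressible, which is consistent with the results below.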
This chart plots the compression ratio (bigger is better) of the resulting file against the time to rewrite the entire file (read GRIB, compress, write netCDF-4). I used a single NCEP GRIB-2 file (WW3_Global_20140804_0000.grib2) with an original size of 115 MB, which uses JPEG-2000 compression. I used the netCDF-Java library version 4.5.2, with a JNI interface to the netCDF C library version 4.3.2. Tests on other files lead me to believe this result is typical. (The time of the first point of the shuffle data is an artifact of file cache spin-up.)
The results show that shuffle makes the file about 10% bigger (3.5/3.2 = 1.1). As the deflate level goes from 1 to 4, the file gets about 10% smaller, at a cost of about 5% extra time (remember this includes the entire copy). Deflate levels greater than 4 don't seem to make the file much smaller for the extra time spent compressing (level 9 was off the chart).
In conclusion, for these kinds of files at least, don't use the shuffle filter and use a deflate level of 1-4.
Next time: netCDF-4 vs GRIB file sizes.