05 September 2014

Both GRIB1 and GRIB2 have an optional bitmap section which is a bit array for marking missing data. There is a single bit for each data value, so a data array of N points requires N/8 bytes for the bitmap array. A bit value of 0 indicates that the data is missing at that point, so that data value doesn't have to be stored.
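To make the layout concrete, here is a minimal sketch of packing such a bitmap. It is not taken from any GRIB library: the function names are hypothetical, `None` stands in for a missing value, and the most-significant-bit-first ordering within each byte is an assumption of this sketch.

```python
def bitmap_size(n_points):
    """Bytes needed for one bit per point, rounded up to a whole byte."""
    return (n_points + 7) // 8


def build_bitmap(values):
    """Return (bitmap_bytes, present_values).

    A 1 bit means a value is stored for that point, a 0 bit means it is
    missing. Bits are packed most significant bit first (an assumption).
    """
    bitmap = bytearray(bitmap_size(len(values)))
    present = []
    for i, v in enumerate(values):
        if v is not None:
            bitmap[i // 8] |= 0x80 >> (i % 8)
            present.append(v)
    return bytes(bitmap), present


values = [1.5, None, None, 2.0, None, None, None, None, 3.0]
bitmap, present = build_bitmap(values)
print(bitmap.hex(), present)
```

Only the values in `present` would go on to the data section; the bitmap says where they belong in the grid.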

Seems like that has to be a good idea, and the more missing values there are, the better. Right?

Turns out that combined with compression, it's not such a good idea, and the more missing values there are, the worse it is. The problem is that the bitmap section is not compressed; only the data section is.

Here's an extreme case: NOAA wave watch data off the west coast and including Hawaii. The grid size is 526 x 736, giving 387,136 points, most of which are missing.

Examining a random record in ToolsUI IOSP/GRIB2/GRIB2data, we see that there are only 11,127 non-missing data values. These are compressed nicely into 7702 bytes by GRIB wavelet compression. The problem is that we need 387136/8 = 48392 bytes to store the bitmap, which is not compressed. That makes the entire GRIB message 56270 bytes.
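The arithmetic for that record can be spelled out as below. The sizes are the ones quoted above; attributing the leftover bytes to the other GRIB2 sections is my assumption, not something the tool reports.

```python
n_points = 526 * 736            # grid size
bitmap_bytes = n_points // 8    # one bit per point, uncompressed
data_bytes = 7702               # compressed data section
msg_bytes = 56270               # whole GRIB message

# Whatever is left over is presumably the other sections (assumption).
other_bytes = msg_bytes - data_bytes - bitmap_bytes
bitmap_fraction = bitmap_bytes / msg_bytes

print(f"points:        {n_points}")
print(f"bitmap bytes:  {bitmap_bytes}")
print(f"other bytes:   {other_bytes}")
print(f"bitmap share:  {bitmap_fraction:.1%}")
```

So the uncompressed bitmap is roughly 86% of the message, and the compressed data is a small part of what actually gets stored.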

I didn't really think about it until I noticed in my compression tests that other compression schemes were beating GRIB by factors greater than 20 on some records, which is surprising. Using bzip2 on that example record, the entire array including missing values is compressed to 9576 bytes. Deflate (zip) compresses it to 25786 bytes. LZMA (7zip) gets it down to 8994 bytes, which is about 6 times smaller than the GRIB message.
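The same kind of comparison is easy to try with the Python standard library. This is only a sketch on synthetic data, not the wave watch record: the grid size, the smooth patch of values, and the fill value are all made up for illustration, so the sizes it prints will not match the numbers above.

```python
import bz2
import lzma
import math
import zlib
from array import array


def synthetic_grid(nx=200, ny=200, fill=9.999e20):
    """Mostly-missing float grid: a small smooth patch, fill value elsewhere."""
    grid = array("f", [fill]) * (nx * ny)
    for j in range(ny // 4):
        for i in range(nx // 4):
            v = 1.39 + 5.0 * (1.0 + math.sin(i / 9.0) * math.cos(j / 7.0))
            grid[j * nx + i] = round(v, 2)
    return grid


def compressed_sizes(raw):
    """Compress the whole array, missing values included, three ways."""
    return {
        "deflate": len(zlib.compress(raw, 9)),
        "bzip2": len(bz2.compress(raw, 9)),
        "lzma": len(lzma.compress(raw)),
    }


grid = synthetic_grid()
raw = grid.tobytes()
sizes = compressed_sizes(raw)
print(f"raw floats: {len(raw)} bytes, bitmap alone: {(len(grid) + 7) // 8} bytes")
for name, size in sizes.items():
    print(f"{name:8s} {size} bytes")
```

The long runs of the fill value are what these general-purpose compressors handle so well, and the bitmap approach never lets them see those runs.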

Overall, on that entire file, the estimated file sizes for various compression schemes are:

scheme     size (MB)
GRIB       45.72
deflate    28.29
bzip2      12.28
7zip       11.47

Standard compression algorithms are very, very good at compressing repeated bytes of data. The more missing data, the better they do. On that file, every record was at least a factor of 2 smaller with bzip2 or 7zip compression than the GRIB record size.

I didn't test this, but it's likely that JPEG-2000 wavelet compression would do much better at compressing the data with missing values in it, compared to the current technique of removing the missing values with an uncompressed bitmap.

(PS: I just noticed that bitmaps can be shared between records, by using the 'repeating section' feature of GRIB2. This will mitigate the above conclusion by an unknown amount).

Here's a gratuitous shot of the ToolsUI IOSP/GRIB2/GRIB2data tab on that example file:

If you right-click on one of the records and choose "Compute Scale/offset of data" from the context menu, you can see some alternate compression sizes for that record. For the example record it shows:

 nbits = 10
 npoints = 387136
 width = 1022 (0x3fe)
 scale = 0.0100000
 resolution = 0.00500000
 range = 10.230000

            actual     computed
 dataMin =  1.390000   1.390000
 dataMax = 11.250000  11.620000
 actual range = 9.860000
 scale_factor = 0.00964775
 add_offset = 1.39000

 max_diff = 0.00481459
 avg_diff = 7.18643e-05
 std_diff = 0.000479742

 Compression
   number of values = 387136
   uncompressed as floats = 1548544
   uncompressed packed bits = 483920
   grib data length = 7702
   grib msg length = 56270

 deflate (float)
   compressedSize = 19374
   ratio floats / size = 79.928978
   ratio packed bits / size = 24.977806
   ratio size / grib = 0.344304

 deflate (scaled ints)
   compressedSize = 16339
   ratio floats / size = 94.775932
   ratio packed bits / size = 29.617479
   ratio size / grib = 0.290368

 bzip2 (floats)
   compressedSize = 9771
   ratio floats / size = 158.483673
   ratio packed bits / size = 49.526150
   ratio size / grib = 0.173645

 bzip2 (scaled ints)
   compressedSize = 9455
   ratio floats / size = 163.780426
   ratio packed bits / size = 51.181385
   ratio size / grib = 0.168029
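Several of those numbers can be reproduced with a few lines of arithmetic. The sketch below is my reading of the output, not the ToolsUI code: the helper names are hypothetical, and the choice of width = 2^nbits - 2 (presumably keeping integer codes free for missing values) is inferred from the printed width of 1022.

```python
def scale_offset(data_min, data_max, nbits):
    """Linear packing parameters for nbits-wide integer codes."""
    width = (1 << nbits) - 2          # 1022 for nbits = 10 (inferred)
    scale_factor = (data_max - data_min) / width
    return scale_factor, data_min     # add_offset is the data minimum


def pack(value, scale_factor, add_offset):
    """Nearest integer code for a value."""
    return round((value - add_offset) / scale_factor)


def unpack(code, scale_factor, add_offset):
    """Value represented by an integer code."""
    return code * scale_factor + add_offset


nbits, npoints = 10, 387136
scale_factor, add_offset = scale_offset(1.39, 11.25, nbits)
packed_bytes = npoints * nbits // 8   # "uncompressed packed bits"
float_bytes = npoints * 4             # "uncompressed as floats"

print(f"scale_factor = {scale_factor:.8f}")
print(f"packed bytes = {packed_bytes}, float bytes = {float_bytes}")
print(f"deflate (float) ratio size / grib = {19374 / 56270:.6f}")
```

Rounding to the nearest code means a value can be off by at most half of scale_factor after unpacking, which is in line with the max_diff reported above.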

Lots of other info is available from that context menu. Welcome to the inner workings of GRIB sausage making.
