Both GRIB1 and GRIB2 have an optional bitmap section which is a bit array for marking missing data. There is a single bit for each data value, so a data array of N points requires N/8 bytes for the bitmap array. A bit value of 0 indicates that the data is missing at that point, so that data value doesn't have to be stored.
Seems like that has to be a good idea, and the more missing values there are, the better. Right?
Turns out that combined with compression, it's not such a good idea, and the more missing values there are, the worse it is. The problem is that the bitmap section is not compressed; only the data section is.
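To make the bitmap idea concrete, here is a minimal Python sketch of splitting a grid into a bitmap plus the values that are actually present. The function name pack_with_bitmap, the use of NaN to mark missing points, and the most-significant-bit-first packing order are all assumptions of this sketch, not something taken from a GRIB library.

```python
import math

def pack_with_bitmap(values):
    """Split values into (bitmap bytes, list of present values).

    Bit 1 = value present, bit 0 = missing (NaN here, by assumption).
    Bits are packed most significant bit first (also an assumption).
    """
    bitmap = bytearray((len(values) + 7) // 8)  # N/8 bytes, rounded up
    present = []
    for i, v in enumerate(values):
        if not math.isnan(v):
            bitmap[i // 8] |= 0x80 >> (i % 8)  # set the bit for point i
            present.append(v)
    return bytes(bitmap), present

nan = math.nan
grid = [nan, 1.5, nan, nan, 2.0, nan, nan, nan, nan, 3.25]
bitmap, present = pack_with_bitmap(grid)
print(bitmap.hex(), present)
```

Note that the bitmap costs the same number of bytes no matter how few values are present, which is where the trouble starts.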
Here's an extreme case: NOAA wave watch data off the west coast and including Hawaii. The grid size is 526 x 736, giving 387,136 points, most of which are missing.
Examining a random record in ToolsUI IOSP/GRIB2/GRIB2data, we see that there are only 11,127 non-missing data values. These are compressed nicely into 7702 bytes by GRIB wavelet compression. The problem is that we need 387136/8 = 48392 bytes to store the bitmap, which is not compressed. That makes the entire GRIB message 56270 bytes.
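The accounting for that record can be checked in a few lines of Python. All the numbers come from the record above; the variable names are mine, and the "other" bytes are simply whatever is left over once the bitmap and data sections are subtracted (presumably the remaining GRIB sections).

```python
nx, ny = 526, 736
npoints = nx * ny                 # total grid points
present = 11127                   # non-missing values in this record
bitmap_bytes = npoints // 8       # one bit per grid point, uncompressed
data_bytes = 7702                 # compressed data section
msg_bytes = 56270                 # whole GRIB message

other_bytes = msg_bytes - bitmap_bytes - data_bytes  # everything else in the message

print(f"points         = {npoints}")
print(f"present        = {present / npoints:.1%} of the grid")
print(f"bitmap         = {bitmap_bytes} bytes")
print(f"bitmap share   = {bitmap_bytes / msg_bytes:.1%} of the message")
print(f"other sections = {other_bytes} bytes")
```

So the bitmap alone is roughly 86% of the message, for a record where under 3% of the points carry data.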
I didn't really think about it until I noticed in my compression tests that other compression schemes are beating GRIB by factors greater than 20 on some records, which is surprising. Using bzip2, for that example record, the entire array including missing values is compressed to a size of 9576 bytes. Deflate (zip) compression compresses it into 25786 bytes. LZMA (7zip) gets it down to 8994 bytes, which is about 6 times smaller than the GRIB message.
Overall, on that entire file, the estimated file sizes for various compression schemes are:
Standard compression algorithms are very, very good at compressing repeated bytes of data. The more missing data, the better they do. On that file, every record was at least a factor of 2 smaller with bzip2 and 7zip compression than the GRIB record size.
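If you want to try this effect yourself without a GRIB file, here is a rough Python sketch using only the standard library. The grid is entirely synthetic (a small hypothetical 200 x 150 grid with a smooth patch of data in one corner and NaN everywhere else), so the sizes it prints say nothing about the wave watch file; it only shows how to compare a full array, missing values included, against the fixed cost of an uncompressed bitmap.

```python
import bz2
import lzma
import math
import struct
import zlib

nx, ny = 200, 150  # hypothetical small grid, not the NOAA grid
values = []
for j in range(ny):
    for i in range(nx):
        if i < 20 and j < 30:  # small patch of data, 2% of the grid
            values.append(round(1.0 + 0.05 * i + 0.02 * j, 2))
        else:
            values.append(math.nan)  # missing value

# The whole array as 4-byte floats, missing values included
raw = struct.pack(f">{len(values)}f", *values)

sizes = {
    "deflate": len(zlib.compress(raw, 9)),
    "bzip2": len(bz2.compress(raw)),
    "lzma": len(lzma.compress(raw)),
}
bitmap_bytes = len(values) // 8  # what an uncompressed bitmap alone would cost

print(f"raw floats = {len(raw)} bytes, bitmap alone = {bitmap_bytes} bytes")
for name, size in sizes.items():
    print(f"{name:8s} = {size} bytes")
```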
I didn't test this, but it's likely that JPEG-2000 wavelet compression would do much better at compressing the data with the missing values left in, compared to the current technique of removing the missing values with an uncompressed bitmap.
(PS: I just noticed that bitmaps can be shared between records, by using the 'repeating section' feature of GRIB2. This will mitigate the above conclusion by an unknown amount).
Here's a gratuitous shot of the ToolsUI IOSP/GRIB2/GRIB2data tab on that example file:
If you right-click on one of the records and choose "Compute Scale/offset of data" from the context menu, you can see some alternate compression sizes for that record. For instance, the example record shows:
  nbits = 10
  npoints = 387136
  width = 1022 (0x3fe)
  scale = 0.0100000
  resolution = 0.00500000
  range = 10.230000

             actual      computed
  dataMin =  1.390000    1.390000
  dataMax =  11.250000   11.620000
  actual range = 9.860000
  scale_factor = 0.00964775
  add_offset = 1.39000
  max_diff = 0.00481459
  avg_diff = 7.18643e-05
  std_diff = 0.000479742

  Compression
    number of values = 387136
    uncompressed as floats = 1548544
    uncompressed packed bits = 483920
    grib data length = 7702
    grib msg length = 56270

  deflate (float)
    compressedSize = 19374
    ratio floats / size = 79.928978
    ratio packed bits / size = 24.977806
    ratio size / grib = 0.344304

  deflate (scaled ints)
    compressedSize = 16339
    ratio floats / size = 94.775932
    ratio packed bits / size = 29.617479
    ratio size / grib = 0.290368

  bzip2 (floats)
    compressedSize = 9771
    ratio floats / size = 158.483673
    ratio packed bits / size = 49.526150
    ratio size / grib = 0.173645

  bzip2 (scaled ints)
    compressedSize = 9455
    ratio floats / size = 163.780426
    ratio packed bits / size = 51.181385
    ratio size / grib = 0.168029
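The scale_factor and add_offset figures in that output are consistent with the usual scale/offset integer packing, which can be sketched in Python as below. This is my reading of the numbers, not ToolsUI's actual code: I am assuming width = 2^nbits - 2 (which does equal the reported 1022), scale_factor = actual range / width, and add_offset = dataMin. The pack and unpack helpers are hypothetical names.

```python
nbits = 10
data_min, data_max = 1.39, 11.25  # "actual" dataMin and dataMax from the output

width = 2**nbits - 2              # 1022 (0x3fe); assumed relation to nbits
scale_factor = (data_max - data_min) / width
add_offset = data_min

def pack(v):
    """Map a float to an integer code in 0..width (hypothetical helper)."""
    return round((v - add_offset) / scale_factor)

def unpack(p):
    """Recover an approximate float from an integer code."""
    return p * scale_factor + add_offset

print(f"scale_factor = {scale_factor:.8f}")
for v in (1.39, 5.0, 11.25):
    p = pack(v)
    print(f"{v} -> {p} -> error {abs(unpack(p) - v):.6f}")

# 10 bits per point over the whole grid, in bytes
print(387136 * nbits // 8)
```

With this scheme the round-trip error can never exceed half a step (scale_factor / 2), which agrees with the reported max_diff sitting just under 0.00482.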
Lots of other info is available from that context menu. Welcome to the inner workings of GRIB sausage making.