Part 6: Converting GRIB to NetCDF-4

Based on tests in Java, we decided that bzip2 is a good candidate as an alternative to deflate, which is currently the only standard compression option in netCDF-4. My colleague Dennis Heimbigner created a branch of the netCDF C library that uses the 1.0.6 bzip2 library from bzip2.org, and my colleague Ward Fisher built it for Windows for me to test. Ward had to build it from source, since no prebuilt Windows version exists.

With that substantial help, I was able to copy the NCEP sample model GRIB files to netCDF-4 files with bzip2 compression. To make sure I was getting maximum compression, I used level=9, which uses a 900K block size. However, tests show that the level can be lowered without increasing the file size. A few of the files had to be excluded because the netCDF-Java reader was not reading them completely, and so their file-size ratios were misleading.
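To make the level/block-size relationship concrete, here is a minimal sketch using Python's standard `bz2` module rather than the netCDF C branch described above. In bzip2, compression level k selects a block size of k × 100k, so level 9 corresponds to the 900K block mentioned. The `synthetic_field` and `sizes_by_level` helpers and the synthetic data are assumptions made for illustration; real model output will behave differently.

```python
import bz2
import math
import struct

def synthetic_field(n):
    """Build n smoothly varying float32 values as a stand-in for a model grid (made-up data)."""
    values = [280.0 + 15.0 * math.sin(i / 200.0) + 0.01 * (i % 7) for i in range(n)]
    return struct.pack(f"<{n}f", *values)

def sizes_by_level(raw, levels=(1, 5, 9)):
    """Return {level: compressed size in bytes}; level k selects a k*100k block size."""
    return {level: len(bz2.compress(raw, compresslevel=level)) for level in levels}

raw = synthetic_field(100_000)  # 400,000 bytes of float32 data
result = sizes_by_level(raw)
for level, size in result.items():
    print(f"level={level}: {size} bytes, compressed/raw = {size / len(raw):.3f}")
```

Comparing the printed sizes across levels is one way to check, for your own data, whether a smaller block size costs anything in file size.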

OK, here are the results:

As you can see, the compression ratios range from ~0.4 to 1.8. The average is 1.12; for GRIB-1 it's 0.92, and for GRIB-2 it's 1.20. These are on the plain ole float arrays as read from the GRIB files. The four lowest values among the GRIB-2 files are simple bit-packed, not JPEG-2000 compressed.

There is a chance that the bzip2.org C library may be slightly (2-5%?) less efficient than the 7zip and tadaki Java bzip2 libraries, so we need to investigate whether another bzip2 C implementation might do better.

As previously blogged, Java prototyping indicates that there may be another 7-10% to be gained by doing floating point bit shaving or conversion to integer arrays using scale/offset. We are considering adding these to the netCDF-4 library as "lossy compression" options.
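The two lossy ideas mentioned above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions, not the Java prototype or anything in the netCDF-4 library: the function names (`shave_bits`, `pack_scale_offset`, `bz2_size`), the choice of 12 kept mantissa bits, the 16-bit integer width, and the synthetic data are all hypothetical.

```python
import bz2
import math
import struct

def shave_bits(values, keep_bits):
    """Zero the low (23 - keep_bits) mantissa bits of each value's float32 form."""
    mask = (0xFFFFFFFF << (23 - keep_bits)) & 0xFFFFFFFF
    shaved = []
    for v in values:
        (bits,) = struct.unpack("<I", struct.pack("<f", v))
        (out,) = struct.unpack("<f", struct.pack("<I", bits & mask))
        shaved.append(out)
    return shaved

def pack_scale_offset(values, nbits=16):
    """Map floats onto unsigned integers so that value ~= offset + scale * packed."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / (2 ** nbits - 1) or 1.0  # fall back to 1.0 for a constant field
    packed = [round((v - lo) / scale) for v in values]
    return packed, scale, lo

def bz2_size(fmt, values):
    """Bytes after bzip2 level 9 of the values packed with struct format code fmt."""
    return len(bz2.compress(struct.pack(f"<{len(values)}{fmt}", *values), 9))

# Made-up smooth data standing in for a model field.
values = [280.0 + 15.0 * math.sin(i / 200.0) + 0.01 * (i % 7) for i in range(50_000)]
shaved = shave_bits(values, keep_bits=12)
packed, scale, offset = pack_scale_offset(values)

print("float32 as-is:      ", bz2_size("f", values))
print("bit-shaved float32: ", bz2_size("f", shaved))
print("16-bit scale/offset:", bz2_size("H", packed))
```

Both transforms discard precision before the bytes reach bzip2, which is why they would be "lossy compression" options: the reader gets back values within a known tolerance, not the original bits.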

Meanwhile, we can say with some confidence that bzip2 compression can get us to within 20% of GRIB compression, on average, for NCEP model GRIB output. Your mileage may vary; offer void where taxed or prohibited. A good enough result for now. Thanks again to Dennis and Ward, who jumped in to help in time to present these results at the ECMWF Workshop on Closing the GRIB/NetCDF Gap next week.

Well done, mates, that was massive indeed. Now stay tuned to the BBC for more cricket scores.

 

Comments:

Very interesting discussion; netCDF-4 compression is an important topic. You mention that on average the netCDF ratio is within 20% of GRIB-2, but I see that there are many datasets which are up to 170% of the original size. For users of these less compressible datasets, the sizes could be a big hindrance despite the average size being lower. Are there any patterns among the datasets that do not compress as well? Have there been any developments since September?

Posted by Aaron Braeckel on March 10, 2015 at 02:53 PM MDT #

Don't know where the variation comes from; I might assume JPEG2000 can take advantage of 2D correlations in ways that bag-of-bytes algorithms like bzip2 and deflate can't.

We have some new compression libraries to evaluate. May be some time before I get to it, so if anyone wants to help, I can get you started.

Posted by John Caron on March 10, 2015 at 04:02 PM MDT #

Unidata Developer's Blog
A weblog about software development by Unidata developers*