The steady state of disks is full. --Ken Thompson
Introduction
From our support questions, it appears that the major feature of netCDF-4 attracting users to upgrade their libraries from netCDF-3 is compression. The netCDF-4 libraries inherit the capability for data compression from the HDF5 storage layer underneath the netCDF-4 interface. Linking a program that uses netCDF to a netCDF-4 library allows the program to read compressed data without changing a single line of the program source code. Writing netCDF compressed data only requires a few extra statements. And the nccopy utility program supports converting classic netCDF format data to or from compressed data without any programming.
Why can't netCDF data compression be handled with a popular general-purpose data compression utility, such as zip, gzip, or bzip2? Those utilities work fine for whole-file compression, but can't be applied to accessing a small slab of data from a large multidimensional data set without uncompressing the whole file first. Similarly, updating a small number of data values in a large file would require first uncompressing the whole file, updating the values, and then recompressing the file. The ability to access a small subset of data without first accessing all the data in compressed form is absolutely necessary for acceptable performance.
Chunking (sometimes called "multidimensional tiling") is one way to support both fast access to subsets of data and efficient data compression. (Another way is space-filling curves, which may be the subject of a later blog.) Earlier blogs in this series discussed chunking, so I'll assume some familiarity with chunking concepts in the following.
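To make the idea concrete, here is a minimal sketch of chunked compression using only Python's standard zlib module. It is not how the HDF5 layer is implemented; the names write_chunked, read_value, and CHUNK_VALUES are made up for illustration, and the data is one-dimensional to keep things short. The point is only that each chunk is compressed independently, so reading one value requires decompressing one chunk rather than the whole data set.

```python
import struct
import zlib

CHUNK_VALUES = 1000  # values per chunk (an arbitrary choice for this sketch)


def write_chunked(values, chunk_values=CHUNK_VALUES, level=1):
    """Compress each chunk of 32-bit integers independently."""
    chunks = []
    for start in range(0, len(values), chunk_values):
        block = values[start:start + chunk_values]
        raw = struct.pack(">%di" % len(block), *block)  # big-endian int32
        chunks.append(zlib.compress(raw, level))
    return chunks


def read_value(chunks, index, chunk_values=CHUNK_VALUES):
    """Read one value by decompressing only the chunk that holds it."""
    chunk_no, offset = divmod(index, chunk_values)
    raw = zlib.decompress(chunks[chunk_no])
    return struct.unpack_from(">i", raw, 4 * offset)[0]


values = [i // 7 for i in range(10_000)]  # slowly varying sample data
chunks = write_chunked(values)
print(len(chunks), "chunks; value at index 4321 is", read_value(chunks, 4321))
```

A whole-file compressor such as gzip would instead have to inflate everything up to the requested value before it could be read.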
Also, we're only dealing with lossless compression, so that every bit of the original data can be recovered. Note that in this context, no algorithm can guarantee to compress every data file. The shortest representation of a sequence of random bits is almost certainly the sequence itself.
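The claim about random bits is easy to check with a short experiment. The sketch below uses Python's standard zlib module (the same deflate algorithm family, but not the netCDF-4 code path) on two made-up inputs: bytes from the operating system's random source and a highly repetitive string. Both round-trip exactly, but only the repetitive input gets smaller.

```python
import os
import zlib

random_bytes = os.urandom(100_000)    # effectively incompressible
repeated_bytes = b"netCDF" * 20_000   # highly redundant, 120,000 bytes

for name, data in (("random", random_bytes), ("repeated", repeated_bytes)):
    packed = zlib.compress(data, 9)
    # Lossless: decompression recovers every bit of the input.
    assert zlib.decompress(packed) == data
    print(name, len(data), "bytes ->", len(packed), "bytes")
```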
Compression 101: the Short List
Some Rules of Thumb for netCDF compression:
- Variable data is compressed, not metadata such as attributes.
- The data writer must specify a deflation level, from 1 to 9.
- Level 1 is intended to compress faster than level 9.
- Level 9 is intended to compress smaller than level 1.
- The data writer may opt for shuffling (byte interlacing).
- Shuffling often improves compression for little cost.
- Use of netCDF unlimited dimensions may degrade compression.
- A compressed variable must be chunked.
- Chunk shapes may have large effects on compression and performance.
- Default chunking may not result in good compression.
- Large chunks improve compression, but slow subset access.
- To experiment with netCDF compression and chunking, use nccopy.
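The rules about chunk shapes can be illustrated without netCDF at all. The following sketch is an assumption-laden stand-in: it uses Python's zlib rather than the HDF5 filter pipeline, synthetic data, and a hypothetical helper named chunked_size. It compresses the same bytes in independent chunks of different sizes and reports the total. Every compressed chunk carries some fixed overhead, so very small chunks can make the "compressed" data larger than the original.

```python
import struct
import zlib


def chunked_size(raw, chunk_bytes, level=1):
    """Total size when each chunk of raw is compressed on its own."""
    total = 0
    for start in range(0, len(raw), chunk_bytes):
        total += len(zlib.compress(raw[start:start + chunk_bytes], level))
    return total


# Synthetic data: runs of 50 identical 32-bit values.
values = [1000 + (i // 50) % 20 for i in range(50_000)]
raw = struct.pack(">%di" % len(values), *values)  # 200,000 bytes

for chunk_bytes in (4, 400, 40_000, len(raw)):
    print("chunk of", chunk_bytes, "bytes ->",
          chunked_size(raw, chunk_bytes), "bytes total")
```

The other side of the trade-off is not shown here: with one large chunk, reading any single value means decompressing all of it.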
An Expanded Example
Consider an example of real-world processed data, a netCDF classic format file of roughly 3.8 MB, proc.nc. This is not a huge file, but an archive of a few thousand of these could take enough space to make compression worthwhile.
First, let's just try a level 1 compression to see if it saves any disk space:
    $ nccopy -d1 proc.nc tmp.nc    # 15.9 sec
    $ ls -l proc.nc tmp.nc
    -rw-rw-r-- 1 russ ustaff 3843892 Jul 15 2013 proc.nc
    -rw-rw-r-- 1 russ ustaff 5026525 Mar 31 12:12 tmp.nc
Wait, what?! It looks like either the data is just random numbers (essentially incompressible), netCDF-4 does expansion instead of compression, or we've done something wrong. At least nccopy seemed to take an unusually long time, about 16 seconds, to do a bad job of compression on this relatively small file. Let's see if any of our Rules of Thumb above provide a hint about what might be going on.
There are lots of variables and attributes, hence lots of metadata in the file, which will not be compressed. We can get a rough measure of how much space the attributes take by just looking at the size of the CDL schema that "ncdump -h" provides:
    $ ncdump -h proc.nc | wc -c    # count bytes of metadata
    33414
But that's still only about 1% of the size of the file, so there should be plenty of data to compress. Maybe we just need to try a higher level of compression.
    $ nccopy -d9 proc.nc tmp.nc    # 1.6 times smaller, 29.5 sec
    $ ls -l proc.nc tmp.nc
    -rw-rw-r-- 1 russ ustaff 3843892 Apr 1 14:56 proc.nc
    -rw-rw-r-- 1 russ ustaff 2347677 Apr 1 14:59 tmp.nc
The compressed file is at least smaller than the original, but the level 9 deflation did take about 85% longer than level 1 deflation. It turns out we can do significantly better.
First, let's try the "shuffling" option. Despite the name, this doesn't really just randomly shuffle the data bytes in the file to see if they compress better. Instead, it stores the first byte of all of a variable's values in the chunk contiguously, followed by all the second bytes, and so on. If the values are not all wildly different, this can make the data more easily compressible. For example, if the data are all relatively small non-negative integers, their first bytes will all be zero, and any level of deflation will do well with a run of zero bytes.
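Here is a minimal sketch of the byte rearrangement just described, assuming 4-byte big-endian unsigned integers and using Python's zlib in place of the netCDF-4/HDF5 filters. The functions shuffle and unshuffle are illustrative, not library calls, and the data is synthetic: small non-negative integers whose three high-order bytes are all zero.

```python
import random
import struct
import zlib


def shuffle(raw, width):
    """Store byte 0 of every value, then byte 1 of every value, and so on."""
    return b"".join(raw[i::width] for i in range(width))


def unshuffle(shuffled, width):
    """Invert shuffle(), restoring the original byte order."""
    count = len(shuffled) // width
    out = bytearray(len(shuffled))
    for i in range(width):
        out[i::width] = shuffled[i * count:(i + 1) * count]
    return bytes(out)


rng = random.Random(42)
values = [rng.randrange(256) for _ in range(10_000)]  # small integers
raw = struct.pack(">%dI" % len(values), *values)      # 40,000 bytes

for level in (1, 9):
    plain = len(zlib.compress(raw, level))
    shuffled = len(zlib.compress(shuffle(raw, 4), level))
    print("level", level, "plain:", plain, "shuffled:", shuffled)
```

With this made-up data the shuffled bytes compress better at both levels, because the zero bytes end up in one long run. Real data may behave differently, so the only way to know is to try it on the file in question.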
    $ nccopy -d9 -s proc.nc tmp.nc    # 1.5 times smaller, 31.0 sec
    $ ls -l proc.nc tmp.nc
    -rw-rw-r-- 1 russ ustaff 3843892 Apr 1 14:56 proc.nc
    -rw-rw-r-- 1 russ ustaff 2532405 Apr 1 15:05 tmp.nc
That's a disappointment: shuffling didn't help at all in this case. In fact, level 9 deflation works better without shuffling, though shuffling turns out to make level 1 deflation a little less bad.
The next Rule of Thumb is about unlimited dimensions interfering with compression. If we look more closely at proc.nc using "ncdump -h", it turns out that its only dimension, "time", is unlimited. That's convenient for variables that we want to extend incrementally, or process one record at a time, but once we put data into an archive, it is often better to store the values of each such variable contiguously, as a time series.
The nccopy utility has an option, "-u", that eliminates unlimited dimensions by making them ordinary fixed-size dimensions. It can be applied in the same step as compression or in a separate step beforehand.
If we try that first, compression is both significantly better and faster:
    $ nccopy -u proc.nc procu.nc     # fix unlimited dim, 3.8 sec
    $ nccopy -d1 procu.nc tmp1.nc    # 2.64 times smaller, 1.3 sec
    $ nccopy -d9 procu.nc tmp9.nc    # 2.84 times smaller, 2.4 sec
    $ ls -l procu.nc tmp1.nc tmp9.nc
    -rw-rw-r-- 1 russ ustaff 3843892 Apr 1 15:15 procu.nc
    -rw-rw-r-- 1 russ ustaff 1454141 Apr 1 15:17 tmp1.nc
    -rw-rw-r-- 1 russ ustaff 1353057 Apr 1 15:17 tmp9.nc
The test above created an intermediate file, procu.nc, that had the unlimited dimension changed to fixed size, which reorganized the data in the file to be contiguous for each variable. That way the unlimited dimension only needed to be eliminated once, for two compression timings. However, nccopy can also handle the "-u" and "-d" options in a single invocation, eliminating the need for an intermediate file at the cost of more total time:
    $ nccopy -u -d1 proc.nc tmp1.nc    # 2.64 times smaller, 9.0 sec
    $ nccopy -u -d9 proc.nc tmp9.nc    # 2.84 times smaller, 10.2 sec
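For reference, the "times smaller" figures quoted in the comments above are just the original size divided by the compressed size. The short sketch below recomputes them from the byte counts in the ls listings; the helper name times_smaller and the dictionary layout are only for illustration.

```python
ORIGINAL = 3843892  # bytes in proc.nc, from the ls listings above

# Compressed sizes in bytes, keyed by the nccopy options that were used.
results = {
    "-d1": 5026525,
    "-d9": 2347677,
    "-d9 -s": 2532405,
    "-u -d1": 1454141,
    "-u -d9": 1353057,
}


def times_smaller(original, compressed):
    """Compression ratio; a value below 1 means the file grew."""
    return original / compressed


for options, size in results.items():
    ratio = times_smaller(ORIGINAL, size)
    print(f"nccopy {options:8s} {ratio:.2f} times smaller")
```

The first entry comes out below 1, which is the expansion that prompted the investigation in the first place.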
The next few Rules of Thumb are about chunking and chunk shapes. We could look at the default chunking used in the compressed file with "ncdump -h -s tmp.nc" and try to improve it, but that's material for another blog.
In the meantime, try to avoid inflated expectations ...
Russ, great article, as usual. Is there a tool to show the chunk sizes of existing netcdf4 files?
Posted by Rich Signell on April 08, 2014 at 01:56 PM MDT #
Thanks, Rich.
The easiest way to show chunk sizes is with the ncdump "-s" option. In combination with the "-h" option, it shows just the header info plus the "special" virtual attributes, which include chunk shapes, compression level, etc.
Posted by Russ Rew on April 08, 2014 at 08:58 PM MDT #
Does this new version (v4.3.21) deprecate the Nujan library?
Posted by Dan on May 15, 2014 at 06:50 PM MDT #
"is almost certainly" rather than "is simply".
Thanks for a very useful article.
Posted by me on August 05, 2015 at 04:55 AM MDT #
"The shortest representation of a sequence of random bits is simply the sequence itself."
Sadly, we must define "Random"... because we probably do not mean Random in the theoretical sense.
In practice, the Shannon/Nyquist camp would say such a measure is based on entropy, and this is one of the better upper bounds known to me. Kolmogorov would argue it's the smallest program that would generate your bitstream, potentially a much lower bound for a non-trivially sized bitstream.
Following Kolmogorov, the shortest representation of a sequence of pseudo-random bits is simply the program of its generator, and he would be correct for any bitstream longer than the code for its RNG.
"almost certainly" depends strongly on being almost random, what knowledge we have a priori, or can be learned a posteriori.
For the common scientific data I see Netcdf being applied to, much data typically have similar dynamic range for long runs, and so their numerical representation then mapped with a scheme like shuffle would imply long contiguous runs of the same bit, which is optimal for compression. Actually in this case, one could potentially beat deflate by simply accounting for the location changing parity until the runs become small enough to change back to another scheme (or none at all). Perhaps a future method in the making for someone so inclined. :)
Posted by Garth Vader on January 22, 2016 at 05:52 AM MST #