Initial Draft: 2017-08-14
Last Revised: 2017-08-14
Author: Dennis Heimbigner, Unidata
Table of Contents
- Revised NetCDF-3 File Format
- Reading Compressed Data
- The Unlimited Dimension
- Blocks versus Chunks
This document proposes adding compression to the netcdf-3 (aka classic) file format. The proposal has numerous limitations because of the nature of the existing netcdf-3 format. It is also not clear if the effort involved would be worth it.
The biggest problems with compression are these.
- A variable whose data is to be compressed needs to be divided into some set of same-size blocks. This is to avoid having to uncompress the whole variable before accessing any part of it.
- Compressing each of the blocks of a variable produces a new set of blocks of varying sizes. Some forms of compression (e.g. scale+offset) do produce compressed blocks all of the same size, but this is rare. Most standard compression algorithms produce varying size blocks.
The algorithm and data format proposed here requires re-writing an existing netcdf-3 file to move it to the new format. In effect, the re-written file becomes archival (read-only). Some special cases would exist but I will defer their consideration.
The key is to assume that a variable's data has been divided into equal size blocks (except possibly the last one). The compressor will be applied to each block to produce a new block to be written to a specific location in the output file.
The obvious problem with this is that because the compressed blocks are of varying size, it is not easy to provide direct access to the start of the i'th block.
The solution proposed here is to create an index table with file offsets pointing to the start of each compressed block. This index is a map from block# => start of block. This index table is easily constructed in one of two ways.
- Create the index and store as part of the metadata for the variable in the file. This can then be read at one shot into memory (if not too large), or it can be accessed as needed if the index is itself too large to read into memory.
- Do an initial pass of the data in the variable. If we assume that each block is preceded by its size, then a series of probes into the variable's data will allow us to construct the index by random-access reads to a small number of points in the file.
Building such a structure is pretty straightforward. Each input block is read, compressed, and then written to the end of the output file. The file offset of that block is stored in the index table. This process, unfortunately, cannot be easily parallelized.
Reading from a compressed variable requires the following steps.
- Compute the set of blocks that need to be accessed to cover the requested data.
- Locate and read each block using the index table.
- Uncompress the just read block.
- Extract the data of interest.
I speculate that using caching on blocks might provide some speed-up, but that requires testing.
The above algorithm is relatively simple for variables that do not have an UNLIMITED (aka record) first dimension. So, how do we handle variables with an initial UNLIMITED Dimension? Internally, the netcdf-c library assumes that all variables that use the unlimited dimension are co-mingled. That is, the data for all variables for unlimited-dimension == 1 are written, then all the data for all variables for unlimited-dimension == 2 are written, and so on. Also note that the unlimited dimension is assumed to be at the end of the file so that it can be extended when the unlimited dimension is increased. Note also that there can be holes in that data if various variables are not written when the unlimited dimension increases.
If we assume the output, compressed file is read-only, then we know the actual size of the UNLIMITED dimension. Thus we can treat it as a normal variable of some known size. This represents a significant change to the netcdf-3 file format vis-a-vis the unlimited dimension. This needs further thought.
I deliberately used the term "block" instead of "chunk" in this discussion because the two are not the same. In order to understand "block", we need to look at a conceptual layout of a variable's data in linear order. Assuming the standard row-major order, we envision laying out the individual elements of the variable sequentially using that ordering. Given that layout, we then divide that sequential order into equal size blocks, except possibly the last (short) block. Blocking can be specified with a single number indicating the block size.
This differs significantly from netcdf-4/HDF5 "chunking" because a chunk is a subset of the variable along each dimension. Specifying the chunks for a rank R variable requires specifying R numbers: one for each dimension.
This difference comes from different goals. The netcdf-4 chunk is intended to address two issues: (1) speed up by access pattern and (2) division of the data for compression.
The netcdf-3 block, as described here, is purely to divide the data for compression purposes. The access pattern is still the same as with an ordinary netcdf-3 file.
The clear question is: is all of this worth it? The answer depends on a number of factors.
Read-only: Is the read-only/archival assumption acceptable? It assumes that there are a large number of datasets that are written once and then never again. Note that infrequent modifications are possible by copying the archived file to a standard format netcdf-3 file, modifying it, and then re-compressing it.
Implementation: How hard is it to implement? The answer is that as described above and assuming read-only is acceptable, the implementation is not all that difficult.
Compression Only: Is it acceptable to provide compression but not provide improved support for access patterns. As noted, this is the big difference with netcdf-4 chunking, which provides both.
Once one thinks of adding index tables to netcdf-3, one can extend the idea more generally. Note that if we treated a netcdf-3 file as a big malloc style heap, many desirable capabilities become possible for netcdf-3 that are currently only available for netcdf-4. See especially reference #2.