This document outlines a proposal to create an alternate Netcdf-4 file format targeted at high-performance, READ-ONLY access. For the purposes of this document, this format will be called NCX.
Limitations of the Existing Netcdf-4 Format
It is currently the case that the Netcdf-4 file format uses the existing HDF5 file format to store its data. From a high-performance point of view, the HDF5 format is limited in a number of ways.
- It does not support multi-threaded access; currently all API calls must be serialized using a single global lock.
- MPIO support is provided, but is totally embedded in the HDF5 library. There is no ability for user control and optimization.
- The HDF5 file format is completely fixed and opaque, and there is limited support for performance-specific organizations. The two exceptions are:
- Chunking parameterization is allowed to control how data is co-located.
- Compression (on a per-chunk basis) allows data to be compressed thus supporting faster reads.
Rationale for a New NCX Format
What is being proposed is a new format for read-only access to "Netcdf-4-like" files that provides the following capabilities.
- A simple-as-possible file format with a specification independent of any implementation.
- Keeping the existing Netcdf-4 data-model.
- Some ability to re-arrange the data in the file to support specific access patterns. This would include keeping the HDF5 chunking and compression concepts.
- Support for community-developed tools that can re-organize the data within a file.
In addition, NCX is intended to be sufficiently simple that multiple, independent implementations can be constructed in a variety of programming languages. This is in contrast to the situation with HDF5, where the file format is so complex that only one complete implementation exists: the one provided by The HDF Group.
A Draft File Format
The NCX format proposed in this section is preliminary. Alternative proposals are encouraged.
The basic format builds on the concept of a single-file file system format (aka SFFS).
The basic idea is that a single file is organized to contain a file system (a root plus inodes plus data blocks), all within a single file that is treated as if it were a heap.
The SFFS approach has a number of useful properties.
Simplicity: The basic SFFS layout is relatively simple. As with an on-disk file system, it uses a superblock plus a set of inodes each of which points to a tree of data blocks. Such an organization avoids the complexity of e.g. the HDF5 b-trees while providing a very general data layout.
Dynamicity: As with a normal file system, a file in the SFFS can be extended (or shortened) in size dynamically at the end of the file.
Annotation: Since the SFFS simulates a file system, it is possible to add information about existing content in the SFFS. In effect, one can create a file that provides "annotations" about other files in the SFFS; these annotations can include, for example, indices pointing into an existing file.
Capability for Reorganization: As long as the basic inode structure is maintained, it is possible to move chunks of data around to support better IO performance. One could even redivide the existing data into larger or smaller data chunks.
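To make the SFFS idea concrete, here is a minimal sketch of how fixed-size superblock and inode records might be laid out at the head of the single file. All field names, field sizes, and the `NCX` magic value are illustrative assumptions, not part of any specification.

```python
import struct

# Hypothetical on-disk record layouts for an SFFS; every field here is an
# illustrative assumption, not part of any NCX specification.
SUPERBLOCK = struct.Struct("<8sQQQ")  # magic, inode-table offset, inode count, total file length
INODE = struct.Struct("<QQQ")         # virtual-file length, flags, offset of root data-block tree

def pack_superblock(inode_table_off, n_inodes, file_len):
    """Pack the superblock record that sits at offset 0 of the single file."""
    return SUPERBLOCK.pack(b"NCX\x00\x00\x00\x00\x00",
                           inode_table_off, n_inodes, file_len)

sb = pack_superblock(64, 3, 4096)
magic, table_off, n, length = SUPERBLOCK.unpack(sb)
```

Because all records are fixed-size, a reader can locate the inode table, and from there any virtual file, with simple offset arithmetic rather than b-tree traversal.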
Mapping Netcdf-4 to an SFFS
Meta-data: The metadata about the netcdf-4 file can itself be contained in a single, virtual file in the SFFS.
Primitive-Typed Variables: Consider a variable consisting of primitive types of fixed size: signed or unsigned integers of various sizes, enums, or chars. Assume the dimensions are all of fixed size (not unlimited).
Such a variable can be easily laid out in a contiguous format, possibly using hdf5 style chunking and compression.
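As an illustration of the contiguous, chunked layout just described, the following sketch computes which chunk holds a given element and where that chunk would begin in the variable's file. It assumes uncompressed, equal-size chunks stored in row-major order over chunk coordinates; compressed chunks would instead require a per-chunk offset/size table.

```python
# Sketch: locate the chunk holding one element of a fixed-size variable
# stored as a dense grid of equal-size chunks.  Row-major chunk ordering
# and uncompressed chunks are assumptions for illustration.
def chunk_of(index, chunk_shape):
    """Chunk coordinates and within-chunk position of one element."""
    coords = tuple(i // c for i, c in zip(index, chunk_shape))
    within = tuple(i % c for i, c in zip(index, chunk_shape))
    return coords, within

def chunk_file_offset(coords, dims, chunk_shape, chunk_bytes, data_start=0):
    """Byte offset of a chunk, assuming chunks are stored contiguously."""
    n_chunks = [-(-d // c) for d, c in zip(dims, chunk_shape)]  # ceiling division
    linear = 0
    for coord, n in zip(coords, n_chunks):
        linear = linear * n + coord  # row-major linearization of chunk coords
    return data_start + linear * chunk_bytes
```

For a 6 x 8 variable with 2 x 4 chunks, element (3, 5) falls in chunk (1, 1), the fourth chunk in row-major order.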
Unlimited Dimensions Case 1 (Initial Unlimited): Extending the previous case, a primitive-typed variable might have one or more unlimited dimensions. For the case of a single, initial unlimited dimension, the variable can be laid out exactly as one with no unlimited dimension. This is because it is possible to dynamically extend a file to accommodate changes in the size of the unlimited dimension.
Unlimited Dimensions Case 2 (Multiple Unlimited): Consider the following.

    dimensions: d1=..., d2=..., d3=..., du=UNLIMITED;
    variables: int v(d1,d2,du,d3);
For this case, we have a number of options. One option (assuming read-only access, as we are) is to begin the file containing v with n intra-file offsets pointing to the subparts of the variable defined by the unlimited dimension. That is, for this example, we have an initial index of d1 x d2 offsets, where each offset points to the start of one subarray of size du x d3. This case generalizes to multiple unlimited dimensions in the obvious(?) way.
Note how this differs from the netcdf-3 case where all variables with an unlimited dimension are co-mingled. However also note that we could re-organize this in a variety of ways to support parallel IO for specific access patterns.
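The Case-2 offset index can be sketched as follows for v(d1, d2, du, d3). The 8-byte little-endian offsets and the placement of the index at the head of v's file are assumptions for illustration.

```python
# Sketch of the Case-2 layout: the file holding v(d1, d2, du, d3) begins
# with d1*d2 intra-file offsets, each pointing at one du-by-d3 subarray.
# 8-byte offsets and a head-of-file index are assumptions.
import struct

def build_offset_index(d1, d2, du, d3, elem_size, data_start=None):
    """Return (packed index bytes, list of subarray offsets)."""
    n_offsets = d1 * d2
    if data_start is None:
        data_start = 8 * n_offsets          # subarrays follow the index itself
    sub_bytes = du * d3 * elem_size
    offsets = [data_start + k * sub_bytes for k in range(n_offsets)]
    return struct.pack(f"<{n_offsets}q", *offsets), offsets

def element_offset(offsets, i1, i2, iu, i3, d2, d3, elem_size):
    """Byte offset of v[i1, i2, iu, i3], found via the offset index."""
    return offsets[i1 * d2 + i2] + (iu * d3 + i3) * elem_size
```

A reorganizing tool could rewrite just this index (and relocate subarrays) to favor a particular access pattern without touching the rest of the file.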
String-Typed Variables: This is fairly easy: one can store each string with a preceding count, with the strings laid out linearly and some form of index pointing to the offset of each string.
Even simpler, and again because the file is read-only, is to store every string in a slot of the maximum string size. This produces internal fragmentation, but allows us to treat strings as fixed-size objects.
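The two string layouts can be sketched side by side. The 4-byte length prefix in layout (a) and the NUL padding in layout (b) are assumptions for illustration.

```python
# Sketch of the two string layouts: (a) count-prefixed strings plus an
# offset index, and (b) fixed-size slots padded to the longest string.
import struct

def pack_counted(strings):
    """Layout (a): each string preceded by its byte count, plus an offset index."""
    blob, offsets = bytearray(), []
    for s in strings:
        offsets.append(len(blob))
        data = s.encode("utf-8")
        blob += struct.pack("<I", len(data)) + data
    return bytes(blob), offsets

def pack_fixed(strings):
    """Layout (b): every string padded to the maximum size (internal fragmentation)."""
    width = max(len(s.encode("utf-8")) for s in strings)
    return b"".join(s.encode("utf-8").ljust(width, b"\x00") for s in strings), width
```

Layout (b) trades wasted space for random access: string i starts at exactly i * width, with no index needed.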
Opaque-Typed Variables: This is essentially the same situation as for strings.
VLEN-Typed Variables: One approach is to treat each vlen object as a separate file of its own length. Another approach is to use the string approach, because we know the maximum size of all the vlens.
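The second, fixed-size vlen approach can be sketched as follows for vlens of int32. The slot format (a 4-byte count followed by max-size element storage) is an assumption for illustration.

```python
# Sketch of the fixed-size VLEN layout: because the dataset is read-only,
# the maximum element count is known up front, so each vlen occupies a
# max-sized slot with an explicit count.  Slot format is an assumption.
import struct

def pack_vlens_fixed(vlens):
    """Each int32 vlen stored as <count><elements...>, padded to the longest vlen."""
    max_n = max(len(v) for v in vlens)
    slot = struct.Struct(f"<I{max_n}i")
    blob = b"".join(slot.pack(len(v), *(v + [0] * (max_n - len(v))))
                    for v in vlens)
    return blob, max_n
```

As with fixed-size strings, vlen i then starts at a computable offset (i times the slot size), at the cost of internal fragmentation.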
Compound-Typed Variables: Again we have some options: we could store each compound object in field order (as with a C struct), with each field following the next.
Alternatively, we could store in the equivalent of "column order", where all instances of the first field (assuming an array of compounds) are stored one after another, then all instances of the second field, and so on.
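The two compound layouts can be contrasted with a small sketch for an array of two-field records, here assumed to be the equivalent of a C struct { int32 a; float64 b; } (the field types are an illustrative assumption).

```python
# Sketch contrasting the two compound layouts for an array of records
# shaped like struct { int32 a; float64 b; }.  "<" disables C padding,
# so each packed record is exactly 12 bytes.
import struct

def pack_row_order(records):
    """Field order: each record's fields stored together, record after record."""
    return b"".join(struct.pack("<id", a, b) for a, b in records)

def pack_column_order(records):
    """Column order: all 'a' fields first, then all 'b' fields."""
    a_col = b"".join(struct.pack("<i", a) for a, _ in records)
    b_col = b"".join(struct.pack("<d", b) for _, b in records)
    return a_col + b_col
```

Column order favors readers that scan a single field across many records, since each column is one contiguous run of fixed-size values.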
The "process" implied here is as follows.
- The data file is created using the existing read-write model of the netcdf-c library.
- A special program (e.g. nccopy) is used to take the original file as a whole and convert it to the NCX format.
The point is that when the NCX file is created, the whole of the dataset is available. This means, for example, that specialized layouts of variable-length data (strings, vlens, unlimited dimensions) can be achieved because the totality of the data is at hand. If an attempt were made to write the original dataset piecemeal using the NCX format, the whole dataset would not be available, and certain kinds of layout optimization would not be possible.
Use of Docker
I considered using docker (esp. docker commit) as an alternative. This has the advantage that one could even include programs in the `file'. However, security considerations make this approach untenable until docker sandboxing is completely reliable and trusted.