
27 July 2012

When we created the netCDF-4 file format on top of HDF5, we asked the HDF Group to add shared dimensions. They said no, and instead added dimension scales, which at that point were in the HDF4 data model but not in HDF5. In retrospect, I think we should have worked harder to come to a mutual agreement. The lack of shared dimensions in HDF5 means that HDF5 is not a strict superset of netCDF-4.

In this post I'm going to review dimensions and dimension scales. I'll try to convince you that the lack of shared dimensions in HDF5 means that you really should use netCDF-4 instead of HDF5 for earth science data.

In the netCDF data model, a variable is a container for a multidimensional array of data. The shape of the array is defined by the list of dimensions for the variable. A dimension has a name and a length. When more than one variable uses the same dimension, we say that the dimension is shared.

What does it mean for a dimension to be shared? A variable can be thought of as representing a function on a set of points called the domain of the function. A variable's set of dimensions is simply a representation of its domain. In mathematical notation, we write the function as

f : D -> R

meaning that for each point in the domain set, D, the function assigns a value from the range set, R. How does this relate to netCDF variables? Let's use this example:

  dimensions:
    lat = 180;
    lon = 360;
    z = 56;
    time = 365;

  variables:
    float data(time, z, lat, lon);

The data variable is defined for a set of index values defined by the dimensions of the variable. We call this the variable's domain in index space. (Index space is an abstract lattice of points in n-dimensional space, where n is the number of dimensions.) What we are usually more interested in is mapping data to locations on the physical earth. To do so, we use coordinate functions. These are simply netCDF variables that play the role of assigning, for each data point, a location in coordinate space (in this example, physical space and time). In order to do this, these coordinate functions must have the same domain as the data variable, and so must have the same dimensions. So, in our example this might look like:

  dimensions:
    lat = 180;
    lon = 360;
    z = 56;
    time = 365;

  variables:
    float data(time, z, lat, lon);
      data:coordinates = "lat lon z time";
    float lon(lon);
    float lat(lat);
    float z(z);
    int time(time);

Here we use the CF convention's coordinates attribute to assign coordinate variables to the data variable. For some arbitrary point, say time=5, z=12, lat=22, lon=123, the location of data(5,12,22,123) is at time(5), z(12), lat(22), lon(123). It's pretty clear that coordinate functions can only have dimensions that are shared with the variable. For example, suppose you had lat(sample). Since the sample dimension doesn't appear in the data variable, there is no way to assign a unique lat value to a data point. On the other hand, there is no problem with the coordinate functions using only a subset of the dimensions (as in this example), or even being a scalar variable, for example if all the data were at a single time coordinate.
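To make the lookup concrete, here is a minimal sketch in plain Python. It is a toy in-memory model, not the netCDF library API: the `coords` dictionary, the `locate` function, and the coordinate values (evenly spaced latitudes and longitudes, made-up vertical levels) are all assumptions for illustration. It also handles only 1-D and scalar coordinate variables.

```python
# Toy model of shared dimensions: NOT the netCDF API, just plain Python.
# Dimension names and lengths come from the CDL example above; the
# coordinate values are invented for illustration.
dims = {"time": 365, "z": 56, "lat": 180, "lon": 360}

# The data variable's domain is the ordered list of its dimensions.
data_dims = ("time", "z", "lat", "lon")

# Each coordinate variable is (its dimensions, its values).
coords = {
    "time": (("time",), list(range(dims["time"]))),              # day number
    "z":    (("z",),    [10.0 * k for k in range(dims["z"])]),    # made-up levels
    "lat":  (("lat",),  [-89.5 + j for j in range(dims["lat"])]),
    "lon":  (("lon",),  [0.5 + i for i in range(dims["lon"])]),
}


def locate(index, var_dims, coord_vars):
    """Map one index-space point of a variable to coordinate space."""
    position = dict(zip(var_dims, index))  # e.g. {"time": 5, "z": 12, ...}
    location = {}
    for name, (cdims, values) in coord_vars.items():
        # A coordinate function may only use dimensions the variable shares.
        missing = [d for d in cdims if d not in position]
        if missing:
            raise ValueError(f"{name} uses unshared dimension(s) {missing}")
        if not cdims:
            location[name] = values                       # scalar coordinate
        else:
            location[name] = values[position[cdims[0]]]   # 1-D coordinate
    return location


loc = locate((5, 12, 22, 123), data_dims, coords)
print(loc)
```

Replacing the `lat` entry with one declared on a `sample` dimension makes `locate` raise a `ValueError`, which is the lat(sample) problem described above: the index of the data point says nothing about where to look in that coordinate variable.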

To summarize, the essence of shared dimensions is that they indicate that two variables have the same domain, and this is needed to assign coordinates for sampled functions. If you like UML (and who doesn't?) here is the CDM's UML diagram for coordinate systems.

OK, let's get back to the HDF5 data model. HDF5 variables (aka datasets) don't use shared dimensions, but define their shape with a dataspace object, which is defined separately for each variable. So there is no formal way in the HDF5 data model to indicate that two variables share the same domain. As we'll see, dimension scales help some, but not enough.

Each variable in HDF5 defines its shape with a dataspace, which is essentially a list of private dimensions for the variable. A Dimension Scale is a special variable containing a set of references to dimensions in variables. Each referenced variable has a DIMENSION_LIST attribute that contains, for each dimension, a list of references to Dimension Scales. So we have a two-way, many-to-many linking between Dimension Scales and Dimensions:

  DimScale <------> Dimension

The HDF5 Dimension Scale API mostly just maintains this two-way linking, and also allows the links to be named.

So it appears that by using Dimension Scales, we now have shared dimensions in HDF5: namely, all the dimensions that share the same Dimension Scale are ... the same!

Unfortunately, nothing requires the "shared" dimensions to have the same length as the dimension scale, or to have the same length as any of the other dimensions that are associated with the dimension scale, or even requires the dimension scale to have the same rank as an associated dimension. The HDF5 Dimension Scales design doc is quite explicit that any other semantics are not part of the HDF5 data model, and must be added by other layers:

It is important to emphasize that the Dataspace of a Dataset has no intrinsic meaning except to define the layout in computer storage. Dimension Scales may be used to store application specific labels to the positions in the stored data array, i.e., to add application specific meaning to the dimensions of the dataspace. A Dimension Scale is an object associated with one dimension of a Dataspace. The meaning of the association is left to applications. The values of the Dimension Scale are set by the application to reflect semantics of the data, for example, to associate coordinates of a reference system with positions on the dimension.

All we get with dimension scales is a many-to-many association of a variable's private dimension with a specially marked variable called a Dimension Scale. It is up to the user (or a layer like netCDF-4) to add semantics and maintain consistency. This was a deliberate choice by the HDF Group, presumably in the name of generality.
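Here is a rough sketch of that bare association in plain Python. It is a toy model, not the HDF5 API: the `Dataset` and `DimScale` classes and the `attach_scale` helper are hypothetical names I made up, and the `dimension_list` and `reference_list` fields only stand in for the forward and backward references described above.

```python
# Toy model of the two-way, many-to-many linking: NOT the HDF5 API.
class Dataset:
    def __init__(self, name, shape):
        self.name = name
        self.shape = tuple(shape)  # private dataspace, owned by this dataset
        # One list of attached scales per dimension (the DIMENSION_LIST role).
        self.dimension_list = [[] for _ in self.shape]


class DimScale(Dataset):
    def __init__(self, name, shape):
        super().__init__(name, shape)
        # Back references to (dataset, dimension index) pairs.
        self.reference_list = []


def attach_scale(dataset, dim_index, scale):
    """Link one dimension of a dataset to a scale, in both directions."""
    if scale not in dataset.dimension_list[dim_index]:
        dataset.dimension_list[dim_index].append(scale)
        scale.reference_list.append((dataset, dim_index))


lat = DimScale("lat", (180,))
data = Dataset("data", (365, 56, 180, 360))
other = Dataset("other", (90, 10))

attach_scale(data, 2, lat)   # lengths happen to match: 180 and 180
attach_scale(other, 0, lat)  # length 90 vs. 180: this model does not object

for dataset, dim_index in lat.reference_list:
    print(dataset.name, dim_index, dataset.shape[dim_index], lat.shape)
```

Note that `attach_scale` deliberately does no validation. Any consistency between the dimension and the scale has to come from whoever calls it.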

Obviously, other application layers like netCDF-4 can layer shared dimensions on top of HDF5 Dimension Scales. The minimum requirements for shared dimensions are that:

Dimensions are associated with only one Dimension Scale.
A Dimension Scale is one dimensional.
All dimensions have the same length as the shared Dimension Scale.

Those are the things that a program can check for. But the intention of the data writer is crucial, because the real requirement for shared dimensions is that the dimensions represent the domain of the function, and the dimension scale values represent the coordinates for that dimension.
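The three checks listed above could be sketched like this in plain Python. Again this is a toy model and not the netCDF-4 or HDF5 API: datasets and scales are plain dictionaries, and `shared_dimension_problems` is a hypothetical helper. I read "only one" as "at most one", so a dimension with no scale attached is not flagged.

```python
# Toy check of the three minimum requirements: NOT the netCDF-4 or HDF5 API.
# A scale is {"shape": tuple}.
# A dataset is {"shape": tuple, "scales": [list of scale names per dimension]}.

def shared_dimension_problems(datasets, scales):
    """Return a list of human-readable violations (empty if none are found)."""
    problems = []
    for sname, scale in scales.items():
        # A Dimension Scale is one dimensional.
        if len(scale["shape"]) != 1:
            problems.append(f"scale {sname} has rank {len(scale['shape'])}")
    for dname, ds in datasets.items():
        for i, (length, attached) in enumerate(zip(ds["shape"], ds["scales"])):
            # Dimensions are associated with only one Dimension Scale.
            if len(attached) > 1:
                problems.append(f"{dname} dim {i} has {len(attached)} scales")
            for sname in attached:
                shape = scales[sname]["shape"]
                # All dimensions have the same length as the shared scale.
                if len(shape) == 1 and shape[0] != length:
                    problems.append(
                        f"{dname} dim {i} has length {length}, "
                        f"scale {sname} has length {shape[0]}"
                    )
    return problems


scales = {"lat": {"shape": (180,)}, "lon": {"shape": (360,)}}
good = {"data": {"shape": (180, 360), "scales": [["lat"], ["lon"]]}}
bad = {"data": {"shape": (90, 360), "scales": [["lat"], ["lon", "lat"]]}}

print(shared_dimension_problems(good, scales))
for problem in shared_dimension_problems(bad, scales):
    print(problem)
```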

But if the HDF5 data model does not include that meaning, and a data writer can make dimension scales mean anything they want, then in a strict sense, without knowing more, software can't assume shared dimensions, so you can't define CF-style coordinate functions. And that is why you should use the netCDF-4 library: to get that essential functionality.

The HDF5 design does seem to recognize the use case of a 1-D Dimension Scale whose length matches the dimension:

A simple case is where the Dimension Scale s is a (one dimensional) sequence of labels for the dimension ix of Dataset d. In this case, Dimension Scale is an array indexed by the same index as in the dimension of the Dataspace. For example, for the Dimension Scale s, associated with dimension ix, the ith position of ix is associated with the value s[i], so s[i] is taken as a label for ix[i].

In my next blog post I will show how netCDF-4 uses Dimension Scales, and what can be done in a practical sense with HDF5 data files.

Next: HDF5 Dimension Scales - Part 2


Comments:

I'm really glad you're writing these articles that contrast HDF5 and netCDF4. I thought the differences in dimension definitions might be just semantic. But you explain how the definitions lead to practical differences. Good stuff. I need to read it a few more times before it sinks in...

Posted by Charlie Zender on August 10, 2012 at 09:41 AM MDT #