
27 July 2012

Last time we said that in order for an HDF5 Dimension Scale to represent a shared dimension, the following must be true:

Dimensions are associated with only one Dimension Scale.
A Dimension Scale is one dimensional.
All dimensions have the same length as the shared Dimension Scale.

The netCDF-Java and C libraries look for Dimension Scales that satisfy these conditions in any HDF5 file, not just those written with the netCDF-4 library. So applications can create shared dimensions using the HDF5 API directly. That's good news for those who are not allowed to use the netCDF-4 library. (But the application has to be a bit smarter, so you might as well let netCDF-4 do the work for you if possible.)
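To make the three conditions concrete, here is a minimal sketch in plain Python. It does not touch the HDF5 or netCDF APIs; the `Scale` and `Variable` classes and the `shared_dimension_problems` function are hypothetical names invented for illustration, and reading "only one Dimension Scale" as "exactly one per dimension" is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Scale:
    name: str
    shape: tuple  # shape of the Dimension Scale dataset


@dataclass
class Variable:
    name: str
    shape: tuple
    # scales[i] holds the Dimension Scales attached to dimension i
    scales: list = field(default_factory=list)


def shared_dimension_problems(var):
    """List the reasons var's scales cannot be read as shared dimensions."""
    if len(var.scales) != len(var.shape):
        return [f"{var.name}: need one list of scales per dimension"]
    problems = []
    for i, (length, attached) in enumerate(zip(var.shape, var.scales)):
        # Condition 1 (assumed reading): exactly one scale per dimension.
        if len(attached) != 1:
            problems.append(f"{var.name} dim {i}: {len(attached)} scales attached, expected 1")
            continue
        scale = attached[0]
        # Condition 2: the scale itself is one dimensional.
        if len(scale.shape) != 1:
            problems.append(f"{var.name} dim {i}: scale {scale.name} is not one dimensional")
        # Condition 3: the dimension has the same length as its scale.
        elif scale.shape[0] != length:
            problems.append(f"{var.name} dim {i}: length {length} != scale length {scale.shape[0]}")
    return problems


time = Scale("time", (365,))
z = Scale("z", (56,))
data = Variable("data", (365, 56), [[time], [z]])
print(shared_dimension_problems(data))  # prints []
```

An empty list means the variable's dimensions could be treated as shared dimensions under these assumptions; anything else names the condition that failed.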

You might hope that one could also use Dimension Scales to represent coordinate functions. For example, in our previous blog we had this example:

  dimensions:
    lat = 180;
    lon = 360;
    z = 56;
    time = 365;

  variables:
    float data(time, z, lat, lon);
      data:coordinates = "lat lon z time";
    float lon(lon);
    float lat(lat);
    float z(z);
    int time(time);

And the Dimension Scale attribute associated with the data variable looks just like the CF coordinates attribute:

 DIMENSION_LIST = "time", "z", "lat", "lon";

It turns out that the example above is really a special case of separable coordinate functions, where each coordinate variable is one-dimensional and each has a unique dimension associated with it. This is an example of the gridded data type, and is familiar to most of us. Unfortunately it is the only one that many data writers understand.

The previous blog covered the general form of coordinate functions, namely that they must have the same domain as the data variables, meaning they have the same (or a subset of the) dimensions of the data variable. As a canonical example, consider swath data:

 dimensions:
    scan = 3253;
    xscan = 980;
 variables:
    float data(scan, xscan);
      data:coordinates = "lat lon alt time";
    float lon(scan, xscan);
    float lat(scan, xscan);
    float alt(scan, xscan);
    int time(scan);

Here we have 2D coordinate functions lon, lat and alt, which are not associated with just one dimension. The problem with HDF5 Dimension Scales is that they are associated with a single dimension of the variable, rather than the set of dimensions for the variable, i.e. the variable's domain. This is a fatal flaw if you are trying to use Dimension Scales to represent coordinate functions in a general way. You are going to need another layer to specify and implement the correct semantics for coordinate functions.

The netCDF-4 data model alone doesn't have this either, but if you add the CF coordinates attribute convention on top of netCDF, that is sufficient. Note that it's so simple even rocket scientists can do it:

 data:coordinates = "lat lon alt time";

Of course you can and should (are you listening satellite data providers?) use CF Conventions with HDF5, as long as you also implement shared dimensions (using Dimension Scales as described above). At this point the distinction between the netCDF-4 and HDF5 data files all but disappears, and finally, the taxpayers are getting their money's worth, and the heroic programmers sip well deserved Mai Tais on the tropical beach.

As long as we are living in alternate realities, what else might we ask for?

The HDF5 library should provide shared dimensions. Using Dimension Scales as above would work fine. It's just a matter of saying that in an HDF5 file, if one uses unique, one dimensional Dimension Scales, then these represent shared dimensions. Such is the power of words that come from the right mouth.
The HDF5 and netCDF-4 libraries should add generalized coordinate systems to their data model. The CF coordinates attribute would work fine. This would just be a matter of saying that coordinates is a reserved attribute, describing its meaning, and adding diagnostics to ensure that the dimensions of the coordinate functions satisfy the constraints as defined above.
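As a rough idea of what such a diagnostic might look like, the sketch below checks that every variable named in a coordinates attribute uses only dimensions from the data variable's domain. It is plain Python over a dictionary, not a call into any real library; `check_coordinates` and the dictionary layout are assumptions made up for this example, and the `swath` contents mirror the swath example above.

```python
def check_coordinates(variables, data_name):
    """Return error strings for coordinates outside the data variable's domain.

    variables maps a variable name to {"dims": tuple of names, "attrs": dict}.
    """
    data = variables[data_name]
    domain = set(data["dims"])
    errors = []
    for coord_name in data["attrs"].get("coordinates", "").split():
        coord = variables.get(coord_name)
        if coord is None:
            errors.append(f"{coord_name}: no such variable")
            continue
        # A coordinate function may use the same dimensions as the data
        # variable, or a subset of them, but nothing else.
        extra = [d for d in coord["dims"] if d not in domain]
        if extra:
            errors.append(f"{coord_name}: dimensions {extra} are not in the domain of {data_name}")
    return errors


swath = {
    "data": {"dims": ("scan", "xscan"), "attrs": {"coordinates": "lat lon alt time"}},
    "lon": {"dims": ("scan", "xscan"), "attrs": {}},
    "lat": {"dims": ("scan", "xscan"), "attrs": {}},
    "alt": {"dims": ("scan", "xscan"), "attrs": {}},
    "time": {"dims": ("scan",), "attrs": {}},
}
print(check_coordinates(swath, "data"))  # prints []
```

Note that the check works on the whole set of dimensions at once, which is exactly what a per-dimension Dimension Scale cannot express.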

The HDF5 Dimension Scales design doc brings up some use cases which aren't covered in the netCDF / CF coordinate data model. The first is that coordinates are often easily calculated, so it would be nice if they could be represented as an algorithm, instead of as a sampled function. Coordinates are often regularly spaced, so even a function as simple as a starting value and increment would very often be useful.
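The start-plus-increment case is small enough to sketch directly. The class below is a hypothetical illustration in plain Python, not part of HDF5, netCDF or the CDM; the name `RegularCoordinate` and the half-degree cell-center values are assumptions chosen to match the 360-point lon dimension from the gridded example.

```python
class RegularCoordinate:
    """A coordinate computed on demand as start + i * increment."""

    def __init__(self, start, increment, n):
        self.start = start
        self.increment = increment
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, i):
        if not 0 <= i < self.n:
            raise IndexError(i)
        return self.start + i * self.increment


# Three numbers stand in for 360 stored values.
lon = RegularCoordinate(start=-179.5, increment=1.0, n=360)
print(lon[0], lon[359])  # prints -179.5 179.5
```

The point of the design is that nothing is stored beyond the three parameters, yet the object can be indexed like a sampled coordinate variable.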

The other use case is when the stored coordinates are sampled differently from the data values. The main example is 2D satellite data, where the lat/lon points are stored only at every nth data point, presumably to save storage. The idea is to interpolate to the other points. The HDF-EOS library has this functionality. (Note that this can also be modeled as an algorithm.)
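Here is a minimal sketch of that idea along a single dimension, using simple linear interpolation. This is not the HDF-EOS algorithm, which I have not reproduced here; `interpolate_subsampled` is a hypothetical helper, and real lat/lon data would need extra care near the date line and the poles.

```python
def interpolate_subsampled(stored, step, n):
    """Expand coordinates stored at every step-th data point to n points.

    stored[k] is the coordinate value at data index k * step; values in
    between are filled in by linear interpolation.
    """
    if (len(stored) - 1) * step < n - 1:
        raise ValueError("stored coordinates do not span the requested data points")
    out = []
    for i in range(n):
        k, r = divmod(i, step)
        if r == 0:
            # Data point coincides with a stored coordinate.
            out.append(stored[k])
        else:
            frac = r / step
            out.append(stored[k] + frac * (stored[k + 1] - stored[k]))
    return out


# Coordinates stored at data indices 0, 4 and 8, expanded to 9 data points.
print(interpolate_subsampled([10.0, 12.0, 16.0], step=4, n=9))
```

The function refuses to extrapolate: if the stored points do not reach the last data point, it raises an error rather than guessing.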

We are going to try adding this functionality to the CDM library just as soon as I finish this Pina Colada and get back from scuba diving. Stay tuned.

Next: HDF5 Dimension Scales - Part 3
