Re: Preliminary HDF5 Dimension documents

Quincey Koziol wrote:

Hi John,

Hi Quincey, some thoughts on your proposal:

1. A few notes on naming differences between the netCDF and HDF5 data models:
A netCDF *Variable* is a multidimensional array of primitive values, roughly corresponding to an HDF5 *Dataset*.
   Yup.

A netCDF *Dimension* is a named array index. They are globally scoped, so can be shared. A Variable specifies its dimensionality by referencing a set of Dimensions; this set corresponds to an HDF5 *Dataspace*. There is no exact equivalence to a Dimension as I understand it. The fact that Variables can share Dimensions adds an important meaning to netCDF files.
   This document introduces dimensions as an optional method of composing
a dataspace in HDF5, so they ought to be completely analogous to netCDF
dimensions.

sorry, I didn't realize you were defining dimensions separately from dimension scales. that's very good, from my POV.

   One possible difference is that I wasn't planning on naming the dimensions
within a dataspace.  They were just going to be indexed by their rank within
the dataspace (i.e. the 0th dimension, the 1st dimension, etc).  This could
reference a named dimension through an indirect dimension (see the shareability
document), but the actual dimensions in the dataspace weren't planned on having
names associated with them.

only shared dimensions need be named.

   Do you think this is an important requirement?  Does the netCDF API
require that the dimensions in a dataspace for a dataset have names, or
will having shared dimensions using the names of dimension objects in the
grouping hierarchy be sufficient?

netcdf only has shared dimensions, so they are always named.


A netCDF *Coordinate Variable* is a 1D Variable whose name matches its dimension's name, and whose values are monotonic. This corresponds to your proposed *Dimension Scale*. Note that a netCDF Dimension describes array indices, whereas a Coordinate Variable / Dimension Scale describes the coordinate values assigned to each index of the corresponding Dimension.
   Yes, I designed the new HDF5 Dimension Scale model to be compatible
with netCDF Coordinate Variables (ideally, Dimension Scales will be a superset of Coordinate Variables). I'm still not totally pleased with the term "scale"
and somewhat lean toward using netCDF's "coordinates" term since that more
accurately describes their true meaning, but since HDF4 used "scale", I may end
up sticking with the term... :-/
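
The Coordinate Variable definition above (1D, same name as its dimension, monotonic values) can be sketched as a small check. This is only an illustration: the `Variable` class and both function names are hypothetical and belong to neither the netCDF nor the HDF5 API, and "monotonic" is assumed here to mean strictly increasing or strictly decreasing.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass
class Variable:
    """Hypothetical stand-in for a netCDF Variable (not a real library class)."""
    name: str
    dims: Tuple[str, ...]      # names of the shared Dimensions, in order
    values: Sequence[float]    # data values (flattened)


def is_monotonic(values: Sequence[float]) -> bool:
    """True if values are strictly increasing or strictly decreasing (assumed meaning)."""
    pairs = list(zip(values, values[1:]))
    return all(a < b for a, b in pairs) or all(a > b for a, b in pairs)


def is_coordinate_variable(var: Variable) -> bool:
    """1D, named after its own dimension, and monotonic."""
    return (
        len(var.dims) == 1
        and var.name == var.dims[0]
        and is_monotonic(var.values)
    )


# "lat" is named after its dimension and its values increase steadily.
lat = Variable("lat", ("lat",), [-90.0, 0.0, 90.0])
# "temp" uses the "lat" dimension but is an ordinary data variable.
temp = Variable("temp", ("lat",), [280.0, 300.0, 275.0])
```

Here `lat` qualifies as a coordinate variable, while `temp` does not because its name differs from its dimension's name.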

2. So, generally I like your Dimension Scale proposal. The main things we need are 1) shared Dimensions even when there's not a coordinate variable (perhaps a Dimension Scale without the values?),
   Actually, the HDF5 Dimensions will be able to be shared by different
dataspaces without involving any Dimension Scales.

good


2) each Dimension Scale must have a name;
   Yes, that's the primary method of indexing them from a dimension.  I
imagine we may have an API function to get the n'th scale, but that's not
a requirement at this point.

good


and 3) a Variable/Dataset can specify its dimensionality/Dataspace by listing the Dimensions (or their names).
   I'm planning on adding API functions for "composing" a dataspace from
dimensions and then that "composed" dataspace could be used to create datasets.

good
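
A rough model of the three points above (dimensions shared without any scale, names only where sharing needs them, and a dataspace "composed" from dimensions) might look like the following. None of this is the HDF5 API, which was still being designed in this thread; `Dimension`, `Dataspace`, `compose_dataspace` and `shared_dimensions` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Dimension:
    """Hypothetical model of a dimension: a length plus an optional name."""
    length: int
    name: Optional[str] = None   # only shared dimensions need a name


@dataclass(frozen=True)
class Dataspace:
    """Hypothetical model of a dataspace composed from dimensions."""
    dims: Tuple[Dimension, ...]

    @property
    def shape(self) -> Tuple[int, ...]:
        # Dimensions are addressed by rank: dims[0], dims[1], ...
        return tuple(d.length for d in self.dims)


def compose_dataspace(*dims: Dimension) -> Dataspace:
    """Build a dataspace by listing its dimensions in rank order."""
    return Dataspace(tuple(dims))


def shared_dimensions(a: Dataspace, b: Dataspace) -> List[Dimension]:
    """Dimensions that are the very same object in both dataspaces."""
    return [d for d in a.dims if any(d is e for e in b.dims)]


time = Dimension(12, "time")   # named, shared
lat = Dimension(180, "lat")    # named, shared
levels = Dimension(4)          # unnamed, private to one dataspace

surface_space = compose_dataspace(time, lat)
profile_space = compose_dataspace(time, lat, levels)
```

In this sketch `shared_dimensions(surface_space, profile_space)` returns the `time` and `lat` objects, with no scale values involved anywhere.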


3. While 1D Coordinate Variables / Dimension Scales are the common case, there are also datasets that need different kinds of coordinate systems, including multidimensional coordinate variables. I am eager for netCDF / HDF5 to support these, but I think they can be built on top of the current functionality, so we can leave them out of this discussion to keep things from getting too complicated. (For more details on those ideas, see chapter 3.1 of the java-netcdf user manual.)
   As I mentioned to Russ and Ed last week, I think that having support for
coordinate systems (I was calling them "multi-dimensional scales" at the time)
is an important feature to include.  I've printed the java-netcdf user
manual and will be using it for reference during further iterations on the
HDF5 dimension scale design to try to include this concept.  I imagine that I'll
associate them with the dataspace directly instead of hanging them off the
dimensions (since the dataspace can be multi-dimensional and the dimensions are
1-D by definition).

   Also, I was considering cutting the ability of dimensions to have multiple
scales associated with them (to simplify things), but glancing through the
java-netcdf information, it looks like that may be an important feature.
What's your opinion about how critical that is and how often it is used?

   Quincey

I think there are 2 interesting examples if you try to handle coordinate systems in a general way:

1. float lat(x,y) and float lon(x,y) assign latitude and longitude coordinates to points on a projection plane. this is the "multidimensional" case.

2. lat(sample), lon(sample), altitude(sample) might be a coordinate system for variable O3(sample). this is the "1D trajectory" case.

So, what I came up with is that a coordinate system for a variable/dataset is a collection of "coordinate axes" which can have any dimensionality, but whose dimensions must all appear in the set of dimensions used by the variable/dataset. Adding this info to the dataspace is exactly right.
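
The rule in the paragraph above can be written as a subset check. A minimal sketch, assuming dimensions are identified by name; the function name and the data variable `T(time, x, y)` are hypothetical, while the axes come from the two examples listed earlier.

```python
from typing import Dict, Tuple

Dims = Tuple[str, ...]


def is_valid_coordinate_system(var_dims: Dims, axes: Dict[str, Dims]) -> bool:
    """Every dimension used by any coordinate axis must be one of the variable's dimensions."""
    allowed = set(var_dims)
    return all(set(axis_dims) <= allowed for axis_dims in axes.values())


# "multidimensional" case: lat(x,y) and lon(x,y) on a projection plane
projection_axes = {"lat": ("x", "y"), "lon": ("x", "y")}

# "1D trajectory" case: lat(sample), lon(sample), altitude(sample) for O3(sample)
trajectory_axes = {
    "lat": ("sample",),
    "lon": ("sample",),
    "altitude": ("sample",),
}

t_dims = ("time", "x", "y")   # hypothetical variable T(time, x, y)
o3_dims = ("sample",)         # O3(sample) from the trajectory example

projection_ok = is_valid_coordinate_system(t_dims, projection_axes)
trajectory_ok = is_valid_coordinate_system(o3_dims, trajectory_axes)
mismatch_ok = is_valid_coordinate_system(o3_dims, projection_axes)
```

The first two checks pass; the last one fails because `x` and `y` are not dimensions of `O3(sample)`.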

Because the common case is that all or most of the variables/datasets in a file use the same coordinate system, it's nice to factor this information out. So if the dataspace can be shared and the coordinate system can be associated with the dataspace, that would be party time most excellent.

BTW, a mathematical formulation behind this (a little out of date but useful if you like formalisms) is at
   http://www.unidata.ucar.edu/staff/caron/papers/CoordMath.htm

there's still one piece that you *might* want to tackle. the above is a framework for general coordinate systems. our users generally want georeferencing coordinate systems. this involves identifying which of the coordinate axes correspond to the x, y, z, and t coordinates. this can be a big can of worms; e.g. if you've ever looked at GIS specs, they are complex. We have developed a set of very simple specs that so far have satisfied most of our datasets, using "attribute conventions" outside any explicit library support. I can understand if you don't want to add any more complications. However I will say that IMHO getting georeferencing coordinate systems clearly specified (i.e. not having to use attribute Conventions) would be a huge win for our communities, and one that's really doable.