Re: coordinate systems in netcdf (again)

Russ Rew (russ@buddy.unidata.ucar.edu)
Sun, 06 Jul 1997 22:59:50 -0600

Stephen and Jason,

> Thanks for your comments on our proposal. Your message appears to
> contain two main themes - bounding coordinates, and dimensional
> attributes. Our comments on each of these are as follows:
>  
> Bounding coordinates:
>  
> You give the example of layers in the atmosphere, and the need to
> store coordinates for the top and bottom of these layers. Along
> the same lines, more generally, Gregory, Drach, and Tett say
> in section 21 of their proposal
>  
> > NEW: Along a dimension, the values might relate to points (at the coordinate
> > values) or to contiguous or non-contiguous cells. The boundaries of the
> > cells should be defined as well as the cell coordinate values. The
> > convention is to define an additional two-dimensional ``boundary coordinate
> > variable'' with a left-hand dimension (trailing dimension in Fortran terms)
> > of size two.

What I meant to illustrate with my layer example, although it didn't
come across very well, was that coordinates may be useful even if they
are more general than monotonic values along an axis.  In particular,
atmospheric layers can (and do) overlap, and overlapping layers cannot
be ordered, in general.  I did not mean to require matching boundaries
on multiple layers.  I was just looking for a real example of a
coordinate-like variable that required multiple component values.

> Their proposal and your example both only deal with the 1-dimensional
> case. In two dimensions, a 'cell' will be defined by 4 points, and
> in the general curvilinear case, each such point is specified by 2 coordinate
> variable values (x and y, for example). In 3 dimensions, 8 points are needed
> (defining the corners of a 'cube' for the want of a better word), although
> particular cases (such as some model grids) might allow you to simplify this
> (4 (x,y) points and 2 values of z, for example). In general, the specification
> of the bounds of a 'cell' becomes quite messy for higher dimensions, and we
> don't (yet) have a good, general proposal for addressing this problem. Our
> model output files actually do store this information - we have 4 distinct
> horizontal grids stored in the files, representing cell centres, cell corners,
> and the centres of two adjacent faces, but at the moment, the intelligence
> to interpret these is hard wired into our processing software.

I was trying to present a plausible example in which 2 values are
required to specify a single coordinate value (a layer).  A more general
higher-dimensional analogue of overlapping layers would be
higher-dimensional connected sets, not necessarily rectangular cells;
even in only two dimensions, it might require an infinite number of
values to represent such a general set.  And maybe this extrapolation to
higher dimensions shows that I have an invalid generalization for
coordinates, but the layer example has proved useful for representation
with a single netCDF dimension.  And I think it would also be useful to
have a single netCDF dimension for representing geographic/political
regions, such as states, countries, or provinces.  Certainly one could
think of representing climatology data by region using one netCDF
dimension for "region", where each region had a variable-length name
stored in a corresponding "region" variable.  I'm hoping such a "region"
variable would qualify as a coordinate variable under the right
generalization of the concept of coordinate.

> Dimension attributes:
>  
> A possible drawback of our proposal is the need to maintain "coordinates"
> attribute strings for each data variable, even when several data variables
> have the same set of coordinate variables associated with them. Your proposal
> is to eliminate this possible duplication by using global attributes having
> dimension names, which list coordinate variables. As well as the drawbacks
> you mention, we see several other problems with this approach:
>  
> Firstly, almost any variable might be considered to be a coordinate variable.
> For example, given a file fragment as follows:
>  
>   dimensions:
>       d1 = ...;
>       d2 = ...;
>       d3 = ...;
>  
>   variables:
>       data1(d1,d2,d3);
>       data2(d1,d2,d3);
>  
>       coord1(d1);
>       coord2(d2,d3);
>       coord3(d2,d3);
>       coord4(d2,d3);
>       coord5(d2,d3);
>  
> Your scheme would have global dimension attributes as follows:
>  
>      :d1 = "coord1";
>      :d2 = "coord2 coord3 coord4 coord5";
>      :d3 = "coord2 coord3 coord4 coord5";

Evidently, I wasn't very clear in describing my scheme.  It was not my
intent to list all coordinate variables that use a dimension as
coordinate variables for that dimension, but just to list a set of
variables whose values uniquely determine an index for that dimension.
This is analogous to a multi-field key in a relational database
relation.  There may be several candidate sets of fields that might
serve as a key for a relation, but only one set of fields is declared to
be *the key* for the relation.  In your example above, if knowing the
values of coord2 and coord3 were enough to determine the corresponding d2
index (by the intended meaning of d2, coord2 and coord3, not just the
values in a particular dataset), then it would be sufficient to declare

      :d2 = "coord2 coord3";
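
A minimal sketch of this key idea in Python, using plain nested lists
rather than any netCDF library. The function name build_key_index and
the sample values are hypothetical; coord2 and coord3 are assumed to be
two-dimensional over (d2, d3), as in the fragment quoted above.

```python
def build_key_index(coord2, coord3):
    """Map each (coord2, coord3) value pair to its d2 index.

    coord2 and coord3 are nested lists shaped (d2, d3). Raises ValueError
    if the same pair of values occurs for two different d2 indices, in
    which case the pair cannot serve as a key for d2.
    """
    index_of = {}
    for i, (row2, row3) in enumerate(zip(coord2, coord3)):
        for pair in zip(row2, row3):
            # setdefault stores i for a new pair, else returns the old index.
            previous = index_of.setdefault(pair, i)
            if previous != i:
                raise ValueError(
                    f"{pair} occurs for d2 indices {previous} and {i}"
                )
    return index_of


# Made-up values: d2 has size 2, d3 has size 3.
coord2 = [[10.0, 10.5, 11.0], [20.0, 20.5, 21.0]]
coord3 = [[1.0, 2.0, 3.0], [1.0, 2.0, 3.0]]
key_index = build_key_index(coord2, coord3)
```

Passing such a check for one dataset is only a necessary condition. The
declaration is a statement about the intended meaning of the variables,
which no inspection of particular values can establish.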

> Incidentally, this seems to me to be perfectly valid, but it violates
> your requirement that:
>  
> >  No two tuples of coordinate variable values are the same for distinct
> >  values of the dimension.

I meant the *values* of the coordinate variables had to uniquely
determine the dimension index, but again I wasn't being precise enough.
But I think it is important to capture this property of a coordinate,
that the values of the coordinate uniquely determine the index of the
corresponding dimension.  This is automatically true for a
one-dimensional coordinate variable with monotonically increasing or
decreasing values, but even if the coordinate variable values could not
be ordered, I think you would agree that you would want them to be
unique.  For example, in using countries as a dimension, the character
coordinate variable had better not name the same country twice.  I was
trying to say that for coordinate variables with multiple components,
you never want the same tuple of component values to occur more than
once, since they must uniquely determine a dimension index.
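
The uniqueness requirement can be sketched as a small check, again in
plain Python. The helper name repeated_coordinates and all of the data
values are made up for illustration; the layers are given as (bottom,
top) pressures and deliberately overlap, so they cannot be ordered, yet
no pair is repeated.

```python
from collections import Counter


def repeated_coordinates(*components):
    """Return the coordinate tuples that occur more than once.

    Each argument is one component of the coordinate, given as a list
    along the dimension; a one-component coordinate is a single list.
    """
    counts = Counter(zip(*components))
    return [value for value, n in counts.items() if n > 1]


# Hypothetical overlapping layers, as (bottom, top) pressures in hPa.
layer_bottom = [1000.0, 1000.0, 850.0]
layer_top = [850.0, 500.0, 300.0]
print(repeated_coordinates(layer_bottom, layer_top))  # prints []

# A single-component character coordinate that names a country twice.
country = ["Australia", "Canada", "Australia"]
print(repeated_coordinates(country))  # prints [('Australia',)]
```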

> However the main point is that in some circumstances one may wish to consider
> data2 as a coordinate variable for data1, or vice versa. In that case, the
> global dimension attributes become:
>  
>     :d1 = "data1 data2 coord1";
>     :d2 = "data1 data2 coord2 coord3 coord4 coord5";
>     :d3 = "data1 data2 coord2 coord3 coord4 coord5";
>  
> This initially looks fine, but in fact it adds absolutely no information
> to the file, as all it does is explicitly state the dimensional dependence
> of each variable in the file - something that can already be found out by
> (perhaps somewhat tedious) inspection of each variable.

Exactly, and I agree that this would not be a desirable convention.
It's not the one I meant to propose.

> So, if we do allow data variables to be coordinates for other data variables,
> then your dimension attribute proposal adds no information at all. If we don't
> allow this, then all it really does is to identify the set of variables which
> we do consider to be coordinate variables. This could be done more clearly
> by having a single global attribute, as follows:
>  
>     :coordinate_variables = "coord1 coord2 coord3 coord4 coord5";
>  
> This is quite like our original proposal, but avoids the problem of having
> to maintain coordinate attributes for each data variable. Bindings between
> coordinate variables and data variables must then be worked out on the
> basis of which dimensions they have in common (keeping in mind that the
> dimensions of a coordinate variable must always be a subset of the dimensions
> of the associated data variable).  If this was adopted, someone should write
> and disseminate a subroutine or set of routines which identify these bindings!

I don't think these bindings can be discovered merely by examining
which dimensions data variables have in common, because I think they are
providing information about the intended meaning of the data that is not
in the declarations of which variables depend on which dimensions.
Going back to the layer example, there may be many variables that depend
on a layer dimension; stating that a combination of exactly two of these
variables uniquely determines the layer index is a way of capturing the
meaning in the data that can be used by applications.
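
To illustrate the difference, here is a rough sketch in Python. The
variable names, the candidate_coordinates helper, and the declared_key
mapping are all hypothetical. Matching on shared dimensions returns
every variable on the layer dimension, while the declared key names
only the two that identify a layer.

```python
def candidate_coordinates(variables, data_name):
    """Variables whose dimensions are a subset of the data variable's."""
    data_dims = set(variables[data_name])
    return [
        name
        for name, dims in variables.items()
        if name != data_name and set(dims) <= data_dims
    ]


# Hypothetical file contents: variable name -> dimension names.
variables = {
    "temperature": ("time", "layer"),
    "layer_bottom": ("layer",),
    "layer_top": ("layer",),
    "layer_mean_density": ("layer",),
    "layer_cloud_fraction": ("layer",),
}

# What a dataset author could declare; this cannot be recovered from
# the dimension lists alone.
declared_key = {"layer": ("layer_bottom", "layer_top")}

candidates = candidate_coordinates(variables, "temperature")
```

Here candidates holds all four layer variables, and nothing in the
dimension lists singles out layer_bottom and layer_top.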

> The main limitation of the above is that it allows less flexibility in
> the association of data variables and coordinate variables. Using our
> original proposal, for example, we could write:
>  
>     data1:coordinates = "coord1 coord2 coord3";
>     data2:coordinates = "coord1 coord2 coord3 coord4 coord5";
>  
> signifying that coord4 and coord5 were appropriate coordinates for data2,
> but not data1. The global attribute approach doesn't allow this, but that
> may not be a big sacrifice in most applications. If great flexibility really
> is required, perhaps a nested approach could be used, where the "coordinates"
> attribute for a variable is used only if it is present, and otherwise a
> global "coordinate_variables" attribute is used.

This is a good point.  But the above doesn't tell me whether a tuple of
values (coord1, coord2, coord3) really represents a single conceptual
coordinate for data2, or whether it is really a 5-dimensional variable
with no relation among its coordinates.  Perhaps this is asking too much
of conventions, but I'm hoping that if the author of a dataset knows
some relation among coordinates and variables that must be true, that
relation can somehow be represented in declarations.
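
For reference, the nested lookup suggested in the quoted text could be
sketched as follows in Python. The function name coordinates_for and
the dictionary layout are assumptions made for illustration; the
attribute names and the example variables are taken from the quoted
proposal. As noted above, the result lists which coordinate variables
apply, not how they group into conceptual coordinates.

```python
def coordinates_for(name, dims, var_attrs, global_attrs):
    """Return the coordinate variable names that apply to one variable.

    A per-variable "coordinates" attribute wins if present. Otherwise
    fall back to the global "coordinate_variables" list, keeping only
    entries whose dimensions are a subset of the variable's dimensions.
    """
    own = var_attrs.get(name, {}).get("coordinates")
    if own is not None:
        return own.split()
    listed = global_attrs.get("coordinate_variables", "").split()
    data_dims = set(dims[name])
    return [c for c in listed if set(dims[c]) <= data_dims]


# Dimensions as in the quoted file fragment.
dims = {
    "data1": ("d1", "d2", "d3"),
    "data2": ("d1", "d2", "d3"),
    "coord1": ("d1",),
    "coord2": ("d2", "d3"),
    "coord3": ("d2", "d3"),
    "coord4": ("d2", "d3"),
    "coord5": ("d2", "d3"),
}
var_attrs = {"data1": {"coordinates": "coord1 coord2 coord3"}}
global_attrs = {
    "coordinate_variables": "coord1 coord2 coord3 coord4 coord5"
}

data1_coords = coordinates_for("data1", dims, var_attrs, global_attrs)
data2_coords = coordinates_for("data2", dims, var_attrs, global_attrs)
```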

> Thanks again for your comments, and we would welcome more on the material
> above. We have also copied this message to Jonathan Gregory, and have also
> had some correspondence with him on other aspects of our proposal. We have
> not sent this message to the netCDF group, due to its length, but feel free
> to forward it if you think that is appropriate.

I agree that the netcdfgroup as a whole doesn't seem very interested in
these coordinate conventions, but I think your proposal and comments are
valuable, so I'm adding your posting and this reply to the coordinate
conventions archive.  I've also added John Caron to the CC: list, since
I think he's interested and I value his insights on the subject.  If
anyone in this subgroup feels I shouldn't include their postings in the
archive for anyone else to read, please let me know.

> Regards,
> Stephen Walker
> Jason Waring

Thanks again for your comments.

--Russ

_____________________________________________________________________

Russ Rew                                         UCAR Unidata Program
russ@unidata.ucar.edu                     http://www.unidata.ucar.edu