Below is an earlier conversation with Jonathan Gregory that would be good to
-------- Original Message --------
Subject: Re: [CF-metadata] Seeking example program for storing surface obs in
Date: Thu, 09 Aug 2007 11:11:06 -0600
From: John Caron <caron@xxxxxxxxxxxxxxxx>
To: Jonathan Gregory <j.m.gregory@xxxxxxxxxxxxx>
Thanks for taking the time to look at this. Comments are inline.
Jonathan Gregory wrote:
My own opinion is that CF is not currently adequate for writing observational
data to NetCDF. The basic limitation in section 5.4 is that it
requires the same number and values of the time and pressure coordinates at each station.
Yes, this is wasteful of space if you make all the stations share the
coordinate variables but they don't all have info at all (time,pressure)
points. Alternatively you have to create separate coordinate variables for
each station, which may be inconvenient.
If we put them in common variables, if I have understood your proposal, I
prefer the contiguous arrangement, something like this:
where the individual stations are contiguous in the humidity and temperature
variables. Then the question is how to indicate the range of records which
belongs to each station. One way, as in your example, is to provide an array
of start or end pointers into the records. Another way, which takes up a bit
more space but could be more convenient for using the data, would be to include
where the presence of the coordinate_index attribute indicates that the value
of whichstation is an index into the station coordinate dimension. whichstation
could be identified as an auxiliary coordinate variable by naming it in the
E.g. if you have two timeseries, one with temperature data (1.1, 1.2, 1.3) and
the other with data (2.1, 2.2), you would have:
temperature=1.1, 1.2, 1.3, 2.1, 2.2;
whichstation=0, 0, 0, 1, 1;
If it is done this way, rather than with start pointers, the individual
timeseries actually do not have to be stored contiguously, so any of them can
be appended to at any time. That might be a useful feature.
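
A minimal Python sketch of the index approach, using the two timeseries from the example above. The function names are illustrative only and are not part of CF or any library; plain lists stand in for the NetCDF variables.

```python
# Flat record arrays, as in the example above.
temperature = [1.1, 1.2, 1.3, 2.1, 2.2]
whichstation = [0, 0, 0, 1, 1]


def records_for_station(values, station_index, station):
    """Return the values whose station index equals `station` (full scan)."""
    return [v for v, s in zip(values, station_index) if s == station]


def append_record(values, station_index, station, value):
    """Append one record for any station; no contiguity is required."""
    values.append(value)
    station_index.append(station)


# Append a late observation for station 0 after station 1's records.
append_record(temperature, whichstation, 0, 1.4)

print(records_for_station(temperature, whichstation, 0))
print(records_for_station(temperature, whichstation, 1))
```

Because each record carries its own station index, the appended value is found even though it is not stored next to the other station 0 records.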
Yes, I think it's a good alternative to just have each record refer to its owning
station, and not have to maintain the links. The parent/child linked (and
contiguous array) variant is useful to make finding the data fast; otherwise you
have to read through all of the data when you want to find data for one (or a
small subset) of stations.
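
For comparison, a rough sketch of the contiguous variant with start pointers, where one station's records are read as a single slice with no scan. The name station_start is hypothetical, and this assumes each station's records are stored contiguously in station order.

```python
# Records stored contiguously, station by station.
temperature = [1.1, 1.2, 1.3, 2.1, 2.2]
# Hypothetical pointer array: index of the first record of each station.
station_start = [0, 3]


def station_slice(values, starts, station):
    """Return one station's records using start pointers (no full scan)."""
    begin = starts[station]
    # The next station's start marks the end; the last station runs to the end.
    end = starts[station + 1] if station + 1 < len(starts) else len(values)
    return values[begin:end]


print(station_slice(temperature, station_start, 0))
print(station_slice(temperature, station_start, 1))
```

The trade-off is that appending to any station other than the last one would require moving records and updating the pointers.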
The reference to the station could be either by index or by name; in our typical
files of this type, a few bytes won't matter much.
Your proposal appears to me to introduce several extra features which are
redundant or duplicating other CF attributes. The _CoordinateAxisType attr
has the same function as the CF axis attribute. I don't see the need for the
global attributes latitude_coordinate etc. since the lat etc. coordinates can
be identified by units and by standard_name; also, having a *global* attr
restricts the file to having only *one* coord variable of each type. The
attributes giving the max and min of each of the coordinates contain info
which can be deduced from the coord variables themselves, of course; is that
an important kind of discovery metadata? I'd be worried about it because it
is almost certain to be wrong some of the time i.e. inconsistent with the
coord variables. The cdm_datatype attribute implies a distinction between
various kinds of data which are formally not really different and would be
processed in the same way, so I don't see why this is useful.
The Convention wasn't intended to be a proposal for CF, just a stand-alone
Convention for this type of data, so we were making it rather broad to
cover several existing data formats. So there is likely to be some redundancy
and I guess the next step is to decide which parts should be added to CF.
The _CoordinateAxisType enumeration is intended to be a complete listing of
georeferencing axis types. We use them instead of parsing the units, looking
for "positive", looking for standard names, and the other ways of identifying
coordinate axes that have evolved out of COARDS/CF. They are for sure redundant
to all of that.
The min/max values are a kind of discovery metadata. We also use them to tell
the user what are the possible valid space/time queries on this dataset. Again,
this is an optimization for reading/serving data that obviates having to read
through the entire file.
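
A small sketch of both sides of this point: stored min/max values let a reader reject a query without scanning the data, but they can drift out of step with the coordinate variable, so a consistency check is cheap insurance when writing. The attribute names lat_min and lat_max are hypothetical placeholders, not names from the Convention.

```python
def coordinate_range(values):
    """Compute the actual (min, max) of a coordinate variable."""
    return min(values), max(values)


def range_is_consistent(stored_min, stored_max, values):
    """Check stored min/max metadata against the coordinate values."""
    return (stored_min, stored_max) == coordinate_range(values)


def query_may_match(stored_min, stored_max, query_min, query_max):
    """True if the query interval overlaps the stored range (no data read)."""
    return query_min <= stored_max and query_max >= stored_min


latitude = [40.0, 35.5, 51.2]
stored = {"lat_min": 35.5, "lat_max": 51.2}  # hypothetical attribute names

print(range_is_consistent(stored["lat_min"], stored["lat_max"], latitude))
print(query_may_match(stored["lat_min"], stored["lat_max"], 60.0, 70.0))
```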
The cdm_datatype reflects our experience in how to describe kinds of data
("scientific data types"). This has been a long and ongoing evolution of our
understanding. For example the coordinate system for a "time series of point
data" looks just like "trajectory" data, so we use the cdm_datatype to
disambiguate. It essentially describes the connectivity of the points. It's
needed by visualizers, and useful for discovery.
Our "Observation Convention" introduces the notion of grouping variables into
"Structures" by specifying that all variables with a common outer dimension
are part of the structure. This works especially well for the record dimension,
where the variables really are a Structure (that is, all record variables are
stored contiguously for record 0, then record 1, etc). It's also useful for
non-record dimensions, e.g. all variables whose outer dimension is "station"
comprise the "Station Structure".
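
The grouping rule can be sketched in a few lines of Python. The variable and dimension names below are made up for illustration, and a plain dict of dimension tuples stands in for the file's variable definitions.

```python
from collections import defaultdict

# Hypothetical variables mapped to their dimensions, outermost first.
variables = {
    "temperature": ("record",),
    "humidity": ("record",),
    "whichstation": ("record",),
    "lat": ("station",),
    "lon": ("station",),
    "station_name": ("station", "name_strlen"),
}


def group_by_outer_dimension(var_dims):
    """Group variable names into 'Structures' keyed by outer dimension."""
    structures = defaultdict(list)
    for name, dims in var_dims.items():
        if dims:  # scalar variables have no outer dimension and are skipped
            structures[dims[0]].append(name)
    return dict(structures)


structures = group_by_outer_dimension(variables)
for outer_dim, members in structures.items():
    print(outer_dim, members)
```

Here the variables on "station" form the Station Structure, and those on "record" form the per-observation Structure.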
Anyway, it would be great to get some other heads onto this, especially those
who have written or need to write this kind of point observation data. If we
can get 3 or 4 interested parties, we could put together a real proposal for CF.
Thanks again, Jonathan!
CF-metadata mailing list