[cf-pointobsconvention] [Fwd: Re: [CF-metadata] Seeking example program for storing surface obs in CF convention]

Below is an earlier conversation with Jonathan Gregory that would be good to 
review.

-------- Original Message --------
Subject: Re: [CF-metadata] Seeking example program for storing surface obs in CF convention
Date: Thu, 09 Aug 2007 11:11:06 -0600
From: John Caron <caron@xxxxxxxxxxxxxxxx>
To: Jonathan Gregory <j.m.gregory@xxxxxxxxxxxxx>
CC: cf-metadata@xxxxxxxxxxxx
References: <20070808080651.GA11219@xxxxxxxxxxxxxxxxx>

Hi Jonathan:

Thanks for taking the time to look at this. Comments are inline.

Jonathan Gregory wrote:
Dear John

My own opinion is that CF is not currently adequate for writing observational data to NetCDF. The basic limitation in section 5.4 is that

  float humidity(time,pressure,station);
  float pressure(pressure);
  double time(time);

requires the same number and values of the time and pressure coordinates at each station.

Yes, this is wasteful of space if you make all the stations share the
coordinate variables but they don't all have info at all (time,pressure)
points. Alternatively you have to create separate coordinate variables for
each station, which may be inconvenient.

If we put them in common variables, if I have understood your proposal, I
prefer the contiguous arrangement, something like this:

dimensions:
  record=UNLIMITED;
  station=5;
  stringlen;
variables:
  char station_name(station,stringlen);
  float latitude(station);
  float longitude(station);
  double time(record);
  float humidity(record);
    humidity:coordinates="time";
  float temperature(record);
    temperature:coordinates="time";

where the individual stations are contiguous in the humidity and temperature
variables. Then the question is how to indicate the range of records which
belongs to each station. One way, as in your example, is to provide an array
of start or end pointers into the records. Another way, which takes up a bit
more space but could be more convenient for using the data, would be to include

  int whichstation(record);
    whichstation:coordinate_index="station";

where the presence of the coordinate_index attribute indicates that the value
of whichstation is an index into the station coordinate dimension. whichstation
could be identified as an auxiliary coordinate variable by naming it in the
coordinates attribute:

  float humidity(record);
    humidity:coordinates="time whichstation";

E.g. if you have two timeseries, one with temperature data (1.1, 1.2, 1.3) and
the other with data (2.1, 2.2), you would have:

data:
  temperature=1.1, 1.2, 1.3, 2.1, 2.2;
  whichstation=0, 0, 0, 1, 1;

If it is done this way, rather than with start pointers, the individual
timeseries actually do not have to be stored contiguously, so any of them can
be appended to at any time. That might be a useful feature.
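
As a minimal sketch of how a reader might use such an index variable, the Python below regroups the flat record array into one series per station. It uses plain lists rather than a NetCDF library, and the function name group_by_station is only illustrative, not part of any convention.

```python
from collections import defaultdict

def group_by_station(values, whichstation):
    """Collect the values belonging to each station index, in record order."""
    if len(values) != len(whichstation):
        raise ValueError("values and whichstation must have the same length")
    series = defaultdict(list)
    for value, station in zip(values, whichstation):
        series[station].append(value)
    return dict(series)

# The two-timeseries example from the text, stored contiguously.
temperature = [1.1, 1.2, 1.3, 2.1, 2.2]
whichstation = [0, 0, 0, 1, 1]
by_station = group_by_station(temperature, whichstation)

# The same data with the stations interleaved, as could happen when
# records are appended to either timeseries over time.
interleaved = group_by_station([1.1, 2.1, 1.2, 2.2, 1.3], [0, 1, 0, 1, 0])
```

Both layouts yield the same per-station series, which is why the index approach does not require contiguous storage.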

Yes, I think it's a good alternative to have each record refer to its owning station, so that the links don't have to be maintained. The parent/child linked (and contiguous array) variant is useful for finding the data quickly; otherwise you have to read through all of the data to find the records for one station (or a small subset of stations).
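
The trade-off can be sketched in Python as follows. This assumes, purely for illustration, a start array with one trailing entry equal to the total record count; neither that layout nor the function names come from the Convention.

```python
def station_slice(values, start, station):
    """Contiguous layout: the records of a station are one slice.

    start[i] is the first record of station i, and start has one extra
    trailing entry holding the total number of records (an assumption
    made for this sketch).
    """
    return values[start[station]:start[station + 1]]

def station_scan(values, whichstation, station):
    """Index layout: every record must be inspected to find one station."""
    return [v for v, s in zip(values, whichstation) if s == station]

temperature = [1.1, 1.2, 1.3, 2.1, 2.2]
start = [0, 3, 5]               # station 0 -> records 0..2, station 1 -> 3..4
whichstation = [0, 0, 0, 1, 1]

fast = station_slice(temperature, start, 1)
slow = station_scan(temperature, whichstation, 1)
```

The slice touches only the records it returns, while the scan reads the whole index variable regardless of how few records match.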

The reference to the station could be either by index or by name; in our typical files of this type, a few bytes won't matter much.


Your proposal appears to me to introduce several extra features which are
redundant or duplicating other CF attributes. The _CoordinateAxisType attr
has the same function as the CF axis attribute. I don't see the need for the
global attributes latitude_coordinate etc. since the lat etc. coordinates can be identified by units and by standard_name; also, having a *global* attr
restricts the file to having only *one* coord variable of each type. The
attributes giving the max and min of each of the coordinates contain info
which can be deduced from the coord variables themselves, of course; is that
an important kind of discovery metadata? I'd be worried about it because it
is almost certain to be wrong some of the time i.e. inconsistent with the
coord variables. The cdm_datatype attribute implies a distinction between
various kinds of data which are formally not really different and would be
processed in the same way, so I don't see why this is useful.

The Convention wasn't intended to be a proposal for CF, just a stand-alone Convention for this type of data, so we made it rather broad to cover several existing data formats. So there is likely to be some redundancy, and I guess the next step is to decide which parts should be added to CF.

The _CoordinateAxisType enumeration is intended to be a complete listing of georeferencing axis types. We use them instead of parsing the units, looking for "positive", looking for standard names, and the other ways of identifying coordinate axes that have evolved out of COARDS/CF. They are certainly redundant with all of that.

The min/max values are a kind of discovery metadata. We also use them to tell the user what are the possible valid space/time queries on this dataset. Again, this is an optimization for reading/serving data that obviates having to read through the entire file.
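
A rough Python sketch of both sides of this point: the stored range lets a server reject a query without reading the data, and a simple check can catch the inconsistency Jonathan is worried about. The function names are hypothetical, and the stored values are passed in directly rather than read from any particular attribute.

```python
def coordinate_range(values):
    """Compute the (min, max) actually present in a coordinate variable."""
    return min(values), max(values)

def range_is_consistent(stored_min, stored_max, values):
    """Check stored min/max metadata against the coordinate values."""
    return (stored_min, stored_max) == coordinate_range(values)

def query_may_match(stored_min, stored_max, query_lo, query_hi):
    """True if the query interval overlaps the advertised range.

    Uses only the stored metadata, so no coordinate data is read.
    """
    return query_lo <= stored_max and query_hi >= stored_min

time_values = [0.0, 6.0, 12.0]      # illustrative coordinate values
stored_min, stored_max = 0.0, 12.0  # what the file advertises

consistent = range_is_consistent(stored_min, stored_max, time_values)
inside = query_may_match(stored_min, stored_max, 5.0, 7.0)
outside = query_may_match(stored_min, stored_max, 13.0, 20.0)
```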

The cdm_datatype reflects our experience in how to describe kinds of data ("scientific data types"). This has been a long and ongoing evolution of our understanding. For example, the coordinate system for a "time series of point data" looks just like "trajectory" data, so we use the cdm_datatype to disambiguate. It essentially describes the connectivity of the points. It's needed by visualizers, and useful for discovery.

Our "Observation Convention" introduces the notion of grouping variables into "Structures" by specifying that all variables with a common outer dimension are part of the structure. This works especially well for the record dimension, where the variables really are a Structure (that is, all record variables are stored contiguously for record 0, then record 1, etc.). It's also useful for non-record dimensions, e.g. all variables whose outer dimension is "station" comprise the "Station Structure".
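
The grouping rule can be sketched in a few lines of Python. The variable list is taken from Jonathan's example above; the function name and the dictionary representation are just assumptions for illustration, not how any library implements it.

```python
def group_into_structures(variables):
    """Group variable names by their outermost dimension.

    variables maps each variable name to a tuple of its dimension
    names, outermost first. Scalar variables (no dimensions) are skipped.
    """
    structures = {}
    for name, dims in variables.items():
        if not dims:
            continue
        structures.setdefault(dims[0], []).append(name)
    return structures

variables = {
    "station_name": ("station", "stringlen"),
    "latitude": ("station",),
    "longitude": ("station",),
    "time": ("record",),
    "humidity": ("record",),
    "temperature": ("record",),
}

structures = group_into_structures(variables)
```

Here the "station" group holds station_name, latitude and longitude, and the "record" group holds time, humidity and temperature.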

Anyway, it would be great to get some other heads onto this, especially those who have written or need to write this kind of point observation data. If we can get 3 or 4 interested parties, we could put together a real proposal for CF.

Thanks again, Jonathan!

John