Re: [cf-pointobsconvention] Draft 2

John-

Thanks for getting this going.  I think you're on the right track.

John Caron wrote:
Attached is a PDF of second draft of what my thoughts are on a point ob s
convention. Its not really a proposal (yet), but a reasonable place to

start.

A couple of points: :-)

- there should be no requirement of a z dimension for point data or
trajectories.  ship tracks, hurricane tracks, etc have no z dimension.
We've already run into this problem with the existing nc2.2 trajectory
implementation.

- "profiler" in the atmospheric community is used extensively for the
name of an instrument that produces "profiles".  I would recommend
changing your "profiler" to "profile" to remove this confusion.  A
RAOB sounding is a profile, but people do not refer to the instrument
as a profiler necessarily.

- RAOB sounding observations are currently broken out based on the
type of observation so that you have multiple profiles for the same
station at the same time (mandatory levels, significant temperature
levels, significant wind levels, tropopause levels, etc).  The number
of levels of each (paricularly the significant level data) varies from
station to station and time to time.  How would that fit into this
model?  See the current netCDF sounding files on motherlode for examples.

-  while not necessarily a part of the convention, implementations
of these need easy factory methods for taking a set of point
observations and creating trajectories based on a particular variable
(e.g station name or id).  For example, if I have a collection of ship
reports for  multiple times, I'd like to be able to view this as point
observations or individual tracks for each ship.

- is there some explicit expectation that the z dimension is
convertible with altitude units (m,ft, etc).  Satellite wind
observations can have pressure as the vertical dimension so that
needs to be supported.

Don

Point Observation Data
Draft 2
09/11/07

This is a convention for writing collections of point observations to a
netCDF file. This builds on section 5 of the CF-1.0 document, replacing
section 5.4 and 5.5 with a more general convention.

A point observation is a data measurement at a specific time and
location. Each kind of measured data is placed in a data variable. The
time and location values are placed into coordinate variables and
auxiliary coordinate variables.
The starting idea, as described in section 5, is to use the coordinates
attribute to associate auxiliary coordinate variables with the data
variables. For example, consider an unconnected collection of points
where ozone has been sampled:
dimensions:
 sample = 1000 ;

variables:
 float O3(sample) ;
   O3:long_name = "ozone concentration";
   O3:units = "1e-9" ;
   O3:coordinates = "lon lat z time" ;

 double time(sample) ;
   time:long_name = "time" ;
   time:units = "days since 1970-01-01 00:00:00" ;

 float lon(sample) ;
   lon:long_name = "longitude" ;
   lon:units = "degrees_east" ;

 float lat(sample) ;
   lat:long_name = "latitude" ;
   lat:units = "degrees_north" ;

 float z(sample) ;
   z:long_name = "height above mean sea level" ;
   z:units = "km" ;
   z:positive = "up" ;

In this example, there are 1000 points in the collection, and we have
chosen to name the dimension sample to clarify the distinction between
collection dimensions and coordinates. The coordinates of the ith sample
are time(i), lon(i), lat(i) and z(i).
When the data is time ordered, its natural to use time as the sample
dimension:

dimensions:
 time = 1000 ;

variables:
 float O3(time) ;
   O3:long_name = "ozone concentration";
   O3:units = "1e-9" ;
   O3:coordinates = "lon lat z time" ;

 double time(time) ;
   time:long_name = "time" ;
   time:units = "days since 1970-01-01 00:00:00" ;

 float lon(time) ;
   lon:long_name = "longitude" ;
   lon:units = "degrees_east" ;

 float lat(time) ;
   lat:long_name = "latitude" ;
   lat:units = "degrees_north" ;

 float z(time) ;
   z:long_name = "height above mean sea level" ;
   z:units = "km" ;
   z:positive = "up" ;

Because time is now a coordinate variable, its values should be strictly
monotonic (i.e. the data is sorted by time). Formally, you no longer
need to include time in the coordinates attribute, since it is known to
be a coordinate. However, a suggested idiom is to list all coordinates
in the coordinates attribute, for clarity.

Data variables may have other dimensions. The following has a 3D wind
vector and a character array:

dimensions:
 sample = 1000;
 wind_vector = 3;
 inst_name_strlen = 23;

variables:

 float wind(sample, wind_vector);
   wind:long_name = "3D wind";
   wind:units = "m/s";
   wind:coordinates = "lon lat z time";

 char inst_name(sample, inst_name_strlen);
   inst_name:long_name = "instrument name";
   inst_name:coordinates = "lon lat z time" ;

We define profile observation data as point data that has a vertical
dimension in the data, with a constant lat/lon (or x/y) location, for
example:

dimensions:
 sample = 1000 ;

variables:
 float O3(sample, z) ;
   O3:long_name = "ozone concentration";
   O3:units = "1e-9" ;
   O3:coordinates = "lon lat z time" ;

 double time(sample) ;
   time:long_name = "time" ;
   time:units = "days since 1970-01-01 00:00:00" ;

 float lon(sample) ;
   lon:long_name = "longitude" ;
   lon:units = "degrees_east" ;

 float lat(sample) ;
   lat:long_name = "latitude" ;
   lat:units = "degrees_north" ;

 float z(sample, z) ;
   z:long_name = "height above mean sea level" ;
   z:units = "km" ;
   z:positive = "up" ;

In the above example each sample has the same number of z coordinates,
but (possibly) different z coordinate values, creating the 2D z
coordinate. For the case where all samples have exactly the same z
coordinate values, it is more efficient, and better to use:

 float z(z) ;
   z:long_name = "height above mean sea level" ;
   z:units = "km" ;
   z:positive = "up" ;


There is an important restriction on how an auxiliary coordinate
connects to the data variable: the dimensions of the auxiliary
coordinate must be a subset of the dimensions of any data variable that
uses it. So z(sample, z) and z(z) are ok as an auxiliary coordinate for
O3(sample, z), but neither could be an auxiliary coordinate for, say,
O3( time).

Time series of station data

Suppose that point data is taken at a set of named locations called
stations. The set of observations at a particular station, if ordered by
time, becomes a time series, and the file is a collection of time series
of station data. In this case one could use:

dimensions:
 station = 10 ;  // measurement locations
 pressure = 11 ; // pressure levels
 time = UNLIMITED ;

variables:
 float humidity(time, pressure, station) ;
   humidity:long_name = "specific humidity" ;
   humidity:units = "" ;
   humidity:coordinates = "lat lon pressure time" ;

 double time(time) ;
   time:long_name = "time of measurement" ;
   time:units = "days since 1970-01-01 00:00:00" ;

 float lon(station) ;
   lon:long_name = "station longitude";
   lon:units = "degrees_east";

 float lat(station) ;
   lat:long_name = "station latitude" ;
   lat:units = "degrees_north" ;

 float pressure(pressure) ;
   pressure:long_name = "pressure" ;
   pressure:units = "hPa" ;


There are two problems with this scheme. The first is that each station
has the same number of samples (times) allocated to it. This is called a
rectangular array. When stations have different numbers of samples, one
is forced to allocate the maximum sample size, and use missing data
values. In this example, the amount of wasted data is exacerbated by
having a vertical (pressure) dimension in the data. Further, if the
pressure coordinate variable can vary, one must use:

 float pressure(time, pressure, station) ;
   pressure:long_name = "pressure" ;
   pressure:units = "hPa" ;

The second problem in this example is that the coordinate values for
time are required to be the same for each set of measurements at each
station. This can be fixed, however, by using

 double time(station, time) ;
   time:long_name = "time of measurement" ;
   time:units = "days since 1970-01-01 00:00:00" ;


As we try to represent more complicated arrangements of point
observations, this issue of rectangular arrays often appears.

A different way to handle variable number of samples at each station is
to remove the station dimension from the data variables, and keep track
of the station index for each observation in a separate variable:
dimensions:
 station = 10 ;  // measurement locations
 pressure = 11 ; // pressure levels
 profile = UNLIMITED ;

variables:
 float humidity(profile, pressure) ;
   humidity:long_name = "specific humidity" ;
   humidity:coordinates = "lat lon pressure time" ;

 int station_index(profile) ;
   station_index:long_name = "index into station dimension";

 double time(profile) ;
   time:long_name = "time of measurement" ;
   time:units = "days since 1970-01-01 00:00:00" ;

 float lon(station) ;
   lon:long_name = "station longitude";
   lon:units = "degrees_east";

 float lat(station) ;
   lat:long_name = "station latitude" ;
   lat:units = "degrees_north" ;

If the pressure coordinate is constant, then

 float pressure(pressure) ;
   pressure:long_name = "pressure" ;
   pressure:units = "hPa" ;

If the pressure coordinate can vary for each profile:

 float pressure(profile, pressure) ;
   pressure:long_name = "pressure" ;
   pressure:units = "hPa" ;

If its fixed for each station, you’d like to use:

 float pressure(station, pressure) ;
   pressure:long_name = "pressure" ;
   pressure:units = "hPa" ;

The station_index variable associates the ith profile with the station
at index station_index(i). But lat and lon can no longer be considered
auxiliary coordinate variables, since they use a dimension that is not
present in the data variable. Instead, there is an extra level of
indirection represented by the station_index variable.  So we are really
generalizing past previous notions of coordinate variables and auxiliary
coordinate variables.

Instead of making coordinate variables more complicated, we are going to
generalize the underlying data model, using concepts from relational
databases. In addition to the fundamental data type of multidimensional
array, we add the data type table, where a table is a collection of
variables with the same outer dimension. We then define an index join as
connecting two tables using a variable in one table that holds dimension
indices into the second table. Dimension indices are zero based.

Returning to our time series of station data example, we can create a
new notation using tables. All variables with the same outer dimension,
such as:

 float humidity(profile, pressure) ;
   humidity:long_name = "specific humidity" ;
 float temperature(profile, pressure) ;
   temperature:long_name = "air temperature" ;
 float pressure(profile, pressure) ;
   pressure:long_name = "pressure" ;

 int station_index(profile) ;
 double time(profile) ;

are rewritten as:

 table {
   float humidity(pressure) ;
     humidity:long_name = "specific humidity" ;
   float temperature(pressure) ;
     temperature:long_name = "air temperature" ;
   float pressure(pressure);
     pressure:long_name = "pressure" ;

   int station_index;
   double time;

 } profile (profile);

So a “table variable” is created that uses the profile (outer)
dimension. All the variables that have that outer dimension become part
of the table. Similarly for the station table (for clarity, we stop
showing the attributes):

 table {
   float humidity(pressure) ;
   float temperature(pressure) ;
   float pressure(pressure);
   int station_index;
   double time;
 } profile(profile);

 table {
   float lon;
   float lat;
  } station (station);

To specify the index join, if we wanted to write pseudo-SQL, we could say
 JOIN profile TO station WITH profile.station_index

where profile and station specify tables with the corresponding
dimension, and station_index is a variable in the profile table whose
values are indices in the station table. In other words:

 JOIN <child dimension> TO <parent dimension> WITH <child.variable>

Of course, none of this is in the netCDF file, it’s just a short hand
notation.

Another compact and useful notation is to consider that the tables are
nested, and to ignore the mechanism by which the nesting occurs:

 table {
   float lon;
   float lat;

   table {
     double time;

     float humidity(pressure) ;
     float temperature(pressure) ;
     float pressure(pressure);
   } profile (*);

  } station (station);

Here the (*) denotes a variable length dimension. All of the profiles
inside of a station table are for that station. Note that because we are
using a fixed pressure dimension, all profiles have a fixed number of
pressure levels. The values of those pressure levels can vary from
profile to profile. If the pressure levels were fixed at each station,
you would have:

 table {
   float lon;
   float lat;
   float pressure(pressure);

   table {
     double time;

     float humidity(pressure) ;
     float temperature(pressure) ;
   } profile (*);

  } station (station);

If the pressure levels were fixed for all profiles:

 float pressure(pressure);
 table {
   float lon;
   float lat;

   table {
     double time;

     float humidity(pressure) ;
     float temperature(pressure) ;
   } profile (*);

  } station (station);

If the number of pressure levels could vary from profile to profile, we
are back in the situation of having to set a maximum, then using missing
values. Applying the same principles as before we can create another
table, for example (using nested table notation):

 table {
   float lon;
   float lat;

   table {
     double time;

     table {
       float humidity;
       float temperature;
       float pressure
     obs(*);

   } profile (*);

  } station (station);

OR using table notation:

 table {
   float humidity ;
   float temperature;
   float pressure;
   int profile_index;
 } obs (obs);

 table {
   int station_index;
   double time;
 } profile (profile);

 table {
   float lon;
   float lat;
 } station (station);

OR using CDL:

 float humidity(obs);
 float temperature(obs);
 float pressure(obs);
 int profile_index(obs);

 double time(profile);
 double station_index(profile);

 double lat(station);
 double lon(station);

As you can see, there’s a mechanical conversion between these 3
notations (CDL, tables, nested tables).
Using the Unlimited Dimension

The use of the unlimited dimension in the netcdf-3 file format warrants
attention because it can have a strong effect on performance. Consider
the following example:

dimensions:
 station = 4021 ;  // measurement locations
 pressure = 30 ; // pressure levels
 time = UNLIMITED ; // currently 117987

variables:

 float humidity(time, pressure) ;
 float temperature(time, pressure) ;
 float pressure(time, pressure) ;
 int time(time) ;
 int station_index(time) ;

 char name(station, name_strlen);
 char desc(station, desc_strlen);
 double lat(station);
 double lon(station);
 double alt(station);

All of the variables using the time dimension are called record
variables because they use the unlimited (record) dimension.
The layout of the netCDF-3 file format is simple: first the header is
written, then the non-record variables are each written, then the record
variables are written. Non-record variables are written in the order
they are defined. The entire space must be allocated for them at define
time, which is why their dimension sizes cant change. Record variables
are written one record at a time, where record 0 has all the record
variable values for index=0, then record 1 with all the record variable
values for index=1, etc. The unlimited dimension can thus grow by
appending to the file.
Since the file layout is quite different depending whether the unlimited
dimension is used, the performance of reading the data can be quite
different. In a worse case scenario, for large files, you might see a
factor of 100 performance difference, depending on your read access
pattern (the actual times are highly dependent on the caching strategy
of the underlying file system). So it is sometimes necessary to
understand what the common read pattern is and to optimize the file
layout for it.
Using the record dimension is often very useful when writing data that
arrives sequentially, since the new data can simply be appended to the
file, and you don’t need to know ahead of time how many records there
will be.

The decision to use the record dimension or not must not effect the data
type or the semantics of the data – only access efficiency.
Creating Fast Access to Children

Given a row in a child table, one finds the parent using the parent
index variable. However, one must read the entire parent index variable
to find all of the child rows for a given parent row. For efficiency,
one can optionally add a way to quickly find all of the child rows for a
given parent row, using a linked list or a contiguous list.

A contiguous list places all children in contiguous rows, and then adds
firstChild and numChildren variables in the parent table which hold
dimension indices into the child table. For the ith parent row, all its
children are found at the indices between firstChild(i) and
firstChild(i) + numChildren(i). This method is recommended as the most
efficient way to read all the child rows for a parent, since they are
stored contiguously.

A forward linked list adds a firstChild in the parent table and
nextChild variable in the child table, which hold dimension indices into
the child table. One reads the firstChild row and follows the links in
nextChild until the dimension index is less than 0, indicating the end
of the linked list. This method is recommended when writing data for
multiple parents at once, when the total number of children is unknown,
so a contiguous list is not possible.

A backwards linked list adds a lastChild in the parent table and
prevChild variable in the child table, again which hold dimension
indices into the child table. One reads the lastChild row and follows
the links in prevChild until the dimension index is less than 0,
indicating the end of the linked list. This method is recommended for
real-time data arriving serially and unpredictably, since one only has
track the last child for each parent in memory and append the new
record, then update the lastChild array when the data has all been
received. With a forward linked list, one must also rewrite the previou
record.

Remember that dimension indices are 0 based.
Specifying the type of data

The table data type and technique of connecting tables through dimension
index variables is quite general and should be useful for many kinds of
data in any domain of science.

Experience has shown that it’s important for visualization and analysis
tools and for human understanding to classify data into broad categories
based on the topology of the collection. We call these data types. We
haven't found a systematic or rigorous classification scheme; rather
these reflect our experience with observational datasets in the earth
sciences, strongly influenced by the type of measuring instruments used

While one could imagine everything as merely a collection of points, it
is usually necessary to take advantage of whatever structure is found in
the data. The structure of the data and coordinate systems ideally
reflects the connectedness (a.k.a. topology ) of the measurements. This
connectedness is not always able to be ascertained by inspecting the
structure of the coordinate systems. For example, trajectories and point
data have the same structure.

The set of data types we propose to standardize in the convention are:

    Collection of point data (unconnected x,y,z,t) Examples: earthquake data.
    Collection of trajectories (connected x,y,z,t, ordered t) Examples: 
aircraft data, drifting buoy.
    Collection of profiler data  (unconnected x,y,t, connected z) Examples: 
satellite profiles.
    Station collection of point (unconnected x,y,z, connected t) Examples: 
metars.
    Station collection of profilers (unconnected x,y; connected z, connected t) 
Examples: profilers.

These mostly fit the form (Collection | Station Collection) of (Point |
Profile | Trajectory). Others that might be needed:

    Trajectories of sounding (connected x,y,z,t, ordered z, ordered t)
Examples: ship soundings.

CDL Examples

Collection of point data

variables;
 float lon(obs);
 float lat(obs);
 float z(obs);
 double time(obs);

 float humidity(obs);
 float temperature(obs);
 float pressure(obs);
   pressure:coordinates = “lon lat z time”;

attributes:
 :Conventions = “CF-1.1”;
 :CF_datatype = “Collection of point data”;
 :CF_datatype = “point”;
 :CF_table = “obs”;

Collection of profiler data (rectangular)

variables;
 float lon(obs);
 float lat(obs);
 float z(obs, z); // or z(z)
 double time(obs);

 float humidity(obs, z);
 float temperature(obs, z);
 float pressure(obs, z);
   pressure:coordinates = “lon lat z time”;

attributes:
 :Conventions = “CF-1.1”;
 :CF_datatype = “Collection of profiler data”;
 :CF_datatype = “profiler”;
 :CF_table = “obs”;

Collection of Trajectories

variables;
 float lon(obs);
 float lat(obs);
 float z(obs);
 double time(obs);

 float humidity(obs);
 float temperature(obs);
 float pressure(obs);
   pressure:coordinates = “lon lat z time”;

 int trajectory_id(obs); // unneeded if only one trajectory LOOK

attributes:
 :Conventions = “CF-1.1”;
 :CF_datatype = “Collection of trajectory data”;
 :CF_datatype = “trajectory”;
 :CF_table = “obs”;


Collection of Trajectories of Sounding (rectangular)

variables;
 float lon(sounding);
 float lat(sounding);
 double time(sounding);
 float z(sounding, z); // or z(z)

 float humidity(sounding, z);
 float temperature(sounding, z);
 float pressure(sounding, z);
   pressure:coordinates = “lon lat z time”;

 int trajectory_index(sounding); // unneeded if only one trajectory

 char ship_name( trajectory, ship_name_strlen) ;
 char instrument( trajectory, instrument_strlen) ;


attributes:
 :Conventions = “CF-1.1”;
 :CF_datatype = “Collection of trajectory of sounding data”;
 :CF_datatype = “trajectory of sounding”;
 :CF_table = “JOIN sounding TO trajectory WITH trajectory_index”;


Collection of Trajectories of Soundings (variable z)

Variables:
 float salinity(obs) ;
 float temperature(obs) ;
 float pressure(obs) ;
 double time(obs) ;
 int sounding_index(obs) ;

 float lat(sounding) ;
 float lon(sounding) ;
 int trajectory_index(sounding) ;

 char ship_name( trajectory, ship_name_strlen) ;
 char instrument( trajectory, instrument_strlen) ;

attributes:
 :Conventions = “CF-1.1”;
 :CF_datatype = “Collection of trajectory of sounding data”;
 :CF_datatype = “trajectory of sounding”;
 :CF_table = “JOIN sounding TO trajectory WITH trajectory_index AND JOIN
obs TO sounding WITH sounding_index”;


Station Collection of Point

 float humidity(obs);
 float temperature(obs);
 float pressure(obs);

 double time(obs);
 double station_index(obs);

 double lat(station);
 double lon(station);

attributes:
 :Conventions = “CF-1.1”;
 :CF_datatype = “Station Collection of point”;
 :CF_datatype = “Station”;
 :CF_table = “JOIN obs TO station WITH station_index”;

Station Collection of Profilers (fixed length)

 float humidity(profile, z);
 float temperature(profile, z);
 float pressure(profile, z);

 double time(profile);
 double station_index(profile);

 double lat(station);
 double lon(station);

attributes:
 :Conventions = “CF-1.1”;
 :CF_datatype = “Station Profilers”;
 :CF_datatype = “Station Collection of Profiler”;
 :CF_table = “JOIN profile TO station WITH station_index”;

Station Collection of Profilers (variable length)

 float humidity(obs);
 float temperature(obs);
 float pressure(obs);
 int profile_index(obs);

 double time(profile);
 double station_index(profile);

 double lat(station);
 double lon(station);

attributes:
 :Conventions = “CF-1.1”;
 :CF_datatype = “Station Profilers”;
 :CF_datatype = “Station Collection of Profiler”;
 :CF_table = “JOIN profile TO station WITH station_index AND JOIN obs TO
profile WITH profile_index”;




Still To Do:

    Decide on the mechanism by which the join is specified. Do we really want 
“pseudo-SQL” ?
    Specify the datatypes globally or ??
    What do you put the :coordinate attribute on? All data variables would 
follow existing CF. Then you have a redundant system somewhat.
    Sorting: when can you count on it being sorted? Eg time series in station 
data. Required or optional?


------------------------------------------------------------------------

_______________________________________________
cf-pointobsconvention mailing list
cf-pointobsconvention@xxxxxxxxxxxxxxxx
For list information or to unsubscribe,  visit: 
http://www.unidata.ucar.edu/mailing_lists/

--
*************************************************************
Don Murray                               UCAR Unidata Program
dmurray@xxxxxxxxxxxxxxxx                        P.O. Box 3000
(303) 497-8628                              Boulder, CO 80307
http://www.unidata.ucar.edu/staff/donm
*************************************************************