Re: [cf-pointobsconvention] Draft 2

A few comments:

1) The profile observation data CDL needs a "z" dimension:

z = 35 ;

2) For the "Creating Fast Access to Children" contiguous list indices, you should change it to "firstChild(i) and firstChild(i) + numChildren(i) -1" or specify that it is an inclusive/exclusive range.

3) In the list of data types, I don't find the definitions very clear. I think the definitions would benefit from starting with definitions of the base types and working up to the various types of collections for those base types. I think the problem for me is that the connectedness doesn't seem clear when we are talking about collections of the base types. Maybe:

Base types:
- Point data: a set of parameters measured at a time and location (fixed x, y, z, t) - Profile: an ordered set of data points along a vertical line at a single instant in time (fixed x, y, t; connected in z). - Trajectory: an ordered set of data points along a curve in time and space (connected in x, y, z, t; ordered in t).

Collections of base types:
- Collection of point data: a set of data points distributed in space and time (unconnected in x, y, z, t)
- Collection of profile data: What is the connectedness between profiles?
   - the profiles are not connected (point collection of profiles?)
- the profiles are all for the same location/station and ordered in time (a time series collection of profiles?) - the profiles are from many stations but all for the same time (station collection of profiles?) - Collection of trajectory data: Same here. What is the connectedness between trajectories?

Ethan

John Caron wrote:
Attached is a PDF of second draft of what my thoughts are on a point obs convention. Its not really a proposal (yet), but a reasonable place to start.

---

Point Observation Data
Draft 2
09/11/07

This is a convention for writing collections of point observations to a netCDF file. This builds on section 5 of the CF-1.0 document, replacing section 5.4 and 5.5 with a more general convention.

A point observation is a data measurement at a specific time and location. Each kind of measured data is placed in a data variable. The time and location values are placed into coordinate variables and auxiliary coordinate variables. The starting idea, as described in section 5, is to use the coordinates attribute to associate auxiliary coordinate variables with the data variables. For example, consider an unconnected collection of points where ozone has been sampled:
dimensions:
 sample = 1000 ;

variables:
 float O3(sample) ;
   O3:long_name = "ozone concentration";
   O3:units = "1e-9" ;
   O3:coordinates = "lon lat z time" ;

 double time(sample) ;
   time:long_name = "time" ;
   time:units = "days since 1970-01-01 00:00:00" ;

 float lon(sample) ;
   lon:long_name = "longitude" ;
   lon:units = "degrees_east" ;

 float lat(sample) ;
   lat:long_name = "latitude" ;
   lat:units = "degrees_north" ;

 float z(sample) ;
   z:long_name = "height above mean sea level" ;
   z:units = "km" ;
   z:positive = "up" ;

In this example, there are 1000 points in the collection, and we have chosen to name the dimension sample to clarify the distinction between collection dimensions and coordinates. The coordinates of the ith sample are time(i), lon(i), lat(i) and z(i). When the data is time ordered, its natural to use time as the sample dimension:

dimensions:
 time = 1000 ;

variables:
 float O3(time) ;
   O3:long_name = "ozone concentration";
   O3:units = "1e-9" ;
   O3:coordinates = "lon lat z time" ;

 double time(time) ;
   time:long_name = "time" ;
   time:units = "days since 1970-01-01 00:00:00" ;

 float lon(time) ;
   lon:long_name = "longitude" ;
   lon:units = "degrees_east" ;

 float lat(time) ;
   lat:long_name = "latitude" ;
   lat:units = "degrees_north" ;

 float z(time) ;
   z:long_name = "height above mean sea level" ;
   z:units = "km" ;
   z:positive = "up" ;

Because time is now a coordinate variable, its values should be strictly monotonic (i.e. the data is sorted by time). Formally, you no longer need to include time in the coordinates attribute, since it is known to be a coordinate. However, a suggested idiom is to list all coordinates in the coordinates attribute, for clarity.

Data variables may have other dimensions. The following has a 3D wind vector and a character array:

dimensions:
 sample = 1000;
 wind_vector = 3;
 inst_name_strlen = 23;

variables:

 float wind(sample, wind_vector);
   wind:long_name = "3D wind";
   wind:units = "m/s";
   wind:coordinates = "lon lat z time";

 char inst_name(sample, inst_name_strlen);
   inst_name:long_name = "instrument name";
   inst_name:coordinates = "lon lat z time" ;

We define profile observation data as point data that has a vertical dimension in the data, with a constant lat/lon (or x/y) location, for example:

dimensions:
 sample = 1000 ;

variables:
 float O3(sample, z) ;
   O3:long_name = "ozone concentration";
   O3:units = "1e-9" ;
   O3:coordinates = "lon lat z time" ;

 double time(sample) ;
   time:long_name = "time" ;
   time:units = "days since 1970-01-01 00:00:00" ;

 float lon(sample) ;
   lon:long_name = "longitude" ;
   lon:units = "degrees_east" ;

 float lat(sample) ;
   lat:long_name = "latitude" ;
   lat:units = "degrees_north" ;

 float z(sample, z) ;
   z:long_name = "height above mean sea level" ;
   z:units = "km" ;
   z:positive = "up" ;

In the above example each sample has the same number of z coordinates, but (possibly) different z coordinate values, creating the 2D z coordinate. For the case where all samples have exactly the same z coordinate values, it is more efficient, and better to use:

 float z(z) ;
   z:long_name = "height above mean sea level" ;
   z:units = "km" ;
   z:positive = "up" ;


There is an important restriction on how an auxiliary coordinate connects to the data variable: the dimensions of the auxiliary coordinate must be a subset of the dimensions of any data variable that uses it. So z(sample, z) and z(z) are ok as an auxiliary coordinate for O3(sample, z), but neither could be an auxiliary coordinate for, say, O3( time).

Time series of station data

Suppose that point data is taken at a set of named locations called stations. The set of observations at a particular station, if ordered by time, becomes a time series, and the file is a collection of time series of station data. In this case one could use:

dimensions:
 station = 10 ;  // measurement locations
 pressure = 11 ; // pressure levels
 time = UNLIMITED ;

variables:
 float humidity(time, pressure, station) ;
   humidity:long_name = "specific humidity" ;
   humidity:units = "" ;
   humidity:coordinates = "lat lon pressure time" ;

 double time(time) ;
   time:long_name = "time of measurement" ;
   time:units = "days since 1970-01-01 00:00:00" ;

 float lon(station) ;
   lon:long_name = "station longitude";
   lon:units = "degrees_east";

 float lat(station) ;
   lat:long_name = "station latitude" ;
   lat:units = "degrees_north" ;

 float pressure(pressure) ;
   pressure:long_name = "pressure" ;
   pressure:units = "hPa" ;


There are two problems with this scheme. The first is that each station has the same number of samples (times) allocated to it. This is called a rectangular array. When stations have different numbers of samples, one is forced to allocate the maximum sample size, and use missing data values. In this example, the amount of wasted data is exacerbated by having a vertical (pressure) dimension in the data. Further, if the pressure coordinate variable can vary, one must use:

 float pressure(time, pressure, station) ;
   pressure:long_name = "pressure" ;
   pressure:units = "hPa" ;

The second problem in this example is that the coordinate values for time are required to be the same for each set of measurements at each station. This can be fixed, however, by using

 double time(station, time) ;
   time:long_name = "time of measurement" ;
   time:units = "days since 1970-01-01 00:00:00" ;


As we try to represent more complicated arrangements of point observations, this issue of rectangular arrays often appears.

A different way to handle variable number of samples at each station is to remove the station dimension from the data variables, and keep track of the station index for each observation in a separate variable:
dimensions:
 station = 10 ;  // measurement locations
 pressure = 11 ; // pressure levels
 profile = UNLIMITED ;

variables:
 float humidity(profile, pressure) ;
   humidity:long_name = "specific humidity" ;
   humidity:coordinates = "lat lon pressure time" ;

 int station_index(profile) ;
   station_index:long_name = "index into station dimension";

 double time(profile) ;
   time:long_name = "time of measurement" ;
   time:units = "days since 1970-01-01 00:00:00" ;

 float lon(station) ;
   lon:long_name = "station longitude";
   lon:units = "degrees_east";

 float lat(station) ;
   lat:long_name = "station latitude" ;
   lat:units = "degrees_north" ;

If the pressure coordinate is constant, then

 float pressure(pressure) ;
   pressure:long_name = "pressure" ;
   pressure:units = "hPa" ;

If the pressure coordinate can vary for each profile:

 float pressure(profile, pressure) ;
   pressure:long_name = "pressure" ;
   pressure:units = "hPa" ;

If its fixed for each station, you’d like to use:

 float pressure(station, pressure) ;
   pressure:long_name = "pressure" ;
   pressure:units = "hPa" ;

The station_index variable associates the ith profile with the station at index station_index(i). But lat and lon can no longer be considered auxiliary coordinate variables, since they use a dimension that is not present in the data variable. Instead, there is an extra level of indirection represented by the station_index variable. So we are really generalizing past previous notions of coordinate variables and auxiliary coordinate variables.

Instead of making coordinate variables more complicated, we are going to generalize the underlying data model, using concepts from relational databases. In addition to the fundamental data type of multidimensional array, we add the data type table, where a table is a collection of variables with the same outer dimension. We then define an index join as connecting two tables using a variable in one table that holds dimension indices into the second table. Dimension indices are zero based.

Returning to our time series of station data example, we can create a new notation using tables. All variables with the same outer dimension, such as:

 float humidity(profile, pressure) ;
   humidity:long_name = "specific humidity" ;
 float temperature(profile, pressure) ;
   temperature:long_name = "air temperature" ;
 float pressure(profile, pressure) ;
   pressure:long_name = "pressure" ;

 int station_index(profile) ;
 double time(profile) ;

are rewritten as:

 table {
   float humidity(pressure) ;
     humidity:long_name = "specific humidity" ;
   float temperature(pressure) ;
     temperature:long_name = "air temperature" ;
   float pressure(pressure);
     pressure:long_name = "pressure" ;

   int station_index;
   double time;

 } profile (profile);

So a “table variable” is created that uses the profile (outer) dimension. All the variables that have that outer dimension become part of the table. Similarly for the station table (for clarity, we stop showing the attributes):

 table {
   float humidity(pressure) ;
   float temperature(pressure) ;
   float pressure(pressure);
   int station_index;
   double time;
 } profile(profile);

 table {
   float lon;
   float lat;
  } station (station);

To specify the index join, if we wanted to write pseudo-SQL, we could say
 JOIN profile TO station WITH profile.station_index

where profile and station specify tables with the corresponding dimension, and station_index is a variable in the profile table whose values are indices in the station table. In other words:

 JOIN <child dimension> TO <parent dimension> WITH <child.variable>

Of course, none of this is in the netCDF file, it’s just a short hand notation.

Another compact and useful notation is to consider that the tables are nested, and to ignore the mechanism by which the nesting occurs:

 table {
   float lon;
   float lat;

   table {
     double time;

     float humidity(pressure) ;
     float temperature(pressure) ;
     float pressure(pressure);
   } profile (*);

  } station (station);

Here the (*) denotes a variable length dimension. All of the profiles inside of a station table are for that station. Note that because we are using a fixed pressure dimension, all profiles have a fixed number of pressure levels. The values of those pressure levels can vary from profile to profile. If the pressure levels were fixed at each station, you would have:

 table {
   float lon;
   float lat;
   float pressure(pressure);

   table {
     double time;

     float humidity(pressure) ;
     float temperature(pressure) ;
   } profile (*);

  } station (station);

If the pressure levels were fixed for all profiles:

 float pressure(pressure);
 table {
   float lon;
   float lat;

   table {
     double time;

     float humidity(pressure) ;
     float temperature(pressure) ;
   } profile (*);

  } station (station);

If the number of pressure levels could vary from profile to profile, we are back in the situation of having to set a maximum, then using missing values. Applying the same principles as before we can create another table, for example (using nested table notation):

 table {
   float lon;
   float lat;

   table {
     double time;

     table {
       float humidity;
       float temperature;
       float pressure
     obs(*);

   } profile (*);

  } station (station);

OR using table notation:

 table {
   float humidity ;
   float temperature;
   float pressure;
   int profile_index;
 } obs (obs);

 table {
   int station_index;
   double time;
 } profile (profile);

 table {
   float lon;
   float lat;
 } station (station);

OR using CDL:

 float humidity(obs);
 float temperature(obs);
 float pressure(obs);
 int profile_index(obs);

 double time(profile);
 double station_index(profile);

 double lat(station);
 double lon(station);

As you can see, there’s a mechanical conversion between these 3 notations (CDL, tables, nested tables).
Using the Unlimited Dimension

The use of the unlimited dimension in the netcdf-3 file format warrants attention because it can have a strong effect on performance. Consider the following example:

dimensions:
 station = 4021 ;  // measurement locations
 pressure = 30 ; // pressure levels
 time = UNLIMITED ; // currently 117987

variables:

 float humidity(time, pressure) ;
 float temperature(time, pressure) ;
 float pressure(time, pressure) ;
 int time(time) ;
 int station_index(time) ;

 char name(station, name_strlen);
 char desc(station, desc_strlen);
 double lat(station);
 double lon(station);
 double alt(station);

All of the variables using the time dimension are called record variables because they use the unlimited (record) dimension. The layout of the netCDF-3 file format is simple: first the header is written, then the non-record variables are each written, then the record variables are written. Non-record variables are written in the order they are defined. The entire space must be allocated for them at define time, which is why their dimension sizes cant change. Record variables are written one record at a time, where record 0 has all the record variable values for index=0, then record 1 with all the record variable values for index=1, etc. The unlimited dimension can thus grow by appending to the file. Since the file layout is quite different depending whether the unlimited dimension is used, the performance of reading the data can be quite different. In a worse case scenario, for large files, you might see a factor of 100 performance difference, depending on your read access pattern (the actual times are highly dependent on the caching strategy of the underlying file system). So it is sometimes necessary to understand what the common read pattern is and to optimize the file layout for it. Using the record dimension is often very useful when writing data that arrives sequentially, since the new data can simply be appended to the file, and you don’t need to know ahead of time how many records there will be.

The decision to use the record dimension or not must not effect the data type or the semantics of the data – only access efficiency.
Creating Fast Access to Children

Given a row in a child table, one finds the parent using the parent index variable. However, one must read the entire parent index variable to find all of the child rows for a given parent row. For efficiency, one can optionally add a way to quickly find all of the child rows for a given parent row, using a linked list or a contiguous list.

A contiguous list places all children in contiguous rows, and then adds firstChild and numChildren variables in the parent table which hold dimension indices into the child table. For the ith parent row, all its children are found at the indices between firstChild(i) and firstChild(i) + numChildren(i). This method is recommended as the most efficient way to read all the child rows for a parent, since they are stored contiguously.

A forward linked list adds a firstChild in the parent table and nextChild variable in the child table, which hold dimension indices into the child table. One reads the firstChild row and follows the links in nextChild until the dimension index is less than 0, indicating the end of the linked list. This method is recommended when writing data for multiple parents at once, when the total number of children is unknown, so a contiguous list is not possible.

A backwards linked list adds a lastChild in the parent table and prevChild variable in the child table, again which hold dimension indices into the child table. One reads the lastChild row and follows the links in prevChild until the dimension index is less than 0, indicating the end of the linked list. This method is recommended for real-time data arriving serially and unpredictably, since one only has track the last child for each parent in memory and append the new record, then update the lastChild array when the data has all been received. With a forward linked list, one must also rewrite the previous record.

Remember that dimension indices are 0 based.
Specifying the type of data

The table data type and technique of connecting tables through dimension index variables is quite general and should be useful for many kinds of data in any domain of science.

Experience has shown that it’s important for visualization and analysis tools and for human understanding to classify data into broad categories based on the topology of the collection. We call these data types. We haven’t found a systematic or rigorous classification scheme; rather these reflect our experience with observational datasets in the earth sciences, strongly influenced by the type of measuring instruments used.

While one could imagine everything as merely a collection of points, it is usually necessary to take advantage of whatever structure is found in the data. The structure of the data and coordinate systems ideally reflects the connectedness (a.k.a. topology ) of the measurements. This connectedness is not always able to be ascertained by inspecting the structure of the coordinate systems. For example, trajectories and point data have the same structure.

The set of data types we propose to standardize in the convention are:

• Collection of point data (unconnected x,y,z,t) Examples: earthquake data. • Collection of trajectories (connected x,y,z,t, ordered t) Examples: aircraft data, drifting buoy. • Collection of profiler data (unconnected x,y,t, connected z) Examples: satellite profiles. • Station collection of point (unconnected x,y,z, connected t) Examples: metars. • Station collection of profilers (unconnected x,y; connected z, connected t) Examples: profilers.

These mostly fit the form (Collection | Station Collection) of (Point | Profile | Trajectory). Others that might be needed:

• Trajectories of sounding (connected x,y,z,t, ordered z, ordered t) Examples: ship soundings.

CDL Examples

Collection of point data

variables;
 float lon(obs);
 float lat(obs);
 float z(obs);
 double time(obs);

 float humidity(obs);
 float temperature(obs);
 float pressure(obs);
   pressure:coordinates = “lon lat z time”;

attributes:
 :Conventions = “CF-1.1”;
 :CF_datatype = “Collection of point data”;
 :CF_datatype = “point”;
 :CF_table = “obs”;

Collection of profiler data (rectangular)

variables;
 float lon(obs);
 float lat(obs);
 float z(obs, z); // or z(z)
 double time(obs);

 float humidity(obs, z);
 float temperature(obs, z);
 float pressure(obs, z);
   pressure:coordinates = “lon lat z time”;

attributes:
 :Conventions = “CF-1.1”;
 :CF_datatype = “Collection of profiler data”;
 :CF_datatype = “profiler”;
 :CF_table = “obs”;

Collection of Trajectories

variables;
 float lon(obs);
 float lat(obs);
 float z(obs);
 double time(obs);

 float humidity(obs);
 float temperature(obs);
 float pressure(obs);
   pressure:coordinates = “lon lat z time”;

 int trajectory_id(obs); // unneeded if only one trajectory LOOK

attributes:
 :Conventions = “CF-1.1”;
 :CF_datatype = “Collection of trajectory data”;
 :CF_datatype = “trajectory”;
 :CF_table = “obs”;


Collection of Trajectories of Sounding (rectangular)

variables;
 float lon(sounding);
 float lat(sounding);
 double time(sounding);
 float z(sounding, z); // or z(z)

 float humidity(sounding, z);
 float temperature(sounding, z);
 float pressure(sounding, z);
   pressure:coordinates = “lon lat z time”;

 int trajectory_index(sounding); // unneeded if only one trajectory

 char ship_name( trajectory, ship_name_strlen) ;
 char instrument( trajectory, instrument_strlen) ;


attributes:
 :Conventions = “CF-1.1”;
 :CF_datatype = “Collection of trajectory of sounding data”;
 :CF_datatype = “trajectory of sounding”;
 :CF_table = “JOIN sounding TO trajectory WITH trajectory_index”;


Collection of Trajectories of Soundings (variable z)

Variables:
 float salinity(obs) ;
 float temperature(obs) ;
 float pressure(obs) ;
 double time(obs) ;
 int sounding_index(obs) ;

 float lat(sounding) ;
 float lon(sounding) ;
 int trajectory_index(sounding) ;

 char ship_name( trajectory, ship_name_strlen) ;
 char instrument( trajectory, instrument_strlen) ;

attributes:
 :Conventions = “CF-1.1”;
 :CF_datatype = “Collection of trajectory of sounding data”;
 :CF_datatype = “trajectory of sounding”;
:CF_table = “JOIN sounding TO trajectory WITH trajectory_index AND JOIN obs TO sounding WITH sounding_index”;


Station Collection of Point

 float humidity(obs);
 float temperature(obs);
 float pressure(obs);

 double time(obs);
 double station_index(obs);

 double lat(station);
 double lon(station);

attributes:
 :Conventions = “CF-1.1”;
 :CF_datatype = “Station Collection of point”;
 :CF_datatype = “Station”;
 :CF_table = “JOIN obs TO station WITH station_index”;

Station Collection of Profilers (fixed length)

 float humidity(profile, z);
 float temperature(profile, z);
 float pressure(profile, z);

 double time(profile);
 double station_index(profile);

 double lat(station);
 double lon(station);

attributes:
 :Conventions = “CF-1.1”;
 :CF_datatype = “Station Profilers”;
 :CF_datatype = “Station Collection of Profiler”;
 :CF_table = “JOIN profile TO station WITH station_index”;

Station Collection of Profilers (variable length)

 float humidity(obs);
 float temperature(obs);
 float pressure(obs);
 int profile_index(obs);

 double time(profile);
 double station_index(profile);

 double lat(station);
 double lon(station);

attributes:
 :Conventions = “CF-1.1”;
 :CF_datatype = “Station Profilers”;
 :CF_datatype = “Station Collection of Profiler”;
:CF_table = “JOIN profile TO station WITH station_index AND JOIN obs TO profile WITH profile_index”;




Still To Do:

• Decide on the mechanism by which the join is specified. Do we really want “pseudo-SQL” ?
•    Specify the datatypes globally or ??
• What do you put the :coordinate attribute on? All data variables would follow existing CF. Then you have a redundant system somewhat. • Sorting: when can you count on it being sorted? Eg time series in station data. Required or optional?
------------------------------------------------------------------------

_______________________________________________
cf-pointobsconvention mailing list
cf-pointobsconvention@xxxxxxxxxxxxxxxx
For list information or to unsubscribe, visit: http://www.unidata.ucar.edu/mailing_lists/

--
Ethan R. Davis                                Telephone: (303) 497-8155
Software Engineer                             Fax:       (303) 497-8690
UCAR Unidata Program Center                   E-mail:    edavis@xxxxxxxx
P.O. Box 3000
Boulder, CO  80307-3000                       http://www.unidata.ucar.edu/
---------------------------------------------------------------------------