CDM implementation of CF discrete sampling features
These notes refer to the current release of CDM 4.3. Section headers reference the CF 1.6 doc at:
Short guide to writing files using CF-1.6 Discrete Sampling Features Conventions
Step 1: Requirements
- At minimum, you must have a (lat, lon) location and a date / time coordinate for all your data. Some types also require a vertical coordinate (e.g. altitude, depth).
Step 2: Which feature type?
What kind of data do you have:
- Point (unconnected) data: data is located at different, unconnected locations. Examples: earthquake data, lightning data.
- Time series (station) data: data is located at named locations, called stations. There can be many stations, and usually for each station you have multiple data with different time coordinates. Stations have a unique identifier. Examples: weather station data, fixed buoys.
- Profile data: A series of connected observations along a vertical line. Each profile has only one lat, lon coordinate (possibly nominal), so that the points along the profile differ only in z coordinate and possibly time coordinate. There can be multiple profiles in the same file, and each profile has a unique identifier If you have many profiles with the same lat, lon location, use the Time series Profile type. Examples: atmospheric profiles from satellites, moving profilers.
- Time series (station) Profile data: Profile data at fixed locations. This is a combination of Time series type and Profile type, so one has time series of Profiles at fixed locations. A file can contain many stations and many time series at each station. Examples: profilers, balloon soundings.
- Trajectory data: A series of connected observations along a 1D curve in time and space. There can be multiple trajectories in the same file, each with a unique identifier. Examples: aircraft data, drifting buoys.
- Trajectory of Profiles: a collection of profile features which originate along a trajectory. So these are trajectories which have profile data (varying with z) at each (lat, lon) location. Examples: ship soundings.
Step 3: Which representation should I use?
- Point (unconnected) data: use CF H.1
- Time series (station) data:
- You will put only one station in the file: you may use CF H.2.3
- All stations have the exact same time coordinates: you may use CF
- Each station has almost the same number of time coordinates and the values may be different: you may use CF H.2.2
- Each station has different number of coordinates and you want to keep file size as small as possible:
- you have all the data already, and you want to optimize reading all the data for one station: you may use H.2.4
- you want to write the data as it arrives, in any order: you may use H.2.5
- Profile data:
- You will put only one profile in the file: you may use CF H.3.3
- All profiles have the exact same vertical coordinates: you may use CF H.3.1
- Each profile has almost the same number of vertical coordinates and the values may be different: you may use CF H.3.2
- Each profile has a different number of vertical coordinates and you want to keep file size as small as possible:
- you have all the data already, and you want to optimize reading all the data for one profile: you may use H.3.4
- you want to write the data as it arrives, in any order: you may use H.3.5
- Time series (station) Profile data:
- You will put only one station in the file: you may use CF H.5.2
- Each station has the same number of profiles, and the same number of vertical levels for each profile : you may use CF H.5.1
- Each station has a different number of profiles and/or the level coordinates for each station may vary : you may use CF H.5.3
- Trajectory data:
- You will put only one trajectory in the file: you may use CF H.4.2
- All trajectories have the same or almost the same number of points: you may use CF H.4.1
- Each trajectory has different number of points and you want to keep file size as small as possible:
- you have all the data already, and you want to optimize reading all the data for one trajectory: you may use H.4.3
- you want to write the data as it arrives, in any order: you may use H.4.4
- Trajectory of Profiles:
- You will put only one trajectory in the file: you may use CF H.6.2
- All trajectories have the same or almost the same number of points: you may use CF H.6.1
- Each trajectory has different number of profiles and/or vertical coordinates:
you may use H.6.3
Step 4: Test a sample file
Write a test file with some sample data in it.
Test in ToolsUI 4.3 (webstart link)
- Go to Feature Types / Point Features Tab
- On the right is a combobox with the feature type choices. Select the correct type (or leave ANY_POINT to let the CDM figure it out).
- Enter the file name or navigate with the FileChooser
- If it works, the various tables will be filled in - click on a table row to navigate the feature heirarchy. The context menu (right-click) on the large table lets you show the data.
- This widget is pretty clumsy (sorry - we will improve as soon as the MacArthur money arrives). For point data you have to click the "Get ALL Data" button above the right navigation pane.
- This essentially tests the CDM ability to read your file. The CDM is a bit more lenient than CF in some ways, and a bit more restrictive in others (see below for details).
- The CF compliance checkers may also help diagnose issues.
Miscellaneous questions and advice
- Should I use the unlimited dimension? This can have a huge impact on performance for large files, because it affects the data layout on disk. The answer is: it depends.
- If you have lots of variables at each observation, and you want to optimize the case of getting one or a few variables at all the points, then don't use the unlimited dimension. This is called column oriented storage.
- If you want to optimize the case of getting all or most of the variables at each point, then use the unlimited dimension. This is called row oriented storage.
- For important, long-lived archives, you should test the performance of each case using the read access pattern that you want to optimize.
- If you don't know, then my prejudice is to use the unlimited dimension. For small datasets (<10 M ?) it is probably not that important.
- Should I use coordinate variables or auxiliary coordinate variables?
- A coordinate variable is 1D, and has the same name as its dimension, e.g. float time(time). The coordinate values must be monotonically increasing or decreasing. There can be no missing values. Use a coordinate variable if those conditions are true.
- An auxiliary coordinate variable may have missing values, and is not required to have monotonic, or even unique values. If that's the situation, you must use an auxiliary coordinate, e.g. float time(sample).
- What's the reason to include ids for things like trajectories or profiles?
- The "instance" ids allow software like the CDM to efficiently fetch just the data for a named feature, using the id.
- How big should I make my files? How should I divide the data between files?
- If you have the choice, a fewer number of large files is better than zillions of small files. I would shoot for files in the range 50M - 2 Gbytes.
- More important is to divide your files into distinct time ranges, called time partitioned files. This is a natural way to divide earth science data. It allows the CDM to serve many files as a single dataset using CDM feature collections. For time partitioned files, if possible, put the partitioning date in the filename.
- Why should I bother to do all this extra work?
- If you are publicly funded, you should make your data as accessible to others as possible. This is the minimum "extra work" your peers think is needed for them to be able to use your data. And they sincerely thank you!
Differences from CF
9.1 Limits on coordinate types
- CF: "In Table 9.1 the spatial coordinates x and y typically refer to longitude and latitude but other horizontal coordinates could also be used (see sections 4 and 5.6) "
- CDM: only latitude and longitude are supported.
- CDM: vertical coordinate may be height or pressure. Dimensionless Vertical Coordinates are not supported.
9.3 Limits on dimension ordering
- CF: "In the multidimensional array representations, data variables have both an instance dimension and an element dimension. The dimensions may be given in any order"
- CDM: the instance dimension must be the outer (slowest varying) dimension
9.4 Attribute featureType is required
- CF: "A global attribute, featureType, is required for all Discrete Geometry representations except the orthogonal multidimensional array representation, for which it is highly recommended".
- CDM: The global attribute featureType is always required. Acceptable aliases are CF:featureType and CF:feature_type .
9.5 Feature instance id variable is required
- CF: "Where feasible a variable with the attribute cf_role should be included. The only acceptable values of cf_role for Discrete Geometry CF data sets are timeseries_id, profile_id, and trajectory_id. The variable carrying the cf_role attribute may have any data type. When a variable is assigned this attribute, it must provide a unique identifier for each feature instance."
- CDM: A variable representing the instance id is required, indicated by an attribute named cf_role, which follows all the CF rules above.
Notes on representations
In all cases, latitude, longitude, altitude and time coordinates must be recognized in the usual CF way. The altitude coordinate is optional in some of the forms.
H.1 Point Data
In the CDM, point data is recognized by the featureType = "point" global attribute. The altitude coordinate is optional. All coordinates must have the same dimension, called the obs or sample dimension. All variables with the obs dimension as outer dimension are data variables.
H.2 Time Series Data
In the CDM, this form is recognized by the featureType = "timeSeries" global attribute. The altitude coordinate is optional.
Special station variables are recognized by standard names as given below. For backwards compatibility, the given aliases are allowed.
H.2.1 / H.2.2 Multidimensional Time Series Representation
The lat, lon and altitude coordinates must have the same dimension, called the station or instance dimension. All variables with the station dimension as outer dimension are station variables. The time dimension must be of the form time(time) or time(station, time), where the time dimension is the obs or sample dimension. All data variables must have the form data(station, time).
For compatibility with earlier versions
- ragged_row_count is an alias for sample_dimension standard name
- ragged_row_index is an alias for feature_dimension standard name
- all attributes can optionally be prefixed by "CF:"
H.2.3. Single time series, including deviations from a nominal fixed spatial location
The CDM uses the axis attribute to choose the correct coordinate. However, it provides no special handling for the precise coordinates.
H.2.4. Contiguous ragged array representation of time series
H.3.5. Indexed ragged array representation of profiles
Example only shows double time(profile) but double time(obs) is also possible, when the observation varies by time.
H.5.1. Multidimensional array representations of time series profiles
Specificaton says "The pressure(i,p,o), temperature(i,p,o), and humidity(i,p,o) data for element o of profile p at station i are associated with the coordinate values time(i,p), z(i,p,o), lat(i), and lon(i). Any of the three dimensions could be the netCDF unlimited dimension, if it might be useful to be able enlarge it."
Since CDM currently only allows dimensions to be in the order (station, profile, z), then only the station dimension could be unlimited in the multidimensional representation.
This document is maintained by John Caron and was last updated April 2011