|
|
|||
|
||||
John Caron
last changed: June 21, 2005
also see: Unidata Observation Dataset Conventions
In the NetCDF version 3 data model, record variables are ones that use the unlimited dimension (also known as the record dimension). These are laid out on disk differently than non-record variables. All of the data for a non-record variable is stored contiguously on disk. Record variables, in contrast, are divided up into records, and each record contains the data for all record variables for the ith record dimension index. You can append an unlimited number of records to a NetCDF file.
A Structure is a NetCDF Variable that contains other Variables, like a struct in C. All of the data in a Structure are stored together on disk, so it is efficient to read all the data in a Structure in a single read. Structures are a new part of NetCDF, introduced in the NetCDF-Java library, version 2.2 as part of the Common Data Model (CDM), and also implemented in the NetCDF-4 library.
Since data for Record variables is stored together on disk, this satisfies the definition of a Structure. NetCDF-3 files that use the unlimited dimension therefore can be thought of as having a Structure named record, containing the record variables. For example, the file
netcdf example {
dimensions:
time = UNLIMITED;
dim = 42;
variables:
float non-record_var1;
float non-record_var2( dim);
float record_var1(time);
float record_var2( time, dim);
}
In the enhanced CDL used by the Common Data Model, this would be:
netcdf example {
dimensions:
time = UNLIMITED;
dim = 42;
variables:
float non-record_var1;
float non-record_var2( dim);
Structure {
float record_var1;
float record_var2( dim);
} record( time);
}
Our motivation is to efficiently access all of the data in the record variables for some record, i.e. use a single read to fetch the ith record. This would be useful if your common data access pattern is "structure-oriented", eg if you iterate over time, and read all the data for all Variables for each time step before moving on to the next time step. If the common access pattern is to read all the data for one Variable over all times, then it will be more efficient to store the Variable data contiguously, i.e. make them non-record variables. Access times for these two patterns can differ by an order of magnitude or more for large files.
Another reason to use record variables is to take advantage of the ability to append an unlimited number of records. This can allow you to write out realtime data as it arrives, without knowing how many records you will need.
Note that in all these examples, you can use the record dimension or not. If writing both ways is possible, then you should determine (or guess) the common read access patterns to decide whether to use the record dimension or not.
With the NetCDF-Java 2.2 library, if your data has an unlimited dimension you can create the record structure by calling NetcdfFile.addRecordStructure() or NetcdfDataset.addRecordStructure(). This will add a Structure variable named "record" to the file; then you can use Structure.readStructure() to efficiently read all the data in each record with one call.
For Point data, we store measurements taken at various "point" locations, where the locations differ for each measurement, and are not connected to each other. For this simple case, we might have:
netcdf pointData {
dimensions:
record = UNLIMITED;
dim1 = 11;
dim2 = 4;
variables:
float latitude(record);
float longitude(record);
int elevation(record);
int time_observation(record);
float obs_data1(record) ;
float obs_data2(record, dim1);
int obs_data3(record);
int obs_data4(record, dim2);
String obs_data5(record);
...
}
In this case the measurements occur at named locations, called stations. Typically there are a number of measurements taken at each station periodically. The file then contains a collection of stations, and each station has a time series of observations.
Using nested CDM Structures, we could model this data in the following way:
netcdf stationData {
dimensions:
station = 137;
variables:
Structure {
char id(8);
char description(40);
float latitude;
float longitude;
int elevation;
int numReports;
Structure {
int time_observation;
float obs_data1 ;
float obs_data2(11);
int obs_data3;
int obs_data4(4);
String obs_data5;
...
} stationObs(*);
} station( station);
}
This describes an array of station Structures, each of which has an id, description, latitude, longitude, and elevation, as well as a nested, variable length array of stationObs Structures. The (*) means that each station can have a different length array of stationObs. (Note that in the CDM, we can use unnamed dimensions for some of the dimensions, while in netCDF-3 we have to declare all dimensions as shared.)
In NetCDF-3, we don't have the ability to store nested Structures, and we only have one real Structure to use, the record Structure. What we can do is to make the station data into a pseudo-Structure, which is collection of variables which all have the same outer dimension (it's not a real Structure because the variables are not stored contiguously). We then can use the record Structure for the station observation data, and connect them to the station in several ways.
1) We can use a linked list of record numbers:
netcdf stationData {
dimensions:
station = 137;
record = UNLIMITED;
id_len = 8;
desc_len = 40;
od2_len = 11;
od4_len = 4;
variables:
char id(station, id_len);
char description(station, desc_len);
float latitude(station);
float longitude(station);
int elevation(station);
int firstStationObs(station);
int numReports(station);
int nextStationObs(record);
int stationIndex(record);
int time_observation(record);
float obs_data1(record) ;
float obs_data2(record, od2_len);
int obs_data3(record);
int obs_data4(record, od4_len);
String obs_data5(record);
...
}
Notice that the station data variables all have the dimension station, and the stationObs data are all record variables . The firstStationObs and nextStationObs variables create a linked list of stationObs for each station. The stationIndex variable, while not strictly needed, lets you find the station from a stationObs data record. This linked list makes it easy to get all the stationObs for one station (but note that there's no efficient subsetting of that list, so its not really a variable length array). If you are writing data as it arrives, it will be easier to keep track of a backwards list, e.g. use variables lastStationObs and prevStationObs.
The advantage of the linked list is that you can have a variable number of stationObs for each station, and you don't waste any space. Its also ideal for writing files as the data arrives in random order.
2) If you have complete control over how the data is written, then another option is to store variable length data in one array in a contiguous list. In our example, then, all the observations between firstStationObs(i) and firstStationObs(i) + numReports(i) - 1 would belong to the ith ship, and we dont need the nextStationObs variable. This also allows a variable number of stationObs for each station with no wasted space, plus the contiguity of the observation records gives very efficient access for the common case of reading all the observations for one station. However you lose the ability to write the data in random order.
3) If there is a fixed number of stationObs for each station, then a good file layout is:
netcdf stationData {
dimensions:
station = UNLIMITED;
nobs = 24;
id_len = 8;
desc_len = 40;
od2_len = 11;
od4_len = 4;
variables:
char id(station, id_len);
char description(station, desc_len);
float latitude(station);
float longitude(station);
int elevation(station);
int time_observation(station, nobs);
float obs_data1(station, nobs) ;
float obs_data2(station, nobs, od2_len);
int obs_data3(station, nobs);
int obs_data4(station, nobs, od4_len);
String obs_data5(station, nobs);
...
}
Here we use the station dimension as the unlimited dimension, in order to group all the data for one station together on disk. This simple multidimensional structure may be the best solution when the number of observations for each station is constant.
Trajectory data looks just like point data, except the points are assumed to be connected. The case where there is a single trajectory in the file therefore looks just like the Point Data case.
If you want to store multiple trajectories in the same file, however, then the file looks like station data, since you have to distinguish which trajectory the record belongs to. Again, you can use linked lists, contiguous lists, or multidimensional structures. Here is an example for the linked list case:
netcdf trajectoryData {
dimensions:
trajectory = 11;
record = UNLIMITED;
variables:
int trajectory(trajectory); // some kind of id
int firstObs(trajectory);
int numObs(trajectory);
int nextObs(record);
int trajectoryIndex(record);
int time_observation(record);
float latitude(record);
float longitude(record);
int depth(record);
float obs_data1(record);
int obs_data2(record);
int obs_data3(record);
...
}
Contiguous lists look just like linked lists except that you dont need the nextObs variable to store the link, and of course, you have to store the observations contiguously.
There are other data measurements where you might need another level of nested structures. For example, a collection of ship trajectories, with variable length sounding data at points along the trajectory:
netcdf soundingData {
dimensions:
ship = 137;
variables:
Structure {
char id(8);
char description(40);
Structure {
float latitude;
float longitude;
int time_observation;
float obs_data1 ;
float obs_data2(11);
...
Structure {
int depth;
int obs_data3;
int obs_data4(4);
String obs_data5;
...
} observation(*)
} sounding(*);
} ship( ship);
}
So this file has a number of ship tracks (a kind of trajectory), each containing a variable number of soundings, and each sounding consists of a variable number of observations, which is a collection of measurements all taken at the same location.
If there are always the same number of observations for each sounding (or there is a maximum number and you don't mind wasting some space), you might use the record structure for the soundings, and create a contiguous list to connect them to the ship:
netcdf soundingData {
dimensions:
ship = 137;
observation = 24;
record = UNLIMITED;
variables:
char ship(ship, id_len);
char description(ship, desc_len);
int firstSounding(ship);
int numSoundings(ship);
int shipIndex( record);
float latitude(record);
float longitude(record);
float obs_data1(record) ;
float obs_data2(record, extra_dim1);
float depth(record, observation);
int time(record);
int obs_data3(record, observation);
int obs_data4(record, observation, extra_dim2);
String obs_data5(record, observation);
...
}
This puts all the data for one sounding (including the observation data) in a single record. For full generality, we assume that the depths of the soundings vary; if they were always the same, you might want to factor them out into a variable like float depth( observation).
For a variable number of observations per sounding, you could use the record structure for the observations:
netcdf soundingData {
dimensions:
ship = 137;
sounding = 4700;
record = UNLIMITED;
variables:
char ship(ship, id_len);
char description(ship, desc_len);
int firstSounding(ship);
int numSoundings(ship);
int shipIndex(sounding);
float latitude(sounding);
float longitude(sounding);
float obs_data1(sounding) ;
float obs_data2(sounding, extra_dim1);
int firstObservation(sounding);
int numObservations(sounding);
...
int soundingIndex( record);
int time(record);
float depth(record);
int obs_data3(record);
int obs_data4(record, extra_dim2);
...
}
So we have a contiguous list of soundings for each ship, and a contiguous list of observations for each sounding. We are not wasting any space on the observations, but we may be wasting space on allocating more sounding structures than we need, if we dont know up front how many there are. Since these are contiguous lists, all of the observation data for one sounding are stored together on disk. If we use linked lists, the data for one sounding could be scattered across the file. Generally, contiguity gives more efficient access for the common case of reading all the data in a Structure.
If there is only one ship trajectory stored in the file, things get a bit simpler:
netcdf soundingData {
dimensions:
sounding = 1200;
record = UNLIMITED;
variables:
char ship_name(id_len);
char description(desc_len);
float latitude(sounding);
float longitude(sounding);
int time_observation(sounding);
float obs_data1(sounding) ;
float obs_data2(sounding, extra_dim1);
int firstObservation(sounding);
int numObservations(sounding);
...
int soundingIndex( record);
int time(record);
float depth( record);
int obs_data3(record);
int obs_data4(record, extra_dim2);
...
}
The use of record variables has been demonstrated with various examples for point, station, trajectory, and sounding data. We can use the generality of the CDM to create idealized representations of complicated data structures. It then becomes easier to see how to map those data structures back into ones that are representable in NetCDF-3 files. Grouping variables into Structures and pseudo-Structures is a useful conceptual tool.
There are a number of tradeoffs when deciding on a NetCDF-3 file structure, depending upon how much variable length data you have, as well as expected data access patterns. Linked lists are best for writing data arriving in random order, contiguous lists give more efficient read access for ordered data, and multidimensional structures work well when there are a fixed number of records. In optimizing, you must take into account the file layouts of record vs. non-record variables.
Its not easy to infer the type of data (e.g. station, trajectory, etc) that a file contains by examining its data structures. We have seen some examples here where different data types might use identical data structures. We recommend explicitly describing the file's data type, especially its connectivity and its coordinate system(s), in the file metadata and in a human-readable Conventions document. See the Unidata Observation Dataset Conventions for detailed recommendations.
| Contact Us Site Map Search Terms and Conditions Privacy Policy Participation Policy | ||||||
|
||||||