Multiple NetCDF files can be aggregated into a single, logical NetCDF dataset using the NcML aggregation element. There are several types of aggregation; this section covers union, joinExisting, and joinNew.
See also: Annotated NcML Schema
The following NcML constructs a dataset by creating the union of two netCDF files (note there is no location attribute in the outer netcdf element).
<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
<aggregation type="union">
<netcdf location="C:/test/path/rh.nc"/>
<netcdf location="C:/test/path/temp.nc"/>
</aggregation>
</netcdf>
Supposing that the nested files look like:
netcdf rh.nc {
dimensions:
time = 12;
lat = 64;
lon = 128;
variables:
float rh(time, lat, lon);
}
netcdf temp.nc {
dimensions:
time = 12;
lat = 64;
lon = 128;
variables:
float Temperature(time, lat, lon);
}
Then the aggregation dataset looks like:
netcdf TestUnion.ncml {
dimensions:
time = 12;
lat = 64;
lon = 128;
variables:
float rh(time, lat, lon);
float Temperature(time, lat, lon);
}
The rh variable will be read from C:/test/path/rh.nc, while the Temperature variable will be read from C:/test/path/temp.nc. You can also specify a referenced dataset relative to the working directory or relative to the NcML file; see Dataset URLs.
A Union dataset is constructed by transferring objects (dimensions, attributes, groups, and variables) from the nested datasets in the order the nested datasets are listed. If an object with the same name already exists, it is skipped. You need to pay close attention to dimensions and coordinate variables, which must match exactly across nested files.
Example file : aggUnionSimple.ncml
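The "first one wins" rule described above can be illustrated with a small sketch. This is not the library's implementation; the class and method names are hypothetical, and the model only tracks which file supplies each named object, using the dimensions and variables from the rh.nc and temp.nc example.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: models only the naming rule of a union aggregation.
class UnionSketch {

    /** Returns object name -> file that supplies it, visiting files in listing order. */
    static Map<String, String> union(Map<String, List<String>> objectsByFile) {
        Map<String, String> source = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> file : objectsByFile.entrySet()) {
            for (String name : file.getValue()) {
                // putIfAbsent keeps the first file that defined the name and skips later ones
                source.putIfAbsent(name, file.getKey());
            }
        }
        return source;
    }

    /** Builds the two-file example used in this section. */
    static Map<String, String> example() {
        Map<String, List<String>> files = new LinkedHashMap<>();
        files.put("rh.nc", Arrays.asList("time", "lat", "lon", "rh"));
        files.put("temp.nc", Arrays.asList("time", "lat", "lon", "Temperature"));
        return union(files);
    }

    public static void main(String[] args) {
        System.out.println(example());
    }
}
```

In this model the shared dimensions time, lat, and lon are attributed to rh.nc because it is listed first, and only Temperature comes from temp.nc. The sketch does not check that the shared dimensions actually match, which the real files must do.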
The following NcML constructs a dataset by declaring "time" an aggregation dimension. Any variable that has that dimension as its outer dimension is an aggregation variable. The dimension must be the outer (slowest varying) dimension. There must be an existing coordinate variable named time.
<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
<aggregation dimName="time" type="joinExisting">
<netcdf location="file:/test/temperature/jan.nc" />
<netcdf location="file:/test/temperature/feb.nc" />
</aggregation>
</netcdf>
The variable T is an aggregation variable: its first data values are taken from jan.nc and the following values from feb.nc. To the client, the dataset CDL looks like this:
netcdf aggExisting.ncml {
dimensions:
lat = 64;
lon = 128;
time = 59;
variables:
float T(time, lat, lon);
}
When the library opens the above NcML dataset, it has to read through all nested datasets in order to find out the length of the time dimension. For large aggregations, this can be slow. In the example below, we have added the optional ncoords attribute to the nested datasets. In this case, only one dataset has to be opened immediately; the others are opened as needed for a data read request.
<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
<aggregation dimName="time" type="joinExisting">
<netcdf location="file:/test/temperature/jan.nc" ncoords="31"/>
<netcdf location="file:/test/temperature/feb.nc" ncoords="28"/>
</aggregation>
</netcdf>
A JoinExisting dataset is constructed by transferring objects (dimensions, attributes, groups, and variables) from the nested datasets in the order the nested datasets are listed. All variables that use the aggregation dimension are logically concatenated, in the order of the nested datasets. Variables that don't use the aggregation dimension are treated as in a Union dataset, i.e. skipped if one with that name already exists.
Example file : aggExisting.ncml
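To see why knowing ncoords is enough to defer opening files, consider how an index along the aggregation dimension maps to a nested dataset. The following is a minimal sketch of that bookkeeping under the jan.nc/feb.nc example; the class and method names are hypothetical and this is not the library's code.

```java
// Hypothetical sketch: index bookkeeping for a joinExisting aggregation.
class JoinExistingSketch {

    /** Length of the aggregated dimension: the sum of the per-file ncoords. */
    static int totalLength(int[] ncoords) {
        int total = 0;
        for (int n : ncoords) {
            total += n;
        }
        return total;
    }

    /**
     * Finds the nested dataset that holds a given aggregated index.
     * Returns {datasetIndex, indexWithinThatDataset}.
     */
    static int[] locate(int[] ncoords, int aggIndex) {
        if (aggIndex < 0) {
            throw new IndexOutOfBoundsException("negative index " + aggIndex);
        }
        int start = 0; // aggregated index of the first step in the current dataset
        for (int i = 0; i < ncoords.length; i++) {
            if (aggIndex < start + ncoords[i]) {
                return new int[] {i, aggIndex - start};
            }
            start += ncoords[i];
        }
        throw new IndexOutOfBoundsException("index " + aggIndex + " is outside the aggregation");
    }

    public static void main(String[] args) {
        int[] ncoords = {31, 28}; // jan.nc, feb.nc
        int[] where = locate(ncoords, 40);
        System.out.println("length=" + totalLength(ncoords)
                + " dataset=" + where[0] + " local=" + where[1]);
    }
}
```

With ncoords of 31 and 28 the aggregated time dimension has length 59, and aggregated index 40 falls in the second dataset (feb.nc) at local index 9, so a request for that step only needs feb.nc.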
Typically the coordinates for a JoinExisting aggregation are taken from the existing coordinate variables, as in the above example. If the coordinate variable is missing, you must define it in the NcML:
<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
<variable name="time" shape="time" type="int">
<attribute name="units" value="days since 2000-01-01"/>
<attribute name="_CoordinateAxisType" value="Time" />
<values start="0" increment="1" />
</variable>
<aggregation dimName="time" type="joinExisting">
<netcdf location="file:/test/temperature/jan.nc" ncoords="31"/>
<netcdf location="file:/test/temperature/feb.nc" ncoords="28"/>
</aggregation>
</netcdf>
These are the ways that coordinate values are assigned to a JoinExisting coordinate:
<aggregation dimName="time" type="joinExisting">
<netcdf location="file:/test/temperature/janAvgWeek.nc" coordValue="1038 7823 12983 43400"/>
<netcdf location="file:/test/temperature/febAvgWeek.nc" coordValue="66234 89237 108736 123494"/>
</aggregation>
The previous example "joined" variables along their existing outer dimension. Another common case is to aggregate variables by creating a new outer dimension. Each existing variable becomes one "slice" of the compound variable (a slice holds the index of one dimension constant, e.g. humidity(3, *, *, *)). The following NcML joins variables from three separate files into a single variable, by creating a new dimension of length 3:
<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
<aggregation dimName="time" type="joinNew">
<variableAgg name="T"/>
<variableAgg name="RH"/>
<netcdf location="file:/test/data/ncml/nc/time0.nc" coordValue="0"/>
<netcdf location="file:/test/data/ncml/nc/time1.nc" coordValue="1"/>
<netcdf location="file:/test/data/ncml/nc/time2.nc" coordValue="2"/>
</aggregation>
</netcdf>
This will create the following dataset:
netcdf aggSynthetic.xml {
dimensions:
time = 3; // (has coord.var)
lat = 3; // (has coord.var)
lon = 4; // (has coord.var)
variables:
double time(time);
float lat(lat);
:units = "degrees_north";
float lon(lon);
:units = "degrees_east";
double T(time, lat, lon);
:long_name = "surface temperature";
:units = "degC";
:title = "Example Data - Type 1 aggregation";
}
A JoinNew dataset is constructed by transferring objects (dimensions, attributes, groups, and variables) from the nested datasets in the order the nested datasets are listed. All variables that are listed as aggregation variables are logically concatenated along a new dimension, in the order of the nested datasets. A coordinate variable is created for the new dimension. Non-aggregation variables are treated as in a Union dataset, i.e. skipped if one with that name already exists.
Example file : aggSynthetic.ncml
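The idea that each file contributes one slice along the new outer dimension can be sketched with plain arrays. This is only an illustration with hypothetical names: the library reads data lazily from the nested files and does not hold it in arrays like this.

```java
import java.util.Arrays;

// Hypothetical sketch of joinNew stacking; not the library's implementation.
class JoinNewSketch {

    /** Stacks one (lat, lon) slice per file into a (time, lat, lon) array, in listing order. */
    static double[][][] stack(double[][]... slices) {
        double[][][] joined = new double[slices.length][][];
        for (int t = 0; t < slices.length; t++) {
            joined[t] = slices[t]; // index t on the new outer dimension selects file t
        }
        return joined;
    }

    /** Builds a lat x lon slice filled with one value, standing in for T in a single file. */
    static double[][] slice(int nlat, int nlon, double fill) {
        double[][] data = new double[nlat][nlon];
        for (double[] row : data) {
            Arrays.fill(row, fill);
        }
        return data;
    }

    public static void main(String[] args) {
        // Three files, each holding T(lat=3, lon=4), as in the example above.
        double[][][] t = stack(slice(3, 4, 10.0), slice(3, 4, 11.0), slice(3, 4, 12.0));
        System.out.println(t.length + " x " + t[0].length + " x " + t[0][0].length);
    }
}
```

The resulting shape is 3 x 3 x 4, matching T(time, lat, lon) with time = 3 in the CDL above. The fill values 10.0, 11.0, and 12.0 are made up for the illustration.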
A JoinNew aggregation has to create a new coordinate variable. In the above example, one was automatically created with type double, to match the coordValue attributes specified on the netcdf elements. However, it has no units or other attributes. To specify attributes on the coordinate variable, you can use the following:
<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
<variable name="time"> <!-- (1) -->
<attribute name="units" value="months since 2000-6-16 6:00"/>
<attribute name="_CoordinateAxisType" value="Time" />
<values>0 1 2</values>
</variable>
<aggregation dimName="time" type="joinNew"> <!-- (2) -->
<variableAgg name="T"/>
<netcdf location="file:/test/data/ncml/nc/time0.nc" />
<netcdf location="file:/test/data/ncml/nc/time1.nc" />
<netcdf location="file:/test/data/ncml/nc/time2.nc" />
</aggregation>
</netcdf>
It's not obvious from the NcML, but the aggregation element (2) is processed first, so that all of the objects of the aggregated datasets are available to be modified by other NcML elements, for example by (1).
There are 4 ways that coordinate values are assigned to a JoinNew coordinate:
Note that you must explicitly specify the coordinate variable in order to assign attributes to it. You are likely to need this; for example, defining a units attribute is usually necessary. Assigning the _CoordinateAxisType attribute is a recommended way to make sure that the nj22 Coordinate layer correctly identifies the coordinate type.
Also note that, contrary to previous versions of NcML aggregation, you do not need to define a dimension element for the aggregation dimension (e.g. <dimension name="time">). You must not use the old form <dimension name="time" length="0" />, as it will override the dimension created by the aggregation.
For all aggregations, the aggregation element is processed first, so that the objects (dimensions, attributes, groups, and variables) from the nested datasets exist and can be modified by other NcML elements.
It's often convenient to indicate that all the files in some directory should be aggregated. This can be done for any aggregation type. The following example scans all of the files in the directory /data/model (and its subdirectories) which end in ".nc". By default, the files are ordered by sorting on the filename.
<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
<aggregation dimName="time" type="joinExisting">
<scan location="/data/model/" suffix=".nc" />
</aggregation>
</netcdf>
When opening a joinExisting aggregation that uses a scan element, each matching file must be opened in order to determine its size. This can be slow if there are a large number of files. If you specify the files individually instead, you can add the ncoords attribute for speed. In the THREDDS Data Server, the information is cached, so that subsequent requests do not need to open each file until data is requested. However, see the section on caching.
A joinNew type aggregation does not incur this expense, since there is always exactly one step per file:
<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
<aggregation dimName="time" type="joinNew">
<variableAgg name="T"/>
<scan location="/data/goes/" suffix=".gini" />
</aggregation>
</netcdf>
The problem here is how to assign coordinate values to each step. If you do nothing, a String-valued coordinate variable will be defined, whose values are the filenames. If you know the number of files, you can specify the coordinate variable yourself and assign it values:
<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
<dimension name="time" length="48"/>
<variable name="time" type="int" shape="time">
<attribute name="units" value="hours since 2000-01-01 00:00"/>
<attribute name="_CoordinateAxisType" value="Time" />
<values start="0" increment="1" npts="48" />
</variable>
<aggregation dimName="time" type="joinNew">
<variableAgg name="T"/>
<scan location="/data/goes/" suffix=".gini" />
</aggregation>
</netcdf>
You can also explicitly list the values:
<values>12.0 25.2 37.9 77.12</values>
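The start, increment, and npts attributes of the values element used above describe an evenly spaced sequence. As a minimal sketch of that arithmetic (the class and method names are hypothetical and not part of the netCDF-Java library):

```java
import java.util.Arrays;

// Hypothetical helper: mirrors the arithmetic of <values start=".." increment=".." npts=".."/>.
class EvenlySpacedValues {

    /** Returns npts values, where value i equals start + i * increment. */
    static int[] generate(int start, int increment, int npts) {
        int[] values = new int[npts];
        for (int i = 0; i < npts; i++) {
            values[i] = start + i * increment;
        }
        return values;
    }

    public static void main(String[] args) {
        // Same numbers as the NcML example: 48 hourly steps starting at hour 0.
        int[] hours = generate(0, 1, 48);
        System.out.println(hours.length + " values, first " + hours[0] + ", last " + hours[47]);
        // A made-up 6-hourly sequence for comparison.
        System.out.println(Arrays.toString(generate(0, 6, 4)));
    }
}
```

For the example above this gives the 48 values 0 through 47, interpreted as hours since 2000-01-01 00:00. The number of values has to agree with the number of files found by the scan.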
For the common case that you want to assign a date/time coordinate by parsing the filename, you can use the dateFormatMark attribute:
<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
<aggregation dimName="time" type="joinNew">
<variableAgg name="T"/>
<scan location="/data/goes/" suffix=".gini" dateFormatMark="SUPER-NATIONAL_1km_SFC-T_#yyyyMMdd_HHmm" />
</aggregation>
</netcdf>
The dateFormatMark attribute is used on a joinNew aggregation (or a joinExisting aggregation if there is only one time slice in each file) to create date coordinate values from the filename. It consists of a section of text, a '#' marking character, then a java.text.SimpleDateFormat pattern string. As many characters as precede the '#' are skipped in the filename, the next part of the filename must match the SimpleDateFormat pattern, and any trailing text is ignored. For example:
Filename: SUPER-NATIONAL_1km_SFC-T_20051206_2300.gini
DateFormatMark: SUPER-NATIONAL_1km_SFC-T_#yyyyMMdd_HHmm
The net effect is to add a coordinate variable, whose values are ISO 8601 formatted date/time Strings, with a _CoordinateAxisType of "Time", for example:
<variable name="time" type="String" shape="time">
<attribute name="_CoordinateAxisType" value="Time" />
<values>2005-11-28T21:00:00Z 2005-11-28T21:15:00Z 2005-11-28T21:30:00Z</values>
</variable>
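The dateFormatMark rule (skip the characters before '#', parse the date, ignore the rest) can be sketched with java.text.SimpleDateFormat directly. This is a minimal illustration with hypothetical class and method names, not the library's code, and it assumes that the times in the filename are UTC.

```java
import java.text.ParsePosition;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical sketch of dateFormatMark handling; not the library's implementation.
class DateFormatMarkSketch {

    /** Extracts the date from a filename and returns it as an ISO 8601 string. */
    static String toIsoDate(String filename, String dateFormatMark) {
        int mark = dateFormatMark.indexOf('#');
        if (mark < 0) {
            throw new IllegalArgumentException("dateFormatMark must contain '#'");
        }
        TimeZone utc = TimeZone.getTimeZone("UTC"); // assumption: filenames encode UTC times

        String pattern = dateFormatMark.substring(mark + 1);
        SimpleDateFormat parser = new SimpleDateFormat(pattern, Locale.ROOT);
        parser.setTimeZone(utc);
        // Start parsing after the skipped characters; text after the date is left unread.
        Date date = parser.parse(filename, new ParsePosition(mark));
        if (date == null) {
            throw new IllegalArgumentException("filename does not match: " + filename);
        }

        SimpleDateFormat iso = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'", Locale.ROOT);
        iso.setTimeZone(utc);
        return iso.format(date);
    }

    public static void main(String[] args) {
        System.out.println(toIsoDate(
                "SUPER-NATIONAL_1km_SFC-T_20051206_2300.gini",
                "SUPER-NATIONAL_1km_SFC-T_#yyyyMMdd_HHmm"));
    }
}
```

For the filename and dateFormatMark shown earlier, this prints 2005-12-06T23:00:00Z. Note that only the number of characters before the '#' matters in this sketch; the prefix text itself is not compared against the filename.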
The scan element allows you to specify that all of the files in a directory (and its subdirectories, with an optional suffix filter) are included in the aggregation. The files are sorted alphabetically on the filename, unless you specify a dateFormatMark attribute, in which case they are sorted by the Date derived from the filename, which is also used for the coordinate values.
When you use a scan element to define a collection of files, the case where the set of files may change as files are added or deleted requires special attention. In that case, you indicate how often the directories should be rescanned using the recheckEvery attribute:
<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
<aggregation dimName="time" type="joinNew" recheckEvery="15 min" >
<variableAgg name="T"/>
<scan location="/data/goes/" suffix=".gini" />
</aggregation>
</netcdf>
The value of recheckEvery must be a udunits time duration, i.e. a number followed by a time unit such as sec, min, hour, or day. If you do not specify a recheckEvery attribute, the collection is assumed to be non-changing.
When using the scan element on directories whose contents may change, you must use a recheckEvery attribute. It effectively specifies the maximum time before changes will be detected by a newly opened NcML dataset. An NcML dataset that is already open will not notice the changes, and you can get a FileNotFoundException if component files are deleted.
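One way to picture the recheckEvery rule is as a simple elapsed-time check made when a dataset is opened. The sketch below is a hypothetical model of that decision, with invented names and timestamps; it is not how the library is implemented, and it uses java.time.Duration rather than parsing a udunits string.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical model of the recheckEvery rule; not the library's implementation.
class RecheckSketch {

    /**
     * Decides whether opening the dataset at 'now' should rescan the directory.
     * A null recheckEvery models a missing attribute: the collection is treated as static.
     */
    static boolean needsRescan(Instant lastScan, Duration recheckEvery, Instant now) {
        if (recheckEvery == null) {
            return false;
        }
        Instant due = lastScan.plus(recheckEvery);
        return !now.isBefore(due); // true once recheckEvery has fully elapsed
    }

    public static void main(String[] args) {
        Instant lastScan = Instant.parse("2008-06-16T12:00:00Z");
        Duration every = Duration.ofMinutes(15); // corresponds to recheckEvery="15 min"
        System.out.println(needsRescan(lastScan, every, lastScan.plusSeconds(10 * 60)));
        System.out.println(needsRescan(lastScan, every, lastScan.plusSeconds(20 * 60)));
    }
}
```

In this model a dataset opened 10 minutes after the last scan reuses the old file list, while one opened 20 minutes after it triggers a rescan, which is why recheckEvery acts as an upper bound on how stale a newly opened dataset can be.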
For large collections of files, one wants to avoid opening every single file each time the dataset is accessed. Instead we only want to open the files that are actually needed to fulfill a data request. Generally this is straightforward, except for discovering the number and values of the aggregation coordinate variable for type joinExisting. This is because we have to know the size of the aggregation dimension when we open the dataset, even before we read any data. For practical purposes, we often need to know the coordinate values immediately as well.
To help solve this problem, you should enable Aggregation Caching in your application by telling the ucar.nc2.ncml.Aggregation class where it can cache information. This is done by calling a static method (see the javadoc for more details):
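import ucar.nc2.ncml.Aggregation;
import ucar.nc2.util.DiskCache2;
// Enable Aggregation caching. Every hour (60 minutes), delete entries older than 30 days (60 * 24 * 30 minutes)
Aggregation.setPersistenceCache(new DiskCache2("/.nj22/cachePersist", true, 60 * 24 * 30, 60));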
When this is enabled, joinExisting aggregations will save information to special XML files in the specified directory, in order to avoid opening every file to obtain its coordinate values, each time the dataset is opened. Instead, the first time it is opened, the values are read, then subsequent opens will use the cached values.
If using a scan element on changing directories, be sure to specify the recheckEvery attribute to make sure that the cached information gets updated.
Aggregations can be nested by placing an aggregation inside a nested netcdf element, for example:
<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
<aggregation dimName="time" type="joinExisting">
<netcdf>
<aggregation type="union">
<netcdf location="file:C:/test/path/temperature_20080101.nc" />
<netcdf location="file:C:/test/path/salinity_20080101.nc" />
</aggregation>
</netcdf>
<netcdf>
<aggregation type="union">
<netcdf location="file:C:/test/path/temperature_20080102.nc" />
<netcdf location="file:C:/test/path/salinity_20080102.nc" />
</aggregation>
</netcdf>
</aggregation>
</netcdf>
Next: FMRC Aggregation
This document is maintained by John Caron and was last updated on June 16, 2008