Configuring TDS with the featureCollection element


Overview

The featureCollection element tells the TDS to serve collections of CDM Feature Datasets. Currently it is used for gridded and point datasets whose time and spatial coordinates are recognized by the CDM software stack. This allows the TDS to automatically create logical datasets composed of collections of files, and to allow subsetting in coordinate space on them, e.g. through the WMS, WCS, and NetCDF Subset Service.

Feature Collections have been undergoing continual development and refinement in recent versions of the TDS, and as you upgrade there are (mostly) minor changes to configuration and usage. The featureCollection element was first introduced in TDS 4.2, replacing the fmrcDataset element of earlier versions. TDS 4.2 allowed featureType = FMRC, Point, and Station. TDS 4.3 added featureType = GRIB, used for collections of GRIB files. TDS 4.5 changed this usage to featureType = GRIB1 or GRIB2. Currently, one should only serve GRIB files with featureType=GRIB1 or GRIB2. One should not use FMRC or NcML Aggregations on GRIB files.

A fair amount of the complexity of feature collections is managing the collection of files on the server, both in creating indexes for performance, and in managing collections that change. For high-performance servers, it is necessary to let a background process manage indexing, and the THREDDS Data Manager (TDM) is now available for that purpose.

This document gives an overview of Feature Collections, as well as a complete syntax of allowed elements.


Example catalog elements

The featureCollection element is a subtype of the dataset element. It defines a logical dataset for the TDS. All of the elements that can be used inside a dataset element can also be used inside a featureCollection element.

Example 1: Simple case using defaults:

1) <featureCollection name="NCEP Polar Stereographic" featureType="GRIB2" path="grib/NCEP/NAM/Polar_90km">
2)   <collection name="NCEP-NAM-Polar_90km" spec="/data/ldm/pub/native/grid/NCEP/NAM/Polar_90km/NAM_Polar_90km_.*\.grib2$"/>
   </featureCollection>

  1. A GRIB2 Feature Collection dataset is defined, with the "human readable" name of "NCEP Polar Stereographic". Its URL path(s) will look like http://server/thredds/<service>/grib/NCEP/NAM/Polar_90km/... The Dataset ID is automatically set to the path, so that its dataset page will be http://server/thredds/catalog/grib/NCEP/NAM/Polar_90km/catalog.xml?dataset=grib/NCEP/NAM/Polar_90km/...
  2. Defines the files in the collection as any files in the directory /data/ldm/pub/native/grid/NCEP/NAM/Polar_90km/ that match the regular expression "NAM_Polar_90km_.*\.grib2$". In this case, that means any filename starting with "NAM_Polar_90km_" and ending with ".grib2". The collection name is "NCEP-NAM-Polar_90km", which is used for index file names and similar purposes.
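
As a rough illustration of how a spec like the one in Example 1 selects files, the following sketch splits the spec at its last "/" into a directory and a filename regular expression, then tests a few file names. The split rule, the helper names split_spec and in_collection, and the sample file names are assumptions made for illustration. The TDS itself is written in Java; for a simple pattern like this one, Python's regular expressions should behave the same way.

```python
import re

def split_spec(spec):
    """Split a collection spec into (directory, filename regex) at the last '/'."""
    directory, _, pattern = spec.rpartition("/")
    return directory + "/", re.compile(pattern)

spec = r"/data/ldm/pub/native/grid/NCEP/NAM/Polar_90km/NAM_Polar_90km_.*\.grib2$"
directory, regex = split_spec(spec)

def in_collection(filename):
    # The whole file name has to match the regular expression.
    return regex.fullmatch(filename) is not None

# Hypothetical file names, used only to exercise the pattern.
for name in ["NAM_Polar_90km_20111226_1200.grib2",
             "NAM_Polar_90km_20111226_1200.grib2.gbx9",
             "GFS_Polar_90km_20111226_1200.grib2"]:
    print(name, in_collection(name))
```

Only the first of the three sample names is selected: the second does not end with ".grib2" and the third does not start with "NAM_Polar_90km_".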

Example 2: More fully specify the options explicitly:

<featureCollection name="NCEP NAM Alaska(11km)" featureType="GRIB2" path="grib/NCEP/NAM/Alaska_11km">
   <metadata inherited="true">
1)   <serviceName>GribServices</serviceName>
2)   <documentation type="summary">NCEP GFS Model : AWIPS 230 (G) Grid. Global Lat/Lon grid</documentation>
   </metadata>
3) <collection spec="/data/ldm/pub/native/grid/NCEP/NAM/Alaska_11km/.*grib2$"
               name="NAM_Alaska_11km"
4)             dateFormatMark="#NAM_Alaska_11km_#yyyyMMdd_HHmm"
5)             timePartition="file"
6)             olderThan="5 min"/>
7) <update startup="nocheck" trigger="allow"/>
8) <tdm rewrite="test" rescan="0 0/15 * * * ? *" />
9) <gribConfig datasetTypes="TwoD Best Latest" />
</featureCollection>

  1. Arbitrary metadata can be added to the catalog. Here, we indicate that the service called "GribServices" should be used (not shown, but likely a compound service that includes all the services you want to provide for GRIB Feature Collections).
  2. A documentation element of type "summary" is added to the catalog for this dataset.
  3. The collection consists of all files ending with "grib2" in the directory "/data/ldm/pub/native/grid/NCEP/NAM/Alaska_11km/".
  4. A date will be extracted from the filename, and the files will then be sorted by date. This is important if the lexicographic ordering differs from the date order.
  5. Partitioning will happen at the file level.
  6. Only include files whose lastModified date is more than 5 minutes old. This is to exclude files that are actively being created.
  7. Instruct the TDS to use the collection index if it already exists, without testing whether it is up-to-date, and also to allow external triggers. These are the defaults.
  8. Instruct the TDM to examine all the files to detect if they have changed since the index was written. Rescan every 15 minutes.
  9. (GRIB specific) Show the TwoD and Best datasets, as well as a link to the latest partition.

Description of elements in TDS Configuration catalogs

featureCollection element

A featureCollection is a kind of dataset element, and so can contain the same elements and attributes of that element. Following is the XML Schema definition for the featureCollection element:

  <xsd:element name="featureCollection" substitutionGroup="dataset">
    <xsd:complexType>
      <xsd:complexContent>
        <xsd:extension base="DatasetType">
          <xsd:sequence>
            <xsd:element type="collectionType" name="collection"/>
            <xsd:element type="updateType" name="update" minOccurs="0"/>
            <xsd:element type="tdmType" name="tdm" minOccurs="0"/>
            <xsd:element type="protoDatasetType" name="protoDataset" minOccurs="0"/>
            <xsd:element type="fmrcConfigType" name="fmrcConfig" minOccurs="0"/>
            <xsd:element type="pointConfigType" name="pointConfig" minOccurs="0"/>
            <xsd:element type="gribConfigType" name="gribConfig" minOccurs="0"/>
            <xsd:element ref="ncml:netcdf" minOccurs="0"/>
          </xsd:sequence>
          <xsd:attribute name="featureType" type="featureTypeChoice" use="required"/>
          <xsd:attribute name="path" type="xsd:string" use="required"/>
        </xsd:extension>
      </xsd:complexContent>
    </xsd:complexType>
  </xsd:element>

Here is an example featureCollection as you might put it into a TDS catalog:
1) <featureCollection name="Metar Station Data" harvest="true" featureType="Station" path="nws/metar/ncdecoded">
2)   <metadata inherited="true">
       <serviceName>fullServices</serviceName>
       <documentation type="summary">Metars: hourly surface weather observations</documentation>
       <documentation xlink:href="http://metar.noaa.gov/" xlink:title="NWS/NOAA information"/>
       <keyword>metar</keyword>
       <keyword>surface observations</keyword>
     </metadata>
3)   <collection name="metars" spec="/data/ldm/pub/decoded/netcdf/surface/metar/Surface_METAR_#yyyyMMdd_HHmm#.nc$" />
4)   <update startup="test" rescan="0 0/15 * * * ? *"/>
5)   <protoDataset choice="Penultimate" />
6)   <pointConfig datasetTypes="cdmrFeature Files"/>
7)   <netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
       <attribute name="Conventions" value="CF-1.6"/>
     </netcdf>
   </featureCollection>

  1. A featureCollection is declared, using the name and harvest attributes declared by the dataset element. The featureType is a mandatory attribute defining the type of the feature collection. The path is also required, which defines what the URL of this collection will be. It must be unique over the entire TDS. If an ID attribute is not specified on the featureCollection, the path attribute is used as the ID (this is a recommended idiom).
  2. As is usual with dataset elements, a block of metadata can be declared that will be inherited by all the datasets.
  3. The collection of files is defined. Each dataset is assigned a nominal time by extracting a date from the filename.
  4. Specify that the collection is updated when the TDS starts, and then every 15 minutes in a background thread.
  5. The prototype dataset is the next-to-last in the collection when sorted by time.
  6. Configuration specific to the Point feature type: expose a cdmrRemote service on the entire collection, and also serve all the component files using the default service, in this example the compound service fullServices.
  7. This NcML wraps each dataset in the collection. This attribute overrides any existing one in the datasets; it tells the CDM to parse the station information using the CF Conventions.
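
The path and ID rules in item 1 can be checked mechanically. The sketch below is a minimal illustration, not part of the TDS: it parses a catalog fragment, reports the effective dataset ID of each featureCollection (the ID attribute if present, otherwise the path), and lists any path used more than once. The function names and the sample catalog are hypothetical, and the fragment deliberately omits the XML namespace that real catalogs declare.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def effective_ids(catalog_xml):
    """Return (path, effective ID) for every featureCollection in the catalog."""
    root = ET.fromstring(catalog_xml)
    return [(fc.get("path"), fc.get("ID") or fc.get("path"))
            for fc in root.iter("featureCollection")]

def duplicate_paths(catalog_xml):
    """Return the paths that appear on more than one featureCollection."""
    counts = Counter(path for path, _ in effective_ids(catalog_xml))
    return sorted(p for p, n in counts.items() if n > 1)

# Hypothetical catalog fragment without namespaces (an assumption of this sketch).
catalog = """
<catalog>
  <featureCollection name="Metar A" featureType="Station" path="nws/metar/ncdecoded">
    <collection name="metarsA" spec="/data/a/.*nc$"/>
  </featureCollection>
  <featureCollection name="Metar B" featureType="Station" ID="metarB" path="nws/metar/ncdecoded">
    <collection name="metarsB" spec="/data/b/.*nc$"/>
  </featureCollection>
</catalog>
"""

ids = effective_ids(catalog)
dups = duplicate_paths(catalog)
print(ids, dups)
```

Here both collections share one path, so the check flags it even though the second collection has its own ID.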

collection element

A collection element defines the collection of datasets. Example:

<collection spec="/data/ldm/pub/native/satellite/3.9/WEST-CONUS_4km/WEST-CONUS_4km_3.9_.*gini$"
            dateFormatMark="#WEST-CONUS_4km_3.9_#yyyyMMdd_HHmm"
            name="WEST-CONUS_4km" olderThan="15 min" />

The XML Schema for the collection element:

   <xsd:complexType name="collectionType">
1)   <xsd:attribute name="spec" type="xsd:string" use="required"/>
2)   <xsd:attribute name="name" type="xsd:token"/>
3)   <xsd:attribute name="olderThan" type="xsd:string" />
4)   <xsd:attribute name="dateFormatMark" type="xsd:string"/>
5)   <xsd:attribute name="timePartition" type="xsd:string"/>
   </xsd:complexType>

where

  1. spec (required): collection specification string. In this example, the collection contains all files in the directory /data/ldm/pub/native/satellite/3.9/WEST-CONUS_4km/ whose filename matches the regular expression "WEST-CONUS_4km_3.9_.*gini$" (where ".*" means "match any number of characters" and "gini$" means "ends with the characters gini"). If you wanted to match ".gini", you would need to escape the ".", i.e. "\.gini$".
  2. name (required): the collection name, which must be unique for all collections served by your TDS. This is used for external triggers, for the CDM collection index files, and for logging and debugging messages. If missing, the name attribute on the <featureCollection> element is used. However, we recommend that you create a unique, immutable name for the dataset collection, and put it in this name attribute of the collection element.
  3. olderThan (optional): Only files whose lastModified date is older than this are included. This is used to exclude files that are in the process of being written. However, it only applies to newly found files; that is, once a file is in the collection it is not removed because it was updated.
  4. dateFormatMark (optional): This defines a DateExtractor, which is applied to each file in the collection to assign it a date, which is used for sorting, getting the latest file, and possibly for time partitioning. In this example, the string WEST-CONUS_4km_3.9_ is located in each file path, then the SimpleDateFormat template yyyyMMdd_HHmm is applied to the next characters of the filename to create a date. A DateExtractor can also be defined in the collection specification string, but in that case the date must be contained just in the file name, as opposed to the complete file path which includes all of the parent directory names. Use this OR a date extractor in the specification string, but not both.
  5. timePartition (optional): Currently only used by GRIB collections, see here for more info.
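
The olderThan rule in item 3 has one subtlety: it filters only newly found files, so a file already in the collection stays even if it is modified later. The following sketch illustrates that behaviour under stated assumptions. The duration parser, its unit table, and the function names are hypothetical and are not the TDS implementation; times are plain seconds.

```python
import re

# Hypothetical unit table; the TDS may accept other unit spellings.
_UNIT_SECONDS = {"sec": 1, "min": 60, "hour": 3600, "day": 86400}

def parse_duration(text):
    """Parse strings such as '15 min' into seconds."""
    value, unit = re.fullmatch(r"\s*(\d+)\s*([a-z]+)\s*", text).groups()
    return int(value) * _UNIT_SECONDS[unit]

def update_members(members, found, now, older_than):
    """Return the member set after one scan.

    members: set of file names already in the collection
    found:   dict mapping file name -> lastModified time seen in this scan
    """
    limit = now - parse_duration(older_than)
    result = set()
    for name, mtime in found.items():
        # Existing members stay even if recently modified;
        # new files are admitted only once they are old enough.
        if name in members or mtime < limit:
            result.add(name)
    return result

# First scan: b.gini is still being written, so only a.gini is admitted.
scan1 = update_members(set(), {"a.gini": 5000, "b.gini": 9900}, 10000, "15 min")
# Second scan: a.gini was just rewritten but is already a member, so it stays.
scan2 = update_members(scan1, {"a.gini": 9950, "b.gini": 9900}, 10000, "15 min")
# Later scan: b.gini is now older than 15 minutes and joins the collection.
scan3 = update_members(scan2, {"a.gini": 9950, "b.gini": 9900}, 11000, "15 min")
print(scan1, scan2, scan3)
```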

Date Extractor

Feature Collections sometimes (Point, FMRC (usually), and time partitioned GRIB) need to know how to sort the collection of files. In those cases you need to have a date in the filename, and to either specify a date extractor in the specification string or include a dateFormatMark attribute.

1. If the date is in the filename only, you can use the collection specification string, aka a spec:

 /data/ldm/pub/native/grid/NCEP/GFS/Alaska_191km/GFS_Alaska_191km_#yyyyMMdd_HHmm#\.grib1$ 

applied to the file /data/ldm/pub/native/grid/NCEP/GFS/Alaska_191km/GFS_Alaska_191km_20111226_1200.grib1 would extract the date 2011-12-26T12:00:00.

In this case, #yyyyMMdd_HHmm# is positional: it counts the characters before the '#' and then extracts the characters at that position in the filename (here the 13 characters starting at zero-based offset 17) and applies the SimpleDateFormat yyyyMMdd_HHmm pattern to them.

2. When the date is in the directory name and not completely in the filename, you must use the dateFormatMark. For example with a file path

/data/ldm/pub/native/grid/NCEP/GFS/Alaska_191km/20111226/Run_1200.grib1

Use

dateFormatMark="#Alaska_191km/#yyyyMMdd'/Run_'HHmm"

In this case, the '#' characters delineate the substring match on the entire pathname. Immediately following the match comes the string to be given to SimpleDateFormat, in this example:

yyyyMMdd'/Run_'HHmm

Note that the /Run_ is enclosed in single quotes. This tells SimpleDateFormat to interpret these characters literally, and they must match characters in the file path exactly.

You might also need to put the SimpleDateFormat before the substring match, e.g. in the following, stuff differs for each subdirectory, so you can't match on it:

/dataroot/stuff/20111226/Experiment-02387347.grib1

However, you can match on Experiment so you can use:

dateFormatMark="yyyyMMdd#/Experiment#"

Note that whatever you match on must be unique in the pathname.
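
The two extraction styles described in this section can be sketched as follows. This is an approximation for illustration only: the TDS uses Java's SimpleDateFormat, whereas this sketch translates the handful of pattern letters used in this document into Python strptime codes. The function names and the translation table are assumptions, and only fixed-width fields (yyyy, MM, dd, HH, mm) plus quoted literals are handled.

```python
from datetime import datetime

# Only the fixed-width SimpleDateFormat fields used in this document.
_FIELDS = {"yyyy": "%Y", "MM": "%m", "dd": "%d", "HH": "%H", "mm": "%M"}

def translate(sdf):
    """Return (strptime format, text width) for a SimpleDateFormat-style pattern."""
    out, width, i = [], 0, 0
    while i < len(sdf):
        if sdf[i] == "'":                      # quoted literal such as '/Run_'
            j = sdf.index("'", i + 1)
            literal = sdf[i + 1:j]
            out.append(literal.replace("%", "%%"))
            width += len(literal)
            i = j + 1
            continue
        for token, code in _FIELDS.items():
            if sdf.startswith(token, i):
                out.append(code)
                width += len(token)
                i += len(token)
                break
        else:                                  # any other character is literal
            out.append(sdf[i].replace("%", "%%"))
            width += 1
            i += 1
    return "".join(out), width

def date_from_spec(filename, spec_name):
    """Positional extraction; spec_name is the filename part of the spec."""
    start = spec_name.index("#")               # characters before the first '#'
    end = spec_name.index("#", start + 1)
    fmt, width = translate(spec_name[start + 1:end])
    return datetime.strptime(filename[start:start + width], fmt)

def date_from_mark(path, mark):
    """dateFormatMark extraction against the full path."""
    if mark.startswith("#"):                   # "#match#format": date follows the match
        _, match, sdf = mark.split("#")
        fmt, width = translate(sdf)
        start = path.index(match) + len(match)
    else:                                      # "format#match#": date precedes the match
        sdf, match, _ = mark.split("#")
        fmt, width = translate(sdf)
        start = path.index(match) - width
    return datetime.strptime(path[start:start + width], fmt)

d1 = date_from_spec("GFS_Alaska_191km_20111226_1200.grib1",
                    r"GFS_Alaska_191km_#yyyyMMdd_HHmm#\.grib1$")
d2 = date_from_mark("/data/ldm/pub/native/grid/NCEP/GFS/Alaska_191km/20111226/Run_1200.grib1",
                    "#Alaska_191km/#yyyyMMdd'/Run_'HHmm")
d3 = date_from_mark("/dataroot/stuff/20111226/Experiment-02387347.grib1",
                    "yyyyMMdd#/Experiment#")
print(d1, d2, d3)
```

The sketch uses the first occurrence of the match string, which is why the text to match on must be unique in the pathname.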

protoDataset element (Not used by GRIB).

Provides control over the choice of the prototype dataset for the collection. The prototype dataset is used to populate the metadata for the feature collection. Example:

<protoDataset choice="Penultimate" change="0 2 3 * * ? *">
  <netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
    <attribute name="featureType" value="timeSeries"/>
  </netcdf>
</protoDataset>

The XML Schema definition for the protoDataset element:

 <xsd:complexType name="protoDatasetType">
   <xsd:sequence>
1)   <xsd:element ref="ncml:netcdf" minOccurs="0"/>
   </xsd:sequence>
2) <xsd:attribute name="choice" type="protoChoices"/>
3) <xsd:attribute name="change" type="xsd:string"/>
 </xsd:complexType>

where:

  1. ncml:netcdf = (optional) ncml elements that modify the prototype dataset
  2. choice= [First | Random | Penultimate | Latest] : select prototype from a time ordered list, using the first, a randomly selected one, the next to last, or the last dataset in the list. The default is "Penultimate".
  3. change= "cron expr" (optional). On rolling datasets, you need to change the prototype periodically, otherwise it will eventually be deleted. This attribute specifies when the protoDataset should be reselected, using a cron expression.
    • change = "0 2 3 * * ? *" means every day at 3:02 am.
    • if not specified, the prototype dataset is not changed, except when restarting the TDS

The choice of the protoDataset matters when the datasets are not homogeneous:

  1. Global and variable attributes are taken from the prototype dataset.
  2. If a variable appears in the prototype dataset, it will appear in the feature collection dataset. If it doesn't appear in other datasets, it will have missing data for those times.
  3. If a variable does not appear in the prototype dataset, it will not appear in the feature collection dataset, even if it appears in other datasets.
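
The selection rule and its effect on non-homogeneous collections can be sketched as below. This is an illustrative model only: datasets are represented as small dictionaries, the function names are hypothetical, and the fallback for a one-element list under "Penultimate" is an assumption rather than documented behaviour.

```python
import random

def choose_prototype(datasets, choice="Penultimate", rng=random):
    """Pick the prototype from a time-ordered list of datasets."""
    if not datasets:
        raise ValueError("empty collection")
    if choice == "First":
        return datasets[0]
    if choice == "Random":
        return rng.choice(datasets)
    if choice == "Penultimate":
        # Assumption: with a single dataset, use that dataset.
        return datasets[-2] if len(datasets) > 1 else datasets[0]
    if choice == "Latest":
        return datasets[-1]
    raise ValueError(f"unknown choice: {choice}")

def collection_variables(datasets, prototype):
    """Map each prototype variable to the times at which it has missing data."""
    return {var: [d["time"] for d in datasets if var not in d["variables"]]
            for var in prototype["variables"]}

# Hypothetical time-ordered collection with differing variable sets.
datasets = [
    {"time": "t1", "variables": {"temp"}},
    {"time": "t2", "variables": {"temp", "wind"}},
    {"time": "t3", "variables": {"temp", "rain"}},
]
proto = choose_prototype(datasets)             # default is Penultimate
variables = collection_variables(datasets, proto)
print(proto["time"], variables)
```

With the default choice the prototype is the dataset at t2, so "wind" appears in the collection with missing data at t1 and t3, while "rain" does not appear at all.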

update element

For collections that change, the update element provides options to update the collection, either synchronously (while a user request waits) or asynchronously (in a background task, so that requests do not wait). If there is no update element, then the dataset is considered static, and the indexes are never updated by the TDS. (To force updated indices, delete the collection index, usually <collection root directory> / <dataset name>.ncx.) Examples:

<update startup="test" rescan="0 0/30 * * * ? *" trigger="false"/>

<update recheckAfter="15 min" />

<update startup="never" trigger="allow" />
  1. The first example says to test if the dataset has been updated when the TDS starts up, then test in a background process every 30 minutes. (Cannot use for GRIB collections, see tdm element below). Do not allow external triggers.
  2. The second example says to test if the dataset has been updated only when a request comes in for it, and the dataset hasn't been checked for 15 minutes.
  3. The third example tells the TDS to never update the collection indices, but to allow an external program (such as the TDM) to send a trigger telling the TDS that it should reread the collection into memory. This is useful for large collections of data where even testing if a dataset has changed can be costly.

The XML Schema definition for the update element:

   <xsd:complexType name="updateType">
1)   <xsd:attribute name="startup" type="xsd:token"/>
2)   <xsd:attribute name="recheckAfter" type="xsd:string" />
3)   <xsd:attribute name="rescan" type="xsd:token"/>
4)   <xsd:attribute name="trigger" type="xsd:token"/>
   </xsd:complexType>

where:

  1. startup: [never | nocheck | testIndexOnly | test | always] The collection is always read in on server startup. This attribute controls whether the collection index is tested and rebuilt.
    • For GRIB:
      • If "never", the collection index is always used and must exist. Use this for very large collections that you don't want to inadvertently scan.
      • If "nocheck", the collection index is used if it exists, without checking whether it is up-to-date. This is the default.
      • If "testIndexOnly", the collection index is used if it exists and it is newer than all of its immediate children. (experimental)
      • If "test" or "true", the collection is scanned and the new collection of children is compared to the old collection. If there are any changes the index is rebuilt.
      • If "always", the collection is always rescanned and the indices are rebuilt.
    • For FMRC:
      • If "test" or "true", the collection is scanned and the new collection of children is compared to the old collection. If there are any changes the index is rebuilt.
      • If "nocheck", the collection index is used if it exists, without checking whether it is up-to-date. This is the default.
  2. recheckAfter: This will cause a new scan whenever a request comes in and this much time has elapsed since the last scan. The request will wait until the scan is finished and a new collection is built (if needed), and so is called synchronous updating. This option will be ignored if you are using the rescan attribute or if you have a tdm element.
  3. rescan: uses a cron expression to specify when the collection should be rescanned in a background task. This is called asynchronous updating.
  4. trigger: if set to "allow" (default), then external triggering will be allowed. This allows collections to be updated by an external program (or person using a browser) sending an explicit "trigger" URL to the server. This URL is protected by HTTPS, so you must enable triggers for this to work. Set this to "false" to disable triggering.

For GRIB collections, dynamic updating of the collection by the TDS is no longer supported (use the TDM for this). Therefore recheckAfter and rescan are ignored on an update element for a GRIB collection.
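
The precedence rules above (which attributes are ignored, and which defaults apply) can be summarized in a small sketch. It is a hypothetical helper, not TDS code: it reads a featureCollection fragment and returns the update settings that would take effect according to the rules stated in this section. The function name and the un-namespaced sample fragments are assumptions.

```python
import xml.etree.ElementTree as ET

def effective_update(feature_collection_xml):
    """Return the update settings that take effect for one featureCollection."""
    fc = ET.fromstring(feature_collection_xml)
    is_grib = fc.get("featureType", "").upper().startswith("GRIB")
    update = fc.find("update")
    has_tdm = fc.find("tdm") is not None
    if update is None:
        return {"static": True}                 # no update element: static dataset
    modes = {
        "static": False,
        "startup": update.get("startup", "nocheck"),   # documented default
        "trigger": update.get("trigger", "allow"),     # documented default
        "rescan": update.get("rescan"),
        "recheckAfter": update.get("recheckAfter"),
    }
    if is_grib:
        # The TDS does not update GRIB collections dynamically.
        modes["rescan"] = None
        modes["recheckAfter"] = None
    elif modes["rescan"] is not None or has_tdm:
        # recheckAfter is ignored when rescan or a tdm element is present.
        modes["recheckAfter"] = None
    return modes

station = effective_update(
    '<featureCollection featureType="Station" name="s" path="p1">'
    '<collection spec="/d/.*nc$"/>'
    '<update startup="test" rescan="0 0/15 * * * ? *" recheckAfter="15 min"/>'
    '</featureCollection>')
grib = effective_update(
    '<featureCollection featureType="GRIB2" name="g" path="p2">'
    '<collection spec="/d/.*grib2$"/>'
    '<update rescan="0 0/15 * * * ? *" recheckAfter="15 min"/>'
    '</featureCollection>')
static = effective_update(
    '<featureCollection featureType="FMRC" name="f" path="p3">'
    '<collection spec="/d/.*nc$"/>'
    '</featureCollection>')
print(station, grib, static)
```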

tdm element (GRIB only)

You must use the tdm element for GRIB collections that change. The TDM is a separate process that uses the same configuration catalogs as the TDS, and updates GRIB collections in the background. Example:

<tdm rewrite="test" rescan="0 4,19,34,49 * * * ? *"  />

The XML Schema definition for the tdm element:

<xsd:complexType name="tdmType">
1) <xsd:attribute name="rewrite" type="xsd:token"/>
2) <xsd:attribute name="rescan" type="xsd:token"/>
</xsd:complexType>

where:

  1. rewrite: [test | always] If "always", the collection index is always rebuilt. If "test", the collection is scanned and a new index is built if the collection has changed.
  2. rescan: uses a cron expression to specify when the collection should be rescanned.
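
To make the cron expressions used by rescan and change easier to read, the sketch below expands the hour and minute fields of an expression into the values they select. The field order (second, minute, hour, day of month, month, day of week, year) is inferred from the examples in this document and matches Quartz-style cron; treat it as an assumption. The helper names are hypothetical, and only "*", "?", single values, comma lists, and start/step forms are handled.

```python
FIELD_NAMES = ["second", "minute", "hour", "day_of_month",
               "month", "day_of_week", "year"]

def expand(field, lo, hi):
    """Expand one cron field into the sorted list of values it selects."""
    if field in ("*", "?"):
        return list(range(lo, hi + 1))
    values = set()
    for part in field.split(","):
        if "/" in part:                        # start/step, e.g. 0/15
            start, step = part.split("/")
            first = lo if start == "*" else int(start)
            values.update(range(first, hi + 1, int(step)))
        else:
            values.add(int(part))
    return sorted(values)

def fire_times(expr):
    """Return (hours, minutes) at which a cron expression fires."""
    fields = dict(zip(FIELD_NAMES, expr.split()))
    return expand(fields["hour"], 0, 23), expand(fields["minute"], 0, 59)

every_15 = fire_times("0 0/15 * * * ? *")
offset_15 = fire_times("0 4,19,34,49 * * * ? *")
daily = fire_times("0 2 3 * * ? *")
print(every_15[1], offset_15[1], daily)
```

The second expression also fires four times an hour, but offset by four minutes from the quarter hour.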

Enabling Triggers

  1. You must give the role "tdsTrigger" to any user who you want to have the right to send a trigger.
  2. You can see a list of the Feature Collection datasets (and manually trigger a rescan) on the page https://server:port/thredds/admin/debug?Collections/showCollection
  3. The URL for the actual trigger is https://server:port/thredds/admin/collection/trigger?collection=name&trigger=type, where name is the collection name, and type is a collectionUpdateType (see update element above). This does a rescan, and updates if anything has changed.
  4. The TDM uses the trigger https://server:port/thredds/admin/collection/trigger?collection=name&trigger=nocheck. This does not rescan the directory, it simply recreates the dataset using the current index.
  5. Also see enabling Remote Management
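
A trigger URL of the form given in items 3 and 4 can be assembled as in the sketch below. The function name, the host name "server", and the port 8443 are placeholders chosen for illustration. The sketch only builds the URL; sending it requires HTTPS and a user with the tdsTrigger role, which is not shown.

```python
from urllib.parse import urlencode

def trigger_url(server, port, collection, trigger_type):
    """Build the trigger URL for a named collection."""
    query = urlencode({"collection": collection, "trigger": trigger_type})
    return f"https://{server}:{port}/thredds/admin/collection/trigger?{query}"

# Placeholder host and port; the collection name comes from Example 2.
url = trigger_url("server", 8443, "NAM_Alaska_11km", "nocheck")
print(url)
```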

Static vs. changing datasets

There are several ways to update a feature collection when it changes, specified by attributes on the update element:

  1. recheckAfter attribute: causes a directory scan whenever a request comes in and the specified time has elapsed since the last scan. The request waits until the scan is finished and a new collection is built. This is called synchronous updating.
  2. rescan and startup attributes: uses a background thread to keep the collection updated, so that requests never wait. This is called asynchronous updating.
  3. trigger attribute: allows a trigger to be sent to the TDS to tell it to update the collection. This is called user controlled updating.
  4. tdm element: for GRIB collections, you may use the TDM to do all index updating. This is called external program updating.

Static Collection - Small or Rarely Used

If you have a collection that doesn't change, do not include an update element. The first time that the dataset is accessed, it will be read in and then never changed.

Static Collection - Fast response

If you have a collection that doesn't change, but you want to have it ready for requests, then use:

<update startup="always" />

The dataset will be scanned at startup time and then never changed.

Large Static Collection

You have a large collection, which takes a long time to scan. You must carefully control when/if it will be scanned.

<update startup="nocheck" />

The dataset will be read in at startup time by using the existing indexes (if they exist). If indexes don't exist, they will be created on startup.

If it occasionally changes, then you want to manually tell it when to rescan:

<update startup="nocheck" trigger="allow" />

The dataset will be read in at startup time by using the existing indexes, and you manually tell it when to rebuild the index. You must enable triggers.

Changing Collection - Small or Rarely Used

For collections that change but are rarely used, use the recheckAfter attribute on the update element. This minimizes unneeded processing for lightly used collections. This is also a reasonable strategy for small collections which don't take very long to build.

<update recheckAfter="15 min" />

Do not include both a recheckAfter and a rescan attribute. If you do, the recheckAfter will be ignored.
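
The synchronous behaviour of recheckAfter can be modelled with a few lines of code. This is a toy model under stated assumptions, not the TDS implementation: the class name, the injected scan function, and the injected clock are all hypothetical, and the collection is scanned once when it is created.

```python
class RecheckingCollection:
    """Rescan on request, but only if recheck_after_s seconds have elapsed."""

    def __init__(self, scan, recheck_after_s, clock):
        self._scan = scan
        self._recheck = recheck_after_s
        self._clock = clock
        self._files = scan()                   # initial read of the collection
        self._last_scan = clock()

    def files(self):
        now = self._clock()
        if now - self._last_scan >= self._recheck:
            # The request waits here while the directory is rescanned.
            self._files = self._scan()
            self._last_scan = now
        return self._files

# Fake clock and scanner so the behaviour is deterministic.
current = [0]
scan_times = []

def fake_scan():
    scan_times.append(current[0])
    return [f"scan-{len(scan_times)}"]

coll = RecheckingCollection(fake_scan, 15 * 60, lambda: current[0])
first = coll.files()                           # t = 0: uses the initial scan
current[0] = 600
second = coll.files()                          # t = 10 min: too soon, no rescan
current[0] = 900
third = coll.files()                           # t = 15 min: rescan happens
print(first, second, third, scan_times)
```

Requests that arrive within the recheck interval are served from the last scan, which is why this strategy suits lightly used collections.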

Changing Collection - Fast response

When you want to ensure that requests are answered as quickly as possible, read it at startup and also update the collection in the background using rescan:

<update startup="test" rescan="0 20 * * * ? *" />

This cron expression says to rescan the collection files every hour at 20 past the hour, and rebuild the dataset if needed.

Sporadically changing Collection

To externally control when a collection is updated, use:

<update trigger="allow" />

You must enable remote triggers, and when the dataset changes, send a message to a special URL in the TDS.

Changing GRIB Collection

You have a GRIB collection that changes. The TDS can only scan/write indices at startup time. You must use the TDM to detect any changes.

<update startup="test" trigger="allow"/>
<tdm rewrite="test" rescan="0 0/15 * * * ? *"/>

The dataset will be read in at startup time by the TDS using the existing indexes, and will be scanned by the TDM every 15 minutes, which will send a trigger as needed.

Very Large GRIB Collection that doesn't change

You have a very large collection, which takes a long time to scan. You must carefully control when/if it will be scanned.

<update startup="never"/>
<tdm rewrite="test"/>

The TDS never scans the collection; it always uses existing indices, which must already exist. Run the TDM first. After the indices are made, you can stop the TDM and start the TDS.

Very Large GRIB Collection that changes

You have a very large collection which changes, and takes a long time to scan. You must carefully control when/if it will be scanned.

<update startup="never" trigger="allow"/>
<tdm rewrite="test" rescan="0 0 3 * * ? *" />

The dataset will be read in at startup time by using the existing indexes, which must exist. The TDM will test whether the collection has changed once a day at 3 am, and send a trigger to the TDS if needed.


NcML Modifications

NcML is no longer used to define the collection, but it may still be used to modify the feature collection dataset, for FMRC or Point (not GRIB).

<featureCollection featureType="FMRC" name="RTOFS Forecast Model Run Collection" path="fmrc/rtofs">
1) <collection spec="c:/rps/cf/rtofs/.*ofs_atl.*\.grib2$" recheckAfter="10 min" olderThan="5 min"/>

2) <netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
    <variable name="time">
      <attribute name="units" value="hours since 1953-11-29T08:57"/>
     </variable>
   </netcdf>

   <protoDataset>
3)  <netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
      <attribute name="speech" value="I'd like to thank all the little people..."/>
      <variable name="mixed_layer_depth">
       <attribute name="long_name" value="mixed_layer_depth @ surface"/>
       <attribute name="units" value="m"/>
      </variable>
     </netcdf>
   </protoDataset>
   
</featureCollection>

where:

  1. The collection is defined by a collection element, allowing any number of forecast times per file
  2. When you want to modify the component files of the collection, you put an NcML element inside the featureCollection element. This modifies the component files before they are turned into a gridded dataset. In this case we have fixed the time coordinate units attribute; otherwise the individual files would not be recognized as Grid datasets, and the feature collection would fail.
  3. When you want to modify the resulting FMRC dataset, you put an NcML element inside the protoDataset element. In this case we have added a global attribute named speech and 2 attributes on the variable named mixed_layer_depth.



This document was last updated April 2015