Configuring TDS with the FeatureCollection element


Overview

The featureCollection element is a way to tell the TDS to serve collections of CDM Feature Datasets. Currently this is used for gridded and point datasets whose time and spatial coordinates are recognized by the CDM software stack. This allows the TDS to automatically create logical datasets composed of collections of files, and to allow subsetting in coordinate space on them, e.g. through the WMS, WCS, and NetCDF Subset Service.

The featureCollection element is new in TDS 4.2 and replaces the datasetFmrc element in earlier versions. TDS 4.2 allows featureType = FMRC, Point, and Station. TDS 4.3 adds featureType = GRIB, which can only be used for collections of GRIB2 files.

Much of the complexity of feature collections lies in managing the collection of files on the server: creating indexes for performance, and handling collections that change. For high-performance servers, it's better to let a background process manage indexing; the THREDDS Data Manager (TDM) is an experimental application for this purpose, available in TDS 4.3.

This document gives an overview of Feature Collections, as well as the complete syntax of allowed elements. For featureType-specific information, see:


Example catalog elements

Simple case using defaults:

<featureCollection name="NCEP-NAM-Polar_90km" featureType="FMRC" path="fmrc/NCEP/NAM/Polar_90km">
  <collection spec="/data/ldm/pub/native/grid/NCEP/NAM/Polar_90km/NAM_Polar_90km_#yyyyMMdd_HHmm#.grib2$"/>

</featureCollection>

Fully specify the options:

<featureCollection name="NCEP-NAM-Polar_90km" featureType="FMRC" harvest="true" path="fmrc/NCEP/NAM/Polar_90km">
  <collection spec="/data/ldm/pub/native/grid/NCEP/NAM/Polar_90km/NAM_Polar_90km_#yyyyMMdd_HHmm#.grib2$"
          recheckAfter="15 min" olderThan="5 min"/>
  <update startup="true" rescan="0 5 3 * * ? *" />
  <protoDataset choice="Penultimate" change="0 2 3 * * ? *" />
  <fmrcConfig regularize="true" datasetTypes="TwoD Best Files Runs ConstantForecasts ConstantOffsets" />
</featureCollection>

With NcML elements:

<featureCollection name="NCEP-NAM-Polar_90km" featureType="FMRC" harvest="true" path="fmrc/NCEP/NAM/Polar_90km">
  <collection spec="/data/ldm/pub/native/grid/NCEP/NAM/Polar_90km/NAM_Polar_90km_#yyyyMMdd_HHmm#.grib2$"
          recheckAfter="15 min"  olderThan="5 min"/>
  <update startup="true" rescan="0 5 3 * * ? *" />
  <protoDataset choice="Penultimate" change="0 2 3 * * ? *" >
    <netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
      <attribute name="History" value="processed by Rectilyser 6.23a"/>
    </netcdf>
  </protoDataset>
  <fmrcConfig regularize="true" datasetTypes="TwoD Best Files Runs ConstantForecasts ConstantOffsets" />
  <netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
    <attribute name="Conventions" value="CF-1.6"/>
  </netcdf>
</featureCollection>

Description of elements in TDS Configuration catalogs

featureCollection element

A featureCollection is a kind of dataset element, and so can contain the same elements and attributes of that element. Following is the XML Schema definition, which shows only the elements and attributes that are particular to a featureCollection:

  <xsd:element name="featureCollection" substitutionGroup="dataset">
    <xsd:complexType>
      <xsd:complexContent>
        <xsd:extension base="DatasetType">
          <xsd:sequence>
            <xsd:element type="collectionType" name="collection"/>
            <xsd:element type="updateType" name="update" minOccurs="0"/>
            <xsd:element type="manageType" name="manage" minOccurs="0"/>
            <xsd:element type="protoDatasetType" name="protoDataset" minOccurs="0"/>
            <xsd:element type="fmrcConfigType" name="fmrcConfig" minOccurs="0"/>
            <xsd:element type="pointConfigType" name="pointConfig" minOccurs="0"/>
            <xsd:element type="gribConfigType" name="gribConfig" minOccurs="0"/>
            <xsd:element ref="ncml:netcdf" minOccurs="0"/>
          </xsd:sequence>
          <xsd:attribute name="featureType" type="featureTypeChoice" use="required"/>
          <xsd:attribute name="path" type="xsd:string" use="required"/>
        </xsd:extension>
      </xsd:complexContent>
    </xsd:complexType>
  </xsd:element>

Here is an example featureCollection as you might put it into a TDS catalog:
1)<featureCollection name="Metar Station Data" harvest="true" featureType="Station" path="nws/metar/ncdecoded">
2) <metadata inherited="true">
     <serviceName>fullServices</serviceName>
     <documentation type="summary">Metars: hourly surface weather observations</documentation>
     <documentation xlink:href="http://metar.noaa.gov/" xlink:title="NWS/NOAA information"/>
     <keyword>metar</keyword>
     <keyword>surface observations</keyword>
   </metadata>
3) <collection spec="/data/ldm/pub/decoded/netcdf/surface/metar/Surface_METAR_#yyyyMMdd_HHmm#.nc$" />
4) <update startup="true" rescan="0 0/15 * * * ? *" trigger="allow"/>
5) <protoDataset choice="Penultimate" />
6) <pointConfig datasetTypes="cdmrFeature Files"/>
7) <netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
     <attribute name="Conventions" value="CF-1.6"/>
   </netcdf>
  </featureCollection>

  1. A featureCollection is declared, using the name and harvest attributes declared by the dataset element. The featureType is a mandatory attribute defining the type of the feature collection. The path is also required, which defines what the URL of this collection will be. It must be unique over the entire TDS. If an ID attribute is not specified on the featureCollection, the path attribute is used as the ID (this is a recommended idiom).
  2. As is usual with dataset elements, a block of metadata can be declared that will be inherited by all the datasets.
  3. The collection of files is defined. Each dataset is assigned a nominal time by extracting a date from the filename.
  4. Specify that the collection is updated in a background thread, every 15 minutes.
  5. The prototype dataset is the next-to-last in the collection when sorted by time.
  6. Configuration specific to the Point feature type: expose a cdmrRemote service on the entire collection, and also serve all the component files using the default service, in this example the compound service fullServices.
  7. This NcML wraps each dataset in the collection. This attribute overrides any existing one in the datasets; it tells the CDM to parse the station information using the CF Conventions.

collection element

A collection element defines the collection of datasets. It takes the place of the NcML aggregation elements (scan and scanFmrc).

<collection spec="/data/ldm/pub/native/satellite/3.9/WEST-CONUS_4km/WEST-CONUS_4km_3.9_#yyyyMMdd_HHmm#.gini$"
            name="WEST-CONUS_4km" olderThan="1 min" recheckAfter="15 min" />
The XML Schema:
  <xsd:complexType name="collectionType">
1) <xsd:attribute name="spec" type="xsd:string" use="required"/>
2) <xsd:attribute name="name" type="xsd:token"/>
3) <xsd:attribute name="olderThan" type="xsd:string" />
4) <xsd:attribute name="recheckAfter" type="xsd:string" />
5) <xsd:attribute name="dateFormatMark" type="xsd:string"/>
6) <xsd:attribute name="timePartition" type="xsd:string"/>
</xsd:complexType>

where

  1. spec: collection specification string (required). In this example, the collection consists of all files in the directory /data/ldm/pub/native/satellite/3.9/WEST-CONUS_4km/ whose filename matches the regular expression WEST-CONUS_4km_3.9_........_....\.gini$. Each dataset is assigned a nominal time by matching yyyyMMdd_HHmm to the portion of the filename following "WEST-CONUS_4km_3.9_".
  2. name: collection name, which must be unique across all of your TDS catalogs. This is used for external triggers and as an easy-to-read identifier for indexing, logging and debugging. If missing, the spec string is used (not a good idea in the context of the TDS).
  3. olderThan: (optional) Only files whose lastModified date is older than this are included. This excludes files that are in the process of being written.
  4. recheckAfter: (optional) This will cause a new scan whenever a request comes in and this much time has elapsed since the last scan. The request will wait until the scan is finished and a new collection is built (if needed).
  5. dateFormatMark: the collection specification string can only extract dates from the file name, as opposed to the file path, which includes all of the parent directory names. Use the dateFormatMark in order to extract the date from the full path.
  6. timePartition: experimental, not complete yet.

Feature Collections need to know how to sort the collection of files, so it's recommended that you have a date in the filename, and that you either specify a date extractor in the specification string or include a dateFormatMark attribute. Otherwise, files will be sorted by filename.
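
As a rough illustration of how a specification string breaks down, the sketch below annotates the METAR spec used earlier in this document. The collection name "metar-station" and the olderThan value are assumptions made up for illustration, not recommended settings.

```xml
<!-- Sketch only: the name "metar-station" and the olderThan value are illustrative assumptions. -->
<!--
  The spec string has three parts:
    /data/ldm/pub/decoded/netcdf/surface/metar/   directory to scan
    Surface_METAR_#yyyyMMdd_HHmm#.nc$             pattern that filenames must match
    #yyyyMMdd_HHmm#                               date extractor: these characters supply the nominal time
-->
<collection name="metar-station"
            spec="/data/ldm/pub/decoded/netcdf/surface/metar/Surface_METAR_#yyyyMMdd_HHmm#.nc$"
            olderThan="5 min" />
```

Under this spec, a file named Surface_METAR_20110615_0000.nc in that directory would be expected to match, with its nominal time taken from the characters 20110615_0000.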

protoDataset element

The protoDataset element provides control over the choice of the prototype dataset for the collection. The prototype dataset is used to populate the metadata for the feature collection.

<protoDataset choice="Penultimate" param="0" change="expr">
  <netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
    <attribute name="CF:FeatureType" value="station"/>
  </netcdf>
</protoDataset>

 <xsd:complexType name="protoDatasetType">
   <xsd:sequence>
1)   <xsd:element ref="ncml:netcdf" minOccurs="0"/>
   </xsd:sequence>
2) <xsd:attribute name="choice" type="protoChoices"/>
3) <xsd:attribute name="change" type="xsd:string"/>
4) <xsd:attribute name="param" type="xsd:string"/>
 </xsd:complexType>
 <xsd:simpleType name="protoChoices">
  <xsd:union memberTypes="xsd:token">
   <xsd:simpleType>
    <xsd:restriction base="xsd:token">
      <xsd:enumeration value="First"/>
      <xsd:enumeration value="Random"/>
      <xsd:enumeration value="Penultimate"/>
      <xsd:enumeration value="Latest"/>
    </xsd:restriction>
   </xsd:simpleType>
  </xsd:union>
 </xsd:simpleType>

where:

  1. ncml:netcdf = (optional) ncml elements that modify the prototype dataset
  2. choice= "First | Random | Penultimate | Latest" : select prototype from a time ordered list, using the first, a randomly selected one, the next to last, or the last dataset in the list. The default is "Penultimate".
  3. change= "cron expr" (optional). On rolling datasets, you need to change the prototype periodically, otherwise it will eventually get deleted. This attribute specifies when the protoDataset should be reselected, using a cron expression.
  4. param= (not implemented) used only with choice="Run". Names the run to use, in hours since 0Z. For example, choice="Run" param="0" means to use the latest 0Z dataset run.

The choice of the protoDataset matters when the datasets are not homogeneous:

  1. Global and variable attributes are taken from the prototype dataset.
  2. If a variable appears in the prototype dataset, it will appear in the feature collection dataset. If it doesn't appear in other datasets, it will have missing data for those times.
  3. If a variable does not appear in the prototype dataset, it will not appear in the feature collection dataset, even if it appears in other datasets.

update element

For collections that change, the update element provides options to update the collection in a background task. New collections are built in the background, so that requests do not wait.

<update startup="true" rescan="cron expr" trigger="allow" />
The XML Schema definition for the update element:
   <xsd:complexType name="updateType">
1) <xsd:attribute name="startup" type="xsd:boolean"/>
2) <xsd:attribute name="rescan" type="xsd:token"/>
3) <xsd:attribute name="trigger" type="xsd:token"/>
</xsd:complexType>

where:

  1. startup= if true, when the server starts up, rescan and create the collection. If set to nocheck, assume the collection hasn't changed since the last run, and just create the collection (in memory). If it's the first time that the TDS is being run with this collection, the collection information is always created and in some cases cached.
  2. rescan= "cron expr" uses a cron expression to specify when the collection should be rescanned in a background task.
  3. trigger= if set to "allow", then external triggering will be allowed. This allows collections to be updated only when needed, by an external program (or person) sending an explicit "trigger" URL to the server. This URL is protected by HTTPS, so you must enable remote access for this to work.
    1. The URL is "https://server:port/thredds/admin/collection?trigger=true&collection=<name>", where name is the featureCollection name.
    2. You can see a list of the FMRC datasets (and manually trigger a rescan) on the page "https://server:port/thredds/admin/debug?Collections/triggerRescan".
    3. You can see what datasets are currently cached in each collection on the page "https://server:port/thredds/admin/debug?Collections/showFmrcCache"
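
To make the rescan attribute more concrete, the sketch below annotates two of the cron expressions that appear in this document. The field order shown is an assumption: the seven-field expressions used here look like Quartz-style cron expressions, but this document does not state the format explicitly. The two update elements are alternatives; a featureCollection would contain only one of them.

```xml
<!-- Field order assumed here (Quartz-style):
     seconds  minutes  hours  day-of-month  month  day-of-week  year -->

<!-- "0 0/15 * * * ? *": second 0, every 15th minute starting at minute 0, every hour of every day -->
<update startup="true" rescan="0 0/15 * * * ? *" />

<!-- "0 5 3 * * ? *": second 0, minute 5, hour 3, that is, once a day at 03:05:00 -->
<update startup="true" rescan="0 5 3 * * ? *" />
```

The first reading agrees with the METAR example above, which is described as updating every 15 minutes.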

manage element (NOT IMPLEMENTED YET)

This instructs the TDS to manage your collection by deleting files that are older than a certain time.

<manage deleteAfter="30 days" check="cron expr" />

where:

  1. deleteAfter= delete files older than this amount
  2. check= "cron expr" uses a cron expression to specify when the collection should be checked for old files.

Static vs. changing datasets

There are two ways to update a feature collection when it changes, without having to restart the TDS:

  1. recheckAfter attribute on the collection element: causes a directory scan whenever a request comes in and the specified time has elapsed since the last scan. The request waits until the scan is finished and a new collection is built.
  2. update element : uses a background thread to keep the collection updated, so that requests never wait.

Static Datasets

If you have a collection that doesn't change, do not use the recheckAfter or the rescan attribute. Instead, use:

<update startup="nocheck" />

which assumes that the collection has not changed since the last time the TDS was run. This saves a lot of processing on large collections that you know don't change.

If you want the collection to be tested at startup to see if it has changed since the last time the TDS was run, use:

<update startup="true" />

Otherwise the collection will be checked for changes and created when the first request for it comes in.
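
Put together, a static collection might be configured along these lines. This is a minimal sketch: the name, path, and spec are hypothetical placeholders, not values from a real server.

```xml
<!-- Hypothetical static archive: the name, path, and spec below are placeholders. -->
<featureCollection name="Archive-2010" featureType="FMRC" path="fmrc/archive/2010">
  <!-- No recheckAfter attribute: requests do not cause a directory rescan. -->
  <collection spec="/data/archive/2010/model_#yyyyMMdd_HHmm#.grib2$" />
  <!-- No rescan attribute: at startup, assume the collection is unchanged since the last run. -->
  <update startup="nocheck" />
</featureCollection>
```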

Changing Collection - Small or rarely used

For collections that change but are rarely used, use the recheckAfter attribute on the collection element. This minimizes unneeded processing for lightly used collections. This is also a good strategy for small collections which don't take very long to build.
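
A minimal sketch of this strategy follows. The name, path, spec, and time values are hypothetical and chosen only for illustration.

```xml
<!-- Hypothetical lightly used collection: name, path, spec, and time values are placeholders. -->
<featureCollection name="Buoy-Obs" featureType="Point" path="obs/buoy">
  <!-- Rescan when a request arrives and at least 60 minutes have passed since the last scan;
       leave out files modified within the last 5 minutes. -->
  <collection spec="/data/obs/buoy/Buoy_#yyyyMMdd_HHmm#.nc$"
              recheckAfter="60 min" olderThan="5 min" />
  <!-- No update element: no background updating is configured for this collection. -->
</featureCollection>
```

The trade-off is that the first request after the recheck interval has elapsed waits for the scan to finish.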

Changing Collection - Fast response

When you want to ensure that requests are answered as quickly as possible, update the collection in the background using the rescan attribute of the update element.

Sporadically changing Collection

To externally control when a collection is updated, use:

<update trigger="allow" />

You must enable remote management. When the dataset changes, send a trigger request to the HTTPS admin URL given in the description of the update element.
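
A sketch of an externally triggered collection is shown below. The name, path, and spec are hypothetical. The same name is used on both the featureCollection and the collection element so that the name in the trigger URL is unambiguous; that is a cautious choice for this example, not a documented requirement.

```xml
<!-- Hypothetical externally triggered collection: name, path, and spec are placeholders. -->
<featureCollection name="Radar-Mosaic" featureType="FMRC" path="fmrc/radar/mosaic">
  <collection name="Radar-Mosaic"
              spec="/data/radar/mosaic/Mosaic_#yyyyMMdd_HHmm#.grib2$"
              olderThan="1 min" />
  <!-- No rescan schedule: after startup, the collection is updated only when a trigger request arrives. -->
  <update startup="true" trigger="allow" />
</featureCollection>
<!-- Trigger URL, following the pattern given for the update element:
     https://server:port/thredds/admin/collection?trigger=true&collection=Radar-Mosaic -->
```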


NcML Modifications

NcML is no longer used to define the collection, but it may still be used to modify the feature collection dataset.

Old way:

<datasetFmrc name="RTOFS Forecast Model Run Collection" path="fmrc/rtofs">
  <netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">

 1) <variable name="mixed_layer_depth">
     <attribute name="long_name" value="mixed_layer_depth @ surface"/>
     <attribute name="units" value="m"/>
    </variable>

   <aggregation dimName="runtime" type="forecastModelRunSingleCollection" timeUnitsChange="true" recheckEvery="10 min">
 
 2)  <variable name="time">
      <attribute name="units" value="hours since "/>
     </variable>
   
 3)  <scanFmrc location="c:/rps/cf/rtofs" regExp=".*ofs_atl.*\.grib2$" 
       runDateMatcher="#ofs.#yyyyMMdd" forecastOffsetMatcher="HHH#.grb.grib2#" subdirs="true"
       olderThan="10 min"/> 
   </aggregation>
  </netcdf>
 </datasetFmrc>

where:

  1. on the outside of the aggregation, attributes are being added/modified for the existing variable "mixed_layer_depth" in the resulting FMRC dataset.
  2. on the inside of the aggregation, an attribute is being added/modified for the existing variable "time" for each dataset in the collection. Typically you need to do this in order to make the component files into a gridded dataset.
  3. the collection is defined by a scanFmrc element, creating a forecastModelRunSingleCollection with one forecast time per file

New way:

<featureCollection name="RTOFS Forecast Model Run Collection" featureType="FMRC" path="fmrc/rtofs">
1) <collection spec="c:/rps/cf/rtofs/.*ofs_atl.*\.grib2$" recheckAfter="10 min" olderThan="5 min"/>

2) <netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
    <variable name="time">
      <attribute name="units" value="hours since "/>
     </variable>
   </netcdf>

   <protoDataset>
3)  <netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
      <variable name="mixed_layer_depth">
       <attribute name="long_name" value="mixed_layer_depth @ surface"/>
       <attribute name="units" value="m"/>
      </variable>
     </netcdf>
   </protoDataset>
   
</featureCollection>

where:

  1. the collection is now defined by a collection element, allowing any number of forecast times per file
  2. when you want to modify the component files of the collection, you put an NcML element inside the featureCollection element. This modifies the component files before they are turned into a gridded dataset.
  3. when you want to modify the resulting FMRC dataset, you put an NcML element inside the protoDataset element. This defines the "prototypical" dataset used as the template for the resulting FMRC datasets.


This document is maintained by John Caron and was last updated June 2011