Configuring TDS with DatasetScan


The datasetScan element specifies the data locations that the TDS will scan for datasets when generating catalogs. It also specifies which URLs will point to the data in those directories.

Example

Here is a minimal catalog containing a datasetScan element:

<?xml version="1.0" encoding="UTF-8"?>
<catalog name="Unidata Workshop 2006 - NCEP Model Data" version="1.0.1"
    xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"
    xmlns:xlink="http://www.w3.org/1999/xlink">

  <service name="myserver" serviceType="OpenDAP" base="/thredds/dodsC/" />
  <datasetScan name="NCEP Data" path="ncep" location="/data/ldm/pub/native/grid/NCEP/">
    <serviceName>myserver</serviceName>
  </datasetScan>
</catalog>

The main points are:

  1. The path attribute on the datasetScan element is the part of the URL that identifies this datasetScan and is used to map dataset URLs to a location.
  2. The location attribute on the datasetScan element gives the location of the dataset collection on the local file system.

In the catalog that the TDS sends to a client, the datasetScan element appears as a catalog reference:

<catalog name="Unidata Workshop 2006 - NCEP Model Data" version="1.0.1"
    xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"
    xmlns:xlink="http://www.w3.org/1999/xlink">

  <service name="myserver" serviceType="OpenDAP" base="/thredds/dodsC/" />
  <catalogRef xlink:href="/thredds/catalog/ncep/catalog.xml" xlink:title="NCEP Data" name="" />
</catalog>

The catalog will be generated dynamically on the server when requested, by scanning the server's directory /data/ldm/pub/native/grid/NCEP/. For example, if the directory looked like:

/data/ldm/pub/native/grid/NCEP/
  GFS/
    CONUS_191km/
      GFS_CONUS_191km_20061107_0000.grib1
      GFS_CONUS_191km_20061107_0000.grib1.gbx
      GFS_CONUS_191km_20061107_0600.grib1
      GFS_CONUS_191km_20061107_1200.grib1
    CONUS_80km/
      ...
    ...
  NAM/
    ...
  NDFD/
    ...

The result of a request for "/thredds/catalog/ncep/catalog.xml" might look like:

<catalog ...>
  <service name="myserver" serviceType="OpenDAP" base="/thredds/dodsC/" />
  <dataset name="NCEP Data">
    <metadata inherited="true">
      <serviceName>myserver</serviceName>
    </metadata>
    <catalogRef xlink:title="GFS" xlink:href="GFS/catalog.xml" name="" />
    <catalogRef xlink:title="NAM" xlink:href="NAM/catalog.xml" name="" />
    <catalogRef xlink:title="NDFD" xlink:href="NDFD/catalog.xml" name="" />
  </dataset>
</catalog>

and for a "/thredds/catalog/ncep/GFS/CONUS_191km/catalog.xml" request:

<catalog ...>
  <service name="myserver" serviceType="OpenDAP" base="/thredds/dodsC/" />
  <dataset name="ncep/GFS/CONUS_191km">
    <metadata inherited="true">
      <serviceName>myserver</serviceName>
    </metadata>
    <dataset name="GFS_CONUS_191km_20061107_0000.grib1"
        urlPath="ncep/GFS/CONUS_191km/GFS_CONUS_191km_20061107_0000.grib1" />
    <dataset name="GFS_CONUS_191km_20061107_0000.grib1.gbx"
        urlPath="ncep/GFS/CONUS_191km/GFS_CONUS_191km_20061107_0000.grib1.gbx" />
    <dataset name="GFS_CONUS_191km_20061107_0600.grib1"
        urlPath="ncep/GFS/CONUS_191km/GFS_CONUS_191km_20061107_0600.grib1" />
    <dataset name="GFS_CONUS_191km_20061107_1200.grib1"
        urlPath="ncep/GFS/CONUS_191km/GFS_CONUS_191km_20061107_1200.grib1" />
  </dataset>
</catalog>

Note that:

  1. Files are turned into datasets; subdirectories are turned into nested catalogRef elements.
  2. All the catalog URLs are relative and are resolved against the URL of the catalog that contains them. If the original catalog URL is http://server:8080/thredds/catalog.xml, then the first catalogRef xlink:href value of "/thredds/catalog/ncep/catalog.xml" resolves to http://server:8080/thredds/catalog/ncep/catalog.xml. From that catalog, the catalogRef xlink:href value of "GFS/catalog.xml" resolves to http://server:8080/thredds/catalog/ncep/GFS/catalog.xml.
  3. The dataset access URLs are built from the service base attribute and the dataset urlPath attribute (see the THREDDS URL construction documentation). For example, the access URL for the first dataset in the above catalog would be http://server:8080/thredds/dodsC/ncep/GFS/CONUS_191km/GFS_CONUS_191km_20061107_0000.grib1. You don't have to worry about these URLs, as they are all generated for you.
  4. Each datasetScan element must reference a service element (whether directly, as above, or inherited).
  5. Because the TDS uses the set of all given path values to map URLs to datasets, each path value MUST be unique across all config catalogs on a given TDS installation.
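
As a rough illustration of notes 2 and 3, the following Python sketch resolves the relative catalog references and builds one access URL. It uses only the standard library's urljoin; the variable names are invented for the example, and a real client may differ in detail.

```python
from urllib.parse import urljoin

# URL of the top-level catalog, as in note 2.
top_catalog = "http://server:8080/thredds/catalog.xml"

# A catalogRef xlink:href is resolved against the URL of the catalog containing it.
ncep_catalog = urljoin(top_catalog, "/thredds/catalog/ncep/catalog.xml")
gfs_catalog = urljoin(ncep_catalog, "GFS/catalog.xml")

# URL of the catalog that lists the individual GFS CONUS 191km files.
dataset_catalog = "http://server:8080/thredds/catalog/ncep/GFS/CONUS_191km/catalog.xml"

# An access URL is the service base (resolved against the catalog URL)
# followed by the dataset's urlPath.
service_base = "/thredds/dodsC/"
url_path = "ncep/GFS/CONUS_191km/GFS_CONUS_191km_20061107_0000.grib1"
access_url = urljoin(dataset_catalog, service_base) + url_path

print(ncep_catalog)
print(gfs_catalog)
print(access_url)
```
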
[Workshop: Sample config catalog 1.]

Inherited Metadata

The datasetScan element is an extension of a dataset element, and it can contain any of the metadata elements that a dataset can. Typically you want all of its contained datasets to inherit the metadata, so add an inherited metadata element contained in the datasetScan element:

<catalog name="Unidata Workshop 2006 - NCEP Model Data" version="1.0.1"
    xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0">

  <service name="myserver" serviceType="OpenDAP" base="/thredds/dodsC/" />
  <datasetScan name="NCEP Data" path="ncep" location="/data/ldm/pub/native/grid/NCEP/">
    <metadata inherited="true">
      <serviceName>myserver</serviceName>
      <authority>unidata.ucar.edu:</authority>
      <dataType>Grid</dataType>
    </metadata>
  </datasetScan>
</catalog>

Including Only the Desired Files

A datasetScan element can specify which files and directories it will include with a filter element (see spec for more details). When no filter element is given, all files and directories are included in the generated catalog(s). Adding a filter element to your datasetScan element allows you to include (and/or exclude) files and directories as you desire. For instance, the following filter and selector elements will include only files that end in ".grib1" and will exclude any file that ends with "_0000.grib1".

<filter>
  <include wildcard="*.grib1"/>
  <exclude wildcard="*_0000.grib1"/>
</filter>

You can specify which files to include or exclude using either wildcard patterns (with the wildcard attribute) or regular expressions (using the regExp attribute). If the wildcard pattern (or the regular expression) matches the dataset name, the dataset is included or excluded as specified. By default, includes and excludes apply only to regular files (atomic datasets). You can specify that they apply to directories (collection datasets) as well by using the atomic and collection attributes. For instance, the additional selector in this filter element means that only directories that don't start with "CONUS" will be cataloged:

<filter>
  <include wildcard="*.grib1"/>
  <exclude wildcard="*_0000.grib1"/>
  <exclude wildcard="CONUS*" atomic="false" collection="true"/>
</filter>
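
The selection logic described above can be approximated in a few lines of Python. This is only a sketch of the behaviour as described here, not the TDS implementation. The Selector type and the accepted function are hypothetical names, and wildcards are matched with Python's fnmatchcase. It also assumes that a matching exclude always wins and that an entry with no applicable include selector is kept.

```python
from fnmatch import fnmatchcase
from typing import NamedTuple


class Selector(NamedTuple):
    wildcard: str
    include: bool            # True for <include>, False for <exclude>
    atomic: bool = True      # applies to regular files by default
    collection: bool = False # applies to directories only when requested


# The filter shown above, expressed as data.
FILTER = [
    Selector("*.grib1", include=True),
    Selector("*_0000.grib1", include=False),
    Selector("CONUS*", include=False, atomic=False, collection=True),
]


def accepted(name: str, is_directory: bool, selectors=FILTER) -> bool:
    """Decide whether a file or directory name would be cataloged."""
    applicable = [s for s in selectors
                  if (s.collection if is_directory else s.atomic)]
    # Assumption: any matching exclude rejects the entry.
    if any(not s.include and fnmatchcase(name, s.wildcard) for s in applicable):
        return False
    includes = [s for s in applicable if s.include]
    # Assumption: with no applicable include selectors, everything is kept.
    return not includes or any(fnmatchcase(name, s.wildcard) for s in includes)


for name, is_dir in [("GFS_CONUS_191km_20061107_0600.grib1", False),
                     ("GFS_CONUS_191km_20061107_0000.grib1", False),
                     ("CONUS_191km", True),
                     ("GFS", True)]:
    print(name, accepted(name, is_dir))
```
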

It's a good idea to always include a filter element, so that if stray files accidentally get into your data directories, they won't generate erroneous catalog entries. A good choice for this datasetScan would be something like:

<filter>
  <include wildcard="*.grib1"/>
  <include wildcard="*.grib2"/>
  <exclude wildcard="*.gbx"/>
</filter>

[Workshop: Sample config catalog 2.]

Generating IDs

All generated datasets are given an ID. The ID is the path of the dataset relative to the scanned location, appended to the datasetScan path value or, if one exists, to the ID of the datasetScan element. For example, for a datasetScan with a path of "grib2" and a location of "c:/data/grib2/", the file "c:/data/grib2/data1.grib1" would result in a dataset with the ID "grib2/data1.grib1". Adding an ID of "my/data/model" to the datasetScan element would change the resulting ID to "my/data/model/data1.grib1".
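
The rule can be written out as a short Python sketch. The function name generated_id and its parameters are hypothetical and chosen only for illustration; this is not how the TDS implements it.

```python
from typing import Optional


def generated_id(file_path: str, location: str, scan_path: str,
                 scan_id: Optional[str] = None) -> str:
    """Build a dataset ID from the datasetScan path (or ID) and the file's relative path."""
    if not file_path.startswith(location):
        raise ValueError(f"{file_path} is not under {location}")
    relative = file_path[len(location):].lstrip("/")
    # The datasetScan ID, when present, replaces the path value as the prefix.
    base = scan_id if scan_id is not None else scan_path
    return f"{base.rstrip('/')}/{relative}"


print(generated_id("c:/data/grib2/data1.grib1", "c:/data/grib2/", "grib2"))
print(generated_id("c:/data/grib2/data1.grib1", "c:/data/grib2/", "grib2",
                   scan_id="my/data/model"))
```
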

Naming Datasets

If no namer element is specified, all datasets are named with the corresponding file name. By adding a namer element, you can specify more human readable dataset names. The following namer looks for datasets named "GFS" or "NCEP" and renames them with the corresponding replace string:

<namer>
  <regExpOnName regExp="GFS" replaceString="NCEP GFS model data" />
  <regExpOnName regExp="NCEP" replaceString="NCEP model data"/>
</namer>

More complex renaming is possible as well. The namer uses a regular expression match on the dataset name. If the match succeeds, any regular expression capturing groups are used in the replacement string.

A capturing group is a part of a regular expression enclosed in parentheses. When a regular expression with a capturing group is applied to a string, the substring that matches the capturing group is saved for later use. The captured strings can then be substituted into another string in place of capturing group references, "$n", where "n" is an integer indicating a particular capturing group. (The capturing groups are numbered according to the order in which they appear in the match string.) For example, the regular expression "Hi (.*), how are (.*)\?" when applied to the string "Hi Fred, how are you?" would capture the strings "Fred" and "you". (The final question mark is escaped so that it matches a literal "?" rather than acting as a quantifier.) Following that with a capturing group replacement in the string "$2 are $1." would result in the string "you are Fred."
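
The greeting example can be checked with a small Python sketch. Python's own replacement syntax uses "\1" rather than "$1", so the hypothetical helper expand below emulates the "$n" references described here. The trailing question mark is escaped so that it is matched literally.

```python
import re


def expand(match: re.Match, template: str) -> str:
    """Replace each "$n" in template with the text captured by group n."""
    return re.sub(r"\$(\d+)",
                  lambda ref: match.group(int(ref.group(1))),
                  template)


pattern = re.compile(r"Hi (.*), how are (.*)\?")
match = pattern.search("Hi Fred, how are you?")

captured = match.groups()             # the two captured substrings
sentence = expand(match, "$2 are $1.")

print(captured)
print(sentence)
```
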

Here's an example namer:

<namer>
  <regExpOnName regExp="([0-9]{4})([0-9]{2})([0-9]{2})_([0-9]{2})([0-9]{2})"
      replaceString="NCEP GFS 191km Alaska $1-$2-$3 $4:$5:00 GMT"/>
</namer>

The regular expression has five capturing groups:

  1. The first capturing group, "([0-9]{4})",  captures four digits, in this case the year.
  2. The second capturing group, "([0-9]{2})", captures two digits, in this case the month.
  3. The third capturing group, "([0-9]{2})", captures two digits, in this case the day of the month.
  4. The fourth capturing group, "([0-9]{2})", captures two digits, in this case the hour of the day.
  5. The fifth capturing group, "([0-9]{2})", captures two digits, in this case the minutes of the hour.

When applied to the dataset name "GFS_Alaska_191km_20051011_0000.grib1", the strings "2005", "10", "11", "00", and "00" are captured. After replacing the capturing group references in the replaceString attribute value, we get the name "NCEP GFS 191km Alaska 2005-10-11 00:00:00 GMT". So, when cataloged, this dataset would end up as something like this:

<dataset name="NCEP GFS 191km Alaska 2005-10-11 00:00:00 GMT"
    urlPath="models/NCEP/GFS/Alaska_191km/GFS_Alaska_191km_20051011_0000.grib1"/>

Sorting Datasets

Without a sort element, datasets at each collection level are listed in their "natural" order. With a sort element you can specify that they are to be sorted by name in lexicographic order, either increasing or decreasing. (Note that the element name itself is spelled lexigraphicByName.) For example:

<sort>
  <lexigraphicByName increasing="false" />
</sort>

Other sort order functionality will be looked at for future enhancements.

Notes:

  1. The "natural" order of the datasets is determined by the order returned by the listDatasets() method in CrawlableDataset.
  2. The sort is done on the CrawlableDataset list. The naming discussed in the previous section is done to the resulting InvDataset. Therefore, the naming discussed above does not affect the sort order.
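
The order of operations in note 2 can be pictured with a short Python sketch. The file names come from the earlier GFS example; display_name is a hypothetical stand-in for a namer, and the code only illustrates the described sequence rather than the TDS internals.

```python
# File names in whatever order the directory scan happened to return them.
scanned = [
    "GFS_CONUS_191km_20061107_0600.grib1",
    "GFS_CONUS_191km_20061107_0000.grib1",
    "GFS_CONUS_191km_20061107_1200.grib1",
]


def display_name(file_name: str) -> str:
    """Hypothetical namer: turn the HHMM part of the file name into a label."""
    hhmm = file_name[-10:-6]
    return f"GFS run {hhmm[:2]}:{hhmm[2:]}"


# Step 1: sort on the file names (increasing="false" means descending order).
ordered = sorted(scanned, reverse=True)

# Step 2: rename afterwards, so the labels never influence the order.
labels = [display_name(name) for name in ordered]

print(labels)
```
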

Adding Proxy Datasets

The addProxies element provides a place for describing proxy datasets you want to add to the collection. Currently, only two types of "latest" proxy datasets are supported. The simpleLatest element adds a proxy dataset that points to the existing dataset whose name is lexicographically greatest (which finds the latest dataset, assuming a timestamp is part of the dataset name). The latestComplete element behaves similarly, except that it ignores any datasets that have been modified more recently than a given limit; for example, you could ask for the lexicographically greatest dataset that hasn't been modified for 60 minutes. Both the simpleLatest and latestComplete elements must point to an existing service.

So, the datasetScan might look like this:

<service name="latest" serviceType="Resolver" base="" />
<datasetScan name="GRIB2 Data" path="grib2" location="c:/data/grib2/" serviceName="myserver">
  <addProxies>
    <simpleLatest />
    <latestComplete name="latestComplete.xml" top="true" serviceName="latest" lastModifiedLimit="60" />
  </addProxies>
</datasetScan>

The latestComplete element takes four attributes: the name attribute provides the name of the proxy dataset; the serviceName attribute references the service used by the proxy dataset; the top attribute indicates whether the proxy dataset should appear at the top or bottom of the list of datasets in this collection; and the lastModifiedLimit attribute sets how long, in minutes, a dataset must have gone unmodified before it can be chosen as the proxied dataset.

The simpleLatest element allows for the same attributes as the latestComplete element minus the lastModifiedLimit attribute. In this case, all the attributes have default values: the name attribute defaults to "latest.xml", the top attribute defaults to "true", and the serviceName attribute defaults to "latest".

The results would be something like:

<dataset name="GRIB2 Data" ID="testdata">
  <dataset name="latestComplete.xml" serviceName="latest" urlPath="latestComplete.xml" />
  <dataset name="latest.xml" serviceName="latest" urlPath="latest.xml" />
  <dataset name="200610130730.nc" urlPath="200610130730.nc" />
  <dataset name="200406190916.nc" urlPath="200406190916.nc" />
</dataset>

More details are available in the Server-side InvCatalog specification document.
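
To make the difference between the two proxy types concrete, here is a minimal Python sketch of the selection rules as described in this section. The function names, the modification times, and the third file name ("200610130830.nc") are invented for the example; the actual TDS algorithm may handle edge cases differently.

```python
from datetime import datetime, timedelta
from typing import Dict, Iterable, Optional


def simple_latest(names: Iterable[str]) -> Optional[str]:
    """Pick the lexicographically greatest dataset name."""
    names = list(names)
    return max(names) if names else None


def latest_complete(modified: Dict[str, datetime], now: datetime,
                    limit_minutes: int) -> Optional[str]:
    """Pick the greatest name among datasets left unmodified for limit_minutes."""
    cutoff = now - timedelta(minutes=limit_minutes)
    settled = [name for name, stamp in modified.items() if stamp <= cutoff]
    return max(settled) if settled else None


# Hypothetical last-modified times; the newest file is still being written.
now = datetime(2006, 10, 13, 9, 0)
modified = {
    "200406190916.nc": datetime(2004, 6, 19, 9, 20),
    "200610130730.nc": datetime(2006, 10, 13, 7, 35),
    "200610130830.nc": datetime(2006, 10, 13, 8, 50),
}

print(simple_latest(modified))             # newest by name
print(latest_complete(modified, now, 60))  # newest by name that has settled
```
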

Adding Dataset Size Information

The addDatasetSize element indicates that file size metadata should be added to all atomic datasets. Adding it to the example above,

<datasetScan name="GRIB2 Data" path="grib2" location="c:/data/grib2/" serviceName="myserver">
  <addDatasetSize />
</datasetScan>

results in the addition of a dataSize element to each atomic dataset:

<dataset name="GRIB2 Data" ID="testdata">
  <dataset name="data1.grib1" urlPath="data1.grib1">
    <dataSize units="Kbytes">6.08</dataSize>
  </dataset>
  <dataset name="data2.grib1" urlPath="data2.grib1">
    <dataSize units="Mbytes">4.961</dataSize>
  </dataset>
  <catalogRef xlink:href="subdir/catalog.xml" xlink:title="subdir" />
</dataset>

[Workshop: Sample config catalog 3.]

Adding timeCoverage Elements

A datasetScan element may contain an addTimeCoverage element. The addTimeCoverage element indicates that a timeCoverage metadata element should be added to each dataset in the collection and describes how to determine the time coverage for each dataset in the collection.

Currently, the addTimeCoverage element can only construct start/duration timeCoverage elements and uses the dataset name to determine the start time. As with the namer described in the "Naming Datasets" section above, the addTimeCoverage element applies a regular expression match to the dataset name. If the match succeeds, any regular expression capturing groups are used in the start time replacement string to build the start time string. These attribute values are used to determine the time coverage:

  1. The datasetNameMatchPattern attribute value is used for a regular expression match on the dataset name. If a match is found, a timeCoverage element is added to the dataset. The match pattern should include capturing groups which allow the match to save substrings from the dataset name.
  2. The startTimeSubstitutionPattern attribute value has all capture group references ("$n") replaced by the corresponding substring that was captured during the match. The resulting string is used as the start value of the resulting timeCoverage element.
  3. The duration attribute value is used as the duration value of the resulting timeCoverage element.

Example 1: The addTimeCoverage element,

<datasetScan name="GRIB2 Data" path="grib2" location="c:/data/grib2/" serviceName="myserver">
  <addTimeCoverage datasetNameMatchPattern="([0-9]{4})([0-9]{2})([0-9]{2})_([0-9]{2})([0-9]{2})\.grib1$"
      startTimeSubstitutionPattern="$1-$2-$3T$4:00:00"
      duration="60 hours" />
</datasetScan>

results in the following timeCoverage element for a dataset whose name ends in, for example, "20050718_1200.grib1":

<timeCoverage>
  <start>2005-07-18T12:00:00</start>
  <duration>60 hours</duration>
</timeCoverage>

Future versions will allow more complex determinations of the timeCoverage element.
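
The three attribute roles listed above can be traced in a small Python sketch. It is an illustration under stated assumptions, not the TDS code: the function name time_coverage and the sample dataset name are hypothetical, the "$n" references are expanded by hand because Python uses a different replacement syntax, and the dot before "grib1" is escaped so that it matches literally.

```python
import re
from typing import Optional

# Values corresponding to the three addTimeCoverage attributes.
NAME_PATTERN = re.compile(
    r"([0-9]{4})([0-9]{2})([0-9]{2})_([0-9]{2})([0-9]{2})\.grib1$")
START_TEMPLATE = "$1-$2-$3T$4:00:00"
DURATION = "60 hours"


def time_coverage(dataset_name: str) -> Optional[dict]:
    """Return the start/duration pair for a dataset, or None if the name does not match."""
    match = NAME_PATTERN.search(dataset_name)
    if match is None:
        return None  # no match, so no timeCoverage element is added
    # Substitute each "$n" reference with the corresponding captured group.
    start = re.sub(r"\$(\d+)",
                   lambda ref: match.group(int(ref.group(1))),
                   START_TEMPLATE)
    return {"start": start, "duration": DURATION}


# Hypothetical dataset name containing the timestamp 2005-07-18 12:00.
print(time_coverage("GFS_Alaska_191km_20050718_1200.grib1"))
print(time_coverage("readme.txt"))
```
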


Sample Config Files

Basic catalog:

<?xml version="1.0" encoding="UTF-8"?>
<catalog name="Unidata Workshop 2006 - NCEP Model Data" version="1.0.1"
    xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"
    xmlns:xlink="http://www.w3.org/1999/xlink">

  <service name="myserver" serviceType="OpenDAP" base="/thredds/dodsC/" />
  <datasetScan name="NCEP Data" path="ncep" location="/data/ldm/pub/native/grid/NCEP/">
    <serviceName>myserver</serviceName>
  </datasetScan>
</catalog>

Catalog with filtering added:

<?xml version="1.0" encoding="UTF-8"?>
<catalog name="Unidata Workshop 2006 - NCEP Model Data" version="1.0.1"
    xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"
    xmlns:xlink="http://www.w3.org/1999/xlink">

  <service name="myserver" serviceType="OpenDAP" base="/thredds/dodsC/" />
  <datasetScan name="NCEP Data" path="ncep" location="/data/ldm/pub/native/grid/NCEP/">
    <serviceName>myserver</serviceName>
    <filter>
      <include wildcard="*.grib1"/>
      <include wildcard="*.grib2"/>
      <exclude wildcard="*.gbx"/>
    </filter>
  </datasetScan>
</catalog>

Catalog with sorting and dataset size added:

<?xml version="1.0" encoding="UTF-8"?>
<catalog name="Unidata Workshop 2006 - NCEP Model Data" version="1.0.1"
    xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"
    xmlns:xlink="http://www.w3.org/1999/xlink">

  <service name="myserver" serviceType="OpenDAP" base="/thredds/dodsC/" />
  <datasetScan name="NCEP Data" path="ncep" location="/data/ldm/pub/native/grid/NCEP/">
    <serviceName>myserver</serviceName>
    <filter>
      <include wildcard="*.grib1"/>
      <include wildcard="*.grib2"/>
      <exclude wildcard="*.gbx"/>
    </filter>
    <sort>
      <lexigraphicByName increasing="false" />
    </sort>
    <addDatasetSize/>
  </datasetScan>
</catalog>

This document is maintained by Ethan Davis and was last updated on July 25, 2007