TDS Configuration Catalogs

More THREDDS Catalog Review

The service Element

Compound service Elements - Serving Datasets with Multiple Methods

Datasets can be made available through more than one access method by defining and then referencing a compound service element. The following:

<service name="all" serviceType="Compound" base="" >
    <service name="odap" serviceType="OpenDAP" base="/thredds/dodsC/" />
    <service name="wcs" serviceType="WCS" base="/thredds/wcs/" />
</service>

defines a compound service named "all" which contains two nested services. Any dataset that references the compound service will have two access methods. For instance:

<dataset name="cool data" urlPath="so/cool/data.nc" >
    <serviceName>all</serviceName>
</dataset>

would result in these two access URLs:

/thredds/dodsC/so/cool/data.nc
/thredds/wcs/so/cool/data.nc

Note: The contained services can still be referenced independently. For instance:

<dataset name="more cool data" urlPath="more/cool/data.nc" >
    <serviceName>odap</serviceName>
</dataset>

results in a single access URL:

/thredds/dodsC/more/cool/data.nc
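
The way a compound service expands into access URLs can be sketched in Python. This is only an illustration of the rule described above, not TDS code: the function name `access_urls` and the dictionary layout are assumptions made for this example.

```python
# Illustrative sketch only: mimics how a compound service yields one access
# URL per nested service. The data layout and all names are hypothetical.

SERVICES = {
    "all": {"serviceType": "Compound", "base": "", "nested": ["odap", "wcs"]},
    "odap": {"serviceType": "OpenDAP", "base": "/thredds/dodsC/", "nested": []},
    "wcs": {"serviceType": "WCS", "base": "/thredds/wcs/", "nested": []},
}


def access_urls(service_name, url_path, services=SERVICES):
    """Return the access URLs a dataset gets from the named service."""
    service = services[service_name]
    if service["serviceType"] == "Compound":
        # A compound service contributes the URLs of each nested service.
        urls = []
        for nested_name in service["nested"]:
            urls.extend(access_urls(nested_name, url_path, services))
        return urls
    # A simple service contributes exactly one URL: base + urlPath.
    return [service["base"] + url_path]


print(access_urls("all", "so/cool/data.nc"))
print(access_urls("odap", "more/cool/data.nc"))
```

Referencing "all" yields both URLs, while referencing the nested "odap" service directly yields only the OPeNDAP URL.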

Unique Names Required for each service Element in a Catalog

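Within a catalog, the service name is used to reference a service element, so service names must be unique within each catalog. [Note: They need not be unique globally within a TDS, only within each catalog.] In the following fragment, for example, the names "odap" and "http" each appear twice, which would violate this rule if both compound services were defined in the same catalog: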

<service name="all" serviceType="Compound" base="" >
    <service name="odap" serviceType="OpenDAP" base="/thredds/dodsC/" />
    <service name="http" serviceType="HTTPServer" base="/thredds/fileServer/" />
</service>
<service name="grid" serviceType="Compound" base="" >
    <service name="odap" serviceType="OpenDAP" base="/thredds/dodsC/" />
    <service name="wcs" serviceType="WCS" base="/thredds/wcs/" />
    <service name="wms" serviceType="WMS" base="/thredds/wms/" />
    <service name="http" serviceType="HTTPServer" base="/thredds/fileServer/" />
</service>

THREDDS Metadata

Linking to Metadata

<metadata xlink:title="some good metadata" xlink:href="http://my.server/md/data1.xml" />

Linking to Human Readable Metadata

<description xlink:title="My Data" xlink:href="http://my.server/md/data1.html" />

Inherited Metadata

...
    <dataset name="TDS Tutorial: example inherited metadata">
        <metadata inherited="true">
            <serviceName>odap</serviceName>
            <description>Really great data.</description>
            <keyword>Ocean</keyword>
            <keyword>Temperature</keyword>
            <creator>Ethan</creator>
            <publisher>Ethan</publisher>
            <date type="created">2008-10-30T14:22</date>
            <dataFormat>netCDF</dataFormat>
        </metadata>

        <dataset name="TDS Tutorial: example data 1" urlPath="test/example1.nc" />
        <dataset name="TDS Tutorial: example data 2" urlPath="test/example2.nc" />
        <dataset name="TDS Tutorial: example data 3" urlPath="test/example3.nc" />
        <dataset name="TDS Tutorial: example data 4" urlPath="test/example4.grib2">
            <dataFormat>GRIB-2</dataFormat>
        </dataset>

    </dataset>
...

TDS Configuration Catalogs

TDS Requirements for the service Elements

The TDS provides data access services at predefined URL base paths. Therefore, service base URLs must match the following values:

OPeNDAP

<service name="odap" serviceType="OPeNDAP" base="/thredds/dodsC/" />

NetCDF Subset Service

<service name="ncss" serviceType="NetcdfSubset" base="/thredds/ncss/" />

WCS

<service name="wcs" serviceType="WCS" base="/thredds/wcs/" />

WMS

<service name="wms" serviceType="WMS" base="/thredds/wms/" />

HTTP Bulk File Service

<service name="fileServer" serviceType="HTTPServer" base="/thredds/fileServer/" />

Data Requirement for Each Service

  • The "HTTPServer" service can serve any file.
  • The "OPeNDAP" service can serve any data file that the netCDF-Java library can open.
  • The "WCS" service can only serve data files that the netCDF-Java library can recognize as "gridded" data.
  • The "WMS" service also only serves "gridded" data files.
  • The "NetcdfSubset" service also only serves "gridded" data files.

You can check that a data file is recognized as "gridded" with netCDF-Java ToolsUI. (ToolsUI can be found on the netCDF-Java home page.)

TDS Configuration Catalogs and Metadata

The datasetScan element is an extension of the dataset element and so can contain metadata.

...
      <service name="odap" serviceType="OpenDAP" base="/thredds/dodsC/" />

      <datasetScan name="Test all files in a directory" ID="testDatasetScan"
                   path="my/test/all" location="/my/data/testdata">
          <metadata inherited="true">
              <serviceName>odap</serviceName>
              <keyword>Ocean</keyword>
              <keyword>Temperature</keyword>
              <creator>Ethan</creator>
              <publisher>Ethan</publisher>
              <date type="created">2008-10-30T14:22</date>
          </metadata>
      </datasetScan>
...

All generated catalogs that are descendants of this datasetScan will contain all of the inherited metadata contained in the datasetScan element. For instance, here is a resulting child catalog:

...
      <service name="odap" serviceType="OpenDAP" base="/thredds/dodsC/" />

      <dataset name="Test all files in a directory" ID="testDatasetScan" >
          <metadata inherited="true">
              <serviceName>odap</serviceName>
              <keyword>Ocean</keyword>
              <keyword>Temperature</keyword>
              <creator>Ethan</creator>
              <publisher>Ethan</publisher>
              <date type="created">2008-10-30T14:22</date>
          </metadata>
          <dataset name="afile.nc" ID="testDatasetScan/afile.nc" urlPath="my/test/all/afile.nc" />
          <dataset name="testData.nc" ID="testDatasetScan/testData.nc" urlPath="my/test/all/testData.nc" />
          <dataset name="junk.nc" ID="testDatasetScan/junk.nc" urlPath="my/test/all/junk.nc" />

          <catalogRef xlink:title="grib" ID="testDatasetScan/grib" name=""
                      xlink:href="/thredds/catalog/my/test/all/grib/catalog.xml" />
      </dataset>
...

TDS Root Catalog

At startup, the TDS reads the root catalog

${TOMCAT_HOME}/content/thredds/catalog.xml

and, recursively, all configuration catalogs that are linked to it through a relative catalogRef element. The catalogs in the resulting tree are used as the top-level catalogs served by the TDS. In the case of our distributed root catalog, the tree looks like:

catalog.xml
|
|-- enhancedCatalog.xml

The tree of configuration catalogs can be as deeply nested as desired.

Additional Root Catalogs

Additional root configuration catalogs can be defined in the

${TOMCAT_HOME}/content/thredds/threddsConfig.xml

file. For instance, to add a test catalog, add the following line:

<catalogRoot>myTestCatalog.xml</catalogRoot>

Each additional root configuration catalog can be the root of another tree of configuration catalogs.

Tools to Manage Configuration Catalogs

First, the TDS catalog errors log

${TOMCAT_HOME}/content/thredds/logs/catalogErrors.log 

contains all warning and error messages from parsing the configuration catalogs. As such, it is a great place to look for information if you run into problems with your TDS configuration catalogs.

Second, the TDS Remote Management page provides access to a list of all the configuration catalogs the TDS has successfully read.

Managing datasetRoot and datasetScan Elements

You can have as many datasetRoot and datasetScan elements as you want, for example:

<datasetRoot path="model" location="/data/ncep" />
<datasetRoot path="obs" location="/data/raw/metars" />
<datasetRoot path="cases/001" location="C:/casestudy/data/001" />
<datasetScan path="myData" location="/data/ncep/run0023" name="NCEP/RUN 23" serviceName="myserver" />
<datasetScan path="myData/gfs" location="/pub/ldm/gfs" name="NCEP/GFS" serviceName="myserver" />

Each datasetRoot and datasetScan element is said to define a data root.

The Rules for Data Roots

  • Each accessible dataset must be associated with a data root, i.e. the beginning part of its path must match a data root path. If there are multiple matches, the longest match is used.
  • Each data root must have a unique path for all catalogs used by the TDS.

    Note: Because the TDS uses the set of all given path values to map URLs to datasets, each path value MUST be unique across all config catalogs on a given TDS installation. Duplicates will cause warning messages in the catalogErrors.log file.

  • The directory pointed to by location should be absolute.
  • The same location may be used in multiple data roots.

For example, using the above data roots, the following matches would be made:

urlPath                    file
model/run0023/mydata.nc    /data/ncep/run0023/mydata.nc
obs/test.nc                /data/raw/metars/test.nc
myData/mydata.nc           /data/ncep/run0023/mydata.nc
myData/gfs/mydata.nc       /pub/ldm/gfs/mydata.nc
cases/001/test/area/two    C:/casestudy/data/001/test/area/two
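
The longest-match rule can be sketched in Python as follows. This is a rough illustration rather than the TDS implementation: the function name `resolve`, the dictionary of data roots, and the choice to match only on whole path segments are assumptions made for this example.

```python
# Illustrative sketch only: longest-prefix matching of a urlPath against
# data root paths. The function name and data layout are hypothetical.

DATA_ROOTS = {
    "model": "/data/ncep",
    "obs": "/data/raw/metars",
    "cases/001": "C:/casestudy/data/001",
    "myData": "/data/ncep/run0023",
    "myData/gfs": "/pub/ldm/gfs",
}


def resolve(url_path, data_roots=DATA_ROOTS):
    """Map a urlPath to a file location using the longest matching data root."""
    # Assumption: a root matches only on a path-segment boundary.
    matches = [
        root for root in data_roots
        if url_path == root or url_path.startswith(root + "/")
    ]
    if not matches:
        return None  # no data root matches, so the dataset is not accessible
    best = max(matches, key=len)  # the longest match wins
    return data_roots[best] + url_path[len(best):]


for path in ["model/run0023/mydata.nc", "myData/mydata.nc", "myData/gfs/mydata.nc"]:
    print(path, "->", resolve(path))
```

Note that "myData/gfs/mydata.nc" matches both "myData" and "myData/gfs"; the longer root is the one used.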

The structure of a full OPeNDAP URL for the first urlPath above would look like:

http://hostname:port/thredds/dodsC/model/run0023/mydata.nc
|<---  server   --->|<----->|<--->|<--->|<-   filename  ->|
                        |      |     |
           webapp name -|      |     |- data root
                               |
                      service -|

TDS Remote Management - List of Dataset Roots

The TDS Remote Management page has a link to list all known dataset roots.

Exercise: Managing multiple roots

  1. Add a few more datasetScan elements (/data/ldm/fsl, /data/ldm/madis, /data/ldm/suomi):
    [thredds@workshop00 ~]$ ls /data/ldm
    bufr dusk dusk.080527 fsl ldm.pq ltng mcidas ngrid nogaps rcm severe surface wseta
    cosmic dusk.080522 forecasts gempak logs madis nam_12km nldn rawfiles rtmodel suomi upperair
    [thredds@workshop00 ~]$ ls /data/ldm/fsl
    01hr 06min RASS
    [thredds@workshop00 ~]$ ls /data/ldm/fsl/01hr
    20082962000.nc 20082981400.nc 20083000800.nc 20083020200.nc 20083032000.nc 20083051400.nc 20083071000.nc 20083090400.nc
    ...
    20082981200.nc 20083000600.nc 20083020000.nc 20083031800.nc 20083051200.nc 20083070800.nc 20083090200.nc 20083102100.nc
    [thredds@workshop00 ~]$ ls /data/ldm/madis
    20081022_0700.nc 20081024_1000.nc 20081026_1300.nc 20081028_1600.nc 20081030_1900.nc 20081101_2200.nc 20081104_0100.nc
    ...
    20081024_0300.nc 20081026_0600.nc 20081028_0900.nc 20081030_1200.nc 20081101_1500.nc 20081103_1800.nc 20081105_2100.nc
    [thredds@workshop00 ~]$ ls /data/ldm/suomi
    CsuPWVh_2008.308.18.00.0060_nc CsuPWVh_2008.309.07.00.0060_nc CsuPWVh_2008.309.20.00.0060_nc CsuPWVh_2008.310.09.00.0060_nc
    ...
    CsuPWVh_2008.309.04.00.0060_nc CsuPWVh_2008.309.17.00.0060_nc CsuPWVh_2008.310.06.00.0060_nc CsuPWVh_2008.310.19.00.0060_nc
  2. Edit the main TDS configuration catalog:
    [thredds@workshop00 ~]$ cd ${TOMCAT_HOME}/content/thredds
    [thredds@workshop00 ~]$ vi catalog.xml     # Use the editor of your choice
    
  3. And add a datasetScan element for the FSL data:
    <datasetScan name="FSL" ID="FSL"
                 path="fsl" location="/data/ldm/fsl">
    
        <metadata inherited="true">
            <serviceName>thisDODS</serviceName>
        </metadata>
    </datasetScan>
  4. Do the same for the MADIS and Suomi data.
  5. Reinitialize the TDS configuration catalogs:
    1. Go to the TDS Remote Management page: http://localhost:8080/thredds/admin/debug
    2. Click on the "Reinitialize" link
  6. Test that the new datasetScan elements are working:
    1. Bring the catalog up in a browser: http://localhost:8080/thredds/catalog.html
    2. Browse into the new dataset collections.
    3. Try an OPeNDAP access method link

Now that we have multiple dataset roots ...

  1. Let's check the list of dataset roots:
    1. Go back to the TDS Remote Management page: http://localhost:8080/thredds/admin/debug
    2. Select the "Show data roots" link.
  2. Check the catalogInit.log:
    1. TDS Remote Management page [http://localhost:8080/thredds/admin/debug]
    2. Click the "Show TDS Logs" link.
    3. Select the "catalogInit.log" file

Exercise: Duplicate Roots

  1. Modify the FSL datasetScan element so that the value of the path attribute matches the one for the NAM_12km datasetScan element.
    [thredds@workshop00 ~]$ cd ${TOMCAT_HOME}/content/thredds
    [thredds@workshop00 ~]$ vi catalog.xml     # Use the editor of your choice
    
  2. Reinitialize the TDS [http://localhost:8080/thredds/admin/debug].
  3. See what happens with duplicate data roots:
    1. Browse into the FSL dataset [http://localhost:8080/thredds/catalog.html]
    2. Check the list of dataset roots [http://localhost:8080/thredds/admin/debug - click on "Show data roots"]
    3. Check the catalogInit.log [http://localhost:8080/thredds/admin/debug]
  4. Now fix the FSL datasetScan element.

More on the datasetScan Element

Including Only the Desired Files

A datasetScan element can specify which files and directories it will include with a filter element (see spec for more details). When no filter element is given, all files and directories are included in the generated catalog(s). Adding a filter element to your datasetScan element allows you to include (and/or exclude) the files and directories as desired. The datasetScan element for the NAM_12km example included the following:

<filter>
    <include wildcard="*.grib2" />
</filter>

To exclude the analysis data, the filter could be modified to:

<filter>
    <include wildcard="*.grib2" />
    <exclude wildcard="*f000.grib2" />
</filter>

The include and exclude elements both select datasets by testing whether their wildcard pattern (given by the wildcard attribute) or regular expression (given by the regExp attribute) matches the dataset name. By default, includes and excludes apply only to regular files (atomic datasets). You can specify that they apply to directories (collection datasets) as well by using the atomic and collection attributes. For example, if the nam_12km directory contained a badData directory, you could exclude it by adding the following to the filter:

<exclude wildcard="badData" atomic="false" collection="true" />
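
The filtering behavior described above can be approximated in Python. This sketch is not the TDS implementation and makes several assumptions: that an exclude match always wins, that a dataset with no applicable include selectors is kept, and that Python's `fnmatchcase` is close enough to the TDS wildcard syntax for these simple patterns. The function name `is_included` and the tuple layout are hypothetical.

```python
from fnmatch import fnmatchcase

# Illustrative sketch only. Each selector is (wildcard, atomic, collection),
# mirroring the wildcard, atomic, and collection attributes.


def is_included(name, is_collection, includes, excludes):
    """Decide whether a file or directory name passes the filter."""

    def targets_type(selector):
        _, atomic, collection = selector
        return collection if is_collection else atomic

    def matches(selector):
        return targets_type(selector) and fnmatchcase(name, selector[0])

    # Assumption: any matching exclude rejects the dataset.
    if any(matches(s) for s in excludes):
        return False
    # Assumption: includes only constrain the dataset types they target.
    applicable = [s for s in includes if targets_type(s)]
    if applicable:
        return any(matches(s) for s in applicable)
    return True


includes = [("*.grib2", True, False)]
excludes = [("*f000.grib2", True, False), ("badData", False, True)]

print(is_included("2008110406f018.grib2", False, includes, excludes))
print(is_included("2008110406f000.grib2", False, includes, excludes))
print(is_included("badData", True, includes, excludes))
```

With these selectors the forecast file is kept, the analysis file (f000) is dropped, and the badData directory is dropped.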

Exercise: Filtering Files

  1. Browse one of the datasets you just added and find a ".scour*" file. Try the OPeNDAP access method:
    Error {
        code = 500;
        message = "Cant read /data/ldm/madis/.scour*: not a valid NetCDF file.";
    };
    
  2. Now add a filter element to the datasetScan elements. Something like:
    <filter>
        <include wildcard="*.nc" />
        <include wildcard="*.grib1" />
        <include wildcard="*.grib2" />
    </filter>
    
  3. Reinitialize the TDS [http://localhost:8080/thredds/admin/debug].
  4. Are the filters working? [http://localhost:8080/thredds/catalog.html]

Exercise: Filtering Directories

  1. Browse around in the "FSL" dataset.
  2. Add a filter element to the "FSL" datasetScan element to exclude the "06min" directories. Something like:
    <exclude wildcard="06min" atomic="false" collection="true" />
    
  3. Reinitialize the TDS [http://localhost:8080/thredds/admin/debug].
  4. Are the filters working? [http://localhost:8080/thredds/catalog.html]

Generating IDs

All generated datasets are given an ID. Each ID is the path of the dataset relative to the scanned directory, appended to the ID of the datasetScan element if it has one, or otherwise to its path value. So, for the nam_12km directory and our current configuration:

<datasetScan name="NCEP NAM 12km" ID="NAM_12km"
             path="nam_12km" location="/data/ldm/nam_12km">

and the data file 2008110406f018.grib2, the value of the dataset ID would be "NAM_12km/2008110406f018.grib2".

Naming Datasets

By default, all datasets are named with the corresponding file name. By adding a namer element, you can specify more human-readable dataset names. The following namer looks for the dataset named "NCEP NAM 12km" and renames it with the replace string:

<namer>
    <regExpOnName regExp="NCEP NAM 12km" replaceString="NCEP NAM 12km model data" />
</namer>

More complex renaming is possible as well. The namer uses a regular expression match on the dataset name. If the match succeeds, any regular expression capturing groups are used in the replacement string.

A capturing group is a part of a regular expression enclosed in parentheses. When a regular expression with a capturing group is applied to a string, the substring that matches the capturing group is saved for later use. The captured strings can then be substituted into another string in place of capturing group references, "$n", where "n" is an integer indicating a particular capturing group. (The capturing groups are numbered according to the order in which they appear in the match string.) For example, the regular expression "Hi (.*), how are (.*)\?" (with the final question mark escaped so that it is matched literally) when applied to the string "Hi Fred, how are you?" would capture the strings "Fred" and "you". Following that with a capturing group replacement in the string "$2 are $1." would result in the string "you are Fred."

Here's an example namer:

<namer>
    <regExpOnName regExp="([0-9]{4})([0-9]{2})([0-9]{2})([0-9]{2})f([0-9]{3}).grib2"
                  replaceString="NCEP NAM 12km $1-$2-$3 $4 GMT - Forecast hour: $5"/>
</namer>

The regular expression has five capturing groups:

  1. The first capturing group, "([0-9]{4})", captures four digits, in this case the year.
  2. The second capturing group, "([0-9]{2})", captures two digits, in this case the month.
  3. The third capturing group, "([0-9]{2})", captures two digits, in this case the day of the month.
  4. The fourth capturing group, "([0-9]{2})", captures two digits, in this case the hour of the day.
  5. The fifth capturing group, "([0-9]{3})", captures three digits, in this case the forecast hour.

When applied to the dataset name "2008110406f018.grib2", the strings "2008", "11", "04", "06", and "018" are captured. After replacing the capturing group references in the replaceString attribute value, we get the name "NCEP NAM 12km 2008-11-04 06 GMT - Forecast hour: 018".
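
The substitution can be tried out with a small Python sketch. It only imitates the behavior described above: the function name `apply_namer` is hypothetical, and the use of `re.search` (rather than a full-string match) is an assumption about how the match is applied.

```python
import re

# Illustrative sketch only: imitates a regExpOnName namer by expanding
# "$n" capturing group references in the replace string.


def apply_namer(reg_exp, replace_string, dataset_name):
    """Return the new name, or None if the pattern does not match."""
    match = re.search(reg_exp, dataset_name)
    if match is None:
        return None
    # Replace each "$n" with the text captured by group n.
    return re.sub(
        r"\$(\d+)",
        lambda ref: match.group(int(ref.group(1))),
        replace_string,
    )


new_name = apply_namer(
    r"([0-9]{4})([0-9]{2})([0-9]{2})([0-9]{2})f([0-9]{3}).grib2",
    "NCEP NAM 12km $1-$2-$3 $4 GMT - Forecast hour: $5",
    "2008110406f018.grib2",
)
print(new_name)  # NCEP NAM 12km 2008-11-04 06 GMT - Forecast hour: 018
```

Trying a pattern this way before putting it into a catalog makes it easier to spot a capturing group that is off by a digit.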

Exercise: Naming the Suomi Datasets

  1. Add a namer element to the Suomi datasetScan element that extracts the date/time from the file name and uses it to generate a new name (similar to the example above).

Sorting Datasets

Sorting: Underlying Implementation Details

  1. The "natural" order of the datasets is determined by the order returned by the listDatasets() method in CrawlableDataset.
  2. The sort is done on the CrawlableDataset list. The naming discussed in the previous section is done to the resulting InvDataset. Therefore, the naming discussed above does not affect the sort order.

A sort element can be added to a datasetScan to specify the order in which a collection of datasets is listed. Without a sort element, datasets at each collection level are listed in their "natural" order. Currently, the only supported sort algorithm sorts datasets lexicographically by name in either increasing or decreasing order. Here's what a sort element looks like:

<sort>
    <lexigraphicByName increasing="false" />
</sort>
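
As a rough Python illustration of the ordering this produces (not TDS code; the function name `sort_by_name` and the sample file names are made up for this example), note that the sort operates on the underlying file names, not on any names later assigned by a namer:

```python
# Illustrative sketch only: lexicographic ordering by name, with the
# "increasing" flag mirroring the attribute on lexigraphicByName.


def sort_by_name(names, increasing=True):
    """Return the names in lexicographic order."""
    return sorted(names, reverse=not increasing)


file_names = [
    "2008110406f018.grib2",
    "2008110400f000.grib2",
    "2008110412f006.grib2",
]

# increasing="false" lists the lexicographically greatest name first.
print(sort_by_name(file_names, increasing=False))
```

Because the file names begin with a timestamp, decreasing order puts the most recent model run first.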

Adding "Latest" Proxy Datasets

With a real-time archive, it is convenient to define a "proxy" dataset that always points to the most recent dataset in a collection. Other types of proxy datasets may be useful as well, and the addProxies element provides a place for describing proxy datasets. Currently, only two addProxies child elements are defined, and both are "Latest" proxy elements. The simpleLatest element adds a proxy dataset that proxies the existing dataset whose name is lexicographically greatest (which finds the latest dataset, assuming a timestamp is part of the dataset name). The latestComplete element behaves similarly, except that it does not consider any dataset that has been modified more recently than a given time limit; e.g., you could ask for the lexicographically greatest dataset that has not been modified in the last 60 minutes. Both the simpleLatest and latestComplete elements must reference an existing service element.

To add a "Latest" dataset to our "NAM_12km" dataset, we could add:

<service name="latest" serviceType="Resolver" base="" />

to our catalog and

<addProxies>
    <latestComplete name="latestComplete.xml" top="true" serviceName="latest" lastModifiedLimit="60" />
</addProxies>

to our "NAM_12km" datasetScan element. This would result in the following dataset being at the top of the "NAM_12km" collection of datasets:

<dataset name="latestComplete.xml" serviceName="latest" urlPath="latestComplete.xml" />

The latestComplete element includes a name attribute which provides the name of the proxy dataset, the serviceName attribute that references the service used by the proxy dataset, the top attribute which indicates if the proxy dataset should appear at the top or bottom of the list of datasets in this collection, and the lastModifiedLimit which feeds into the algorithm which determines which dataset is being proxied.

The simpleLatest element allows for the same attributes as the latestComplete element minus the lastModifiedLimit attribute. In this case, all the attributes have default values: the name attribute defaults to "latest.xml", the top attribute defaults to "true", and the serviceName attribute defaults to "latest".
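
The difference between the two "Latest" proxies can be sketched in Python. This is an illustration of the selection rules described above, not the TDS implementation: the function names, the (name, minutes since last modification) tuples, and the treatment of a dataset exactly at the limit as complete are all assumptions.

```python
# Illustrative sketch only. Each dataset is (name, minutes_since_modified).


def simple_latest(datasets):
    """Pick the dataset whose name is lexicographically greatest."""
    names = [name for name, _ in datasets]
    return max(names) if names else None


def latest_complete(datasets, last_modified_limit):
    """Like simple_latest, but skip datasets modified too recently."""
    # Assumption: a dataset unmodified for at least the limit is "complete".
    complete = [name for name, age in datasets if age >= last_modified_limit]
    return max(complete) if complete else None


datasets = [
    ("2008110400f000.grib2", 600),
    ("2008110406f018.grib2", 240),
    ("2008110412f006.grib2", 5),  # possibly still being written
]

print(simple_latest(datasets))
print(latest_complete(datasets, last_modified_limit=60))
```

Here simpleLatest would proxy the 12Z file even though it was modified five minutes ago, while latestComplete with a 60 minute limit falls back to the 06Z file.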

Adding Dataset Size Information

The addDatasetSize element indicates that file size metadata should be added to all atomic datasets. Adding

<addDatasetSize />

to a datasetScan element results in the addition of a dataSize element to each atomic dataset:

<dataSize units="Kbytes">6.08</dataSize>

Adding timeCoverage Elements

A datasetScan element may contain an addTimeCoverage element. The addTimeCoverage element indicates that a timeCoverage metadata element should be added to each dataset in the collection and describes how to determine the time coverage for each dataset in the collection.

Currently, the addTimeCoverage element can only construct start/duration timeCoverage elements and uses the dataset name to determine the start time. As described in the "Naming Datasets" section above, the addTimeCoverage element applies a regular expression match to the dataset name. If the match succeeds, any regular expression capturing groups are used in the start time replacement string to build the start time string. The values of the following attributes are used to determine the time coverage:

  1. Either the datasetNameMatchPattern or the datasetPathMatchPattern attribute gives a regular expression used to match on the dataset name or path, respectively. If a match is found, a timeCoverage element is added to the dataset. The match pattern should include capturing groups which allow the match to save substrings from the dataset name.
  2. The startTimeSubstitutionPattern attribute value has all capture group references ("$n") replaced by the corresponding substring that was captured during the match. The resulting string is used as the start value of the resulting timeCoverage element.
  3. The duration attribute value is used as the duration value of the resulting timeCoverage element.

For instance, adding

<addTimeCoverage datasetNameMatchPattern="([0-9]{4})([0-9]{2})([0-9]{2})([0-9]{2})f[0-9]{3}.grib2"
                 startTimeSubstitutionPattern="$1-$2-$3T$4:00:00"
                 duration="60 hours" />

to a datasetScan element and given a data file named

2005071812f006.grib2

results in the following timeCoverage element:

<timeCoverage>
    <start>2005-07-18T12:00:00</start>
    <duration>60 hours</duration>
</timeCoverage>
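
The same steps can be walked through in Python. This sketch only imitates the behavior described above: the function name `time_coverage`, the returned dictionary, and the use of `re.search` are assumptions made for this example.

```python
import re

# Illustrative sketch only: builds start/duration values from a dataset
# name in the manner described for addTimeCoverage.


def time_coverage(dataset_name, match_pattern, start_substitution, duration):
    """Return start and duration values, or None if the name does not match."""
    match = re.search(match_pattern, dataset_name)
    if match is None:
        return None  # no match, so no timeCoverage element would be added
    # Replace each "$n" with the text captured by group n.
    start = re.sub(
        r"\$(\d+)",
        lambda ref: match.group(int(ref.group(1))),
        start_substitution,
    )
    return {"start": start, "duration": duration}


coverage = time_coverage(
    "2005071812f006.grib2",
    r"([0-9]{4})([0-9]{2})([0-9]{2})([0-9]{2})f[0-9]{3}.grib2",
    "$1-$2-$3T$4:00:00",
    "60 hours",
)
print(coverage)
```

The forecast hour is matched by the pattern but not captured, so it plays no part in the start time.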

Exercise: Add timeCoverage to the Suomi Datasets

  1. Add an addTimeCoverage element to the Suomi datasetScan element that extracts the date/time from the file name and uses the date/time to generate the timeCoverage element (similar to above).

This document is maintained by Unidata. Send comments to THREDDS support.