TDS Configuration Catalogs



Overview

THREDDS catalogs were originally designed as simple catalogs of remote datasets. They associate human-readable names with data access URLs and allow both hierarchical organization and the addition of metadata, thus providing client applications with the information needed to access remote datasets (as we saw earlier with the ToolsUI and IDV applications). [More information is available from the THREDDS catalog primer and specification document.]

In this section, we will take a look at the extensions to THREDDS catalogs that allow the TDS to use them for configuration. We call catalogs that use these extensions TDS Configuration Catalogs or Server-side Catalogs. They represent the top-level catalogs the TDS will serve, contain information detailing the datasets the TDS will serve, and indicate which services will be available for each dataset. All the configuration information is only needed by the server and is removed or transformed for the client view of the catalog.


Serving Datasets

In a client-side catalog, an access URL can be constructed for a dataset if the dataset: 1) references a service element, and 2) has a urlPath attribute or access child element. The service element provides a way to factor out access information from dataset elements.

To handle a data access request, the TDS needs enough configuration information so that it can map an incoming request URL to a location on local disk. In the TDS configuration files, the datasetRoot and datasetScan elements perform this function.

Associating a Dataset with a service Element

Looking at our main TDS config catalog, catalog.xml:

(1)  <service name="thisDODS" serviceType="OpenDAP" base="/thredds/dodsC/" />
     ...

(2a) <dataset name="Test Single Dataset" ID="testDataset" serviceName="thisDODS"
(3a)          urlPath="test/testData.nc" />

     <datasetScan name="NCEP GFS models" ID="model/NCEP/GFS"
(3b)              path="model/NCEP/GFS" location="/data/ldm/pub/native/grid/NCEP/GFS">

       <metadata inherited="true">
(2b)     <serviceName>thisDODS</serviceName>
       </metadata>

       ...
     </datasetScan>

The service is defined at (1) with the name "thisDODS". The service is referenced at (2a) and (2b) using the serviceName attribute and the serviceName element, respectively. Notice that the reference at (2b) is in an inherited metadata element, which means any descendant dataset elements also reference this service. The second part of making a dataset accessible is to specify the dataset URL path that gets appended to the service base URL. In the above example, this is done at (3a) and (3b). Note, however, that (3b) is in a server-side datasetScan element, so it is expanded when catalog requests are made to the server.

datasetRoot Element

The datasetRoot element provides a mapping between a base request URL and a data location that can be used with individual datasets.

Revisiting our current TDS configuration catalog:

     <service name="thisDODS" serviceType="OpenDAP" base="/thredds/dodsC/" />
(1)  <datasetRoot path="test" location="content/testdata/"/>

     <dataset name="Test Single Dataset" ID="testDataset" serviceName="thisDODS"
(2)           urlPath="test/testData.nc"/>

We can see that the dataset has an OPeNDAP access URL of /thredds/dodsC/test/testData.nc, constructed from the service base URL and the dataset urlPath. The datasetRoot element defines the request URL segment ("test") that it associates with a location on local disk ("content/testdata/", which is a special shortcut to ${TOMCAT_HOME}/content/thredds/public/testdata/). The TDS knows that the dataset uses the URL/location association defined by this datasetRoot element because the urlPath of the dataset (2) starts with the path of the datasetRoot (1).
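
The two mappings described above (service base plus urlPath to an access URL, and datasetRoot path to a disk location) can be sketched in a few lines of Python. This is only an illustration, not the TDS implementation: the function names are made up, the Tomcat directory is a hypothetical value for ${TOMCAT_HOME}, and matching on a whole path segment is an assumption.

```python
import posixpath

# Hypothetical datasetRoot table: URL path segment -> directory on disk.
# "/usr/local/tomcat" stands in for ${TOMCAT_HOME}; it is an assumed value.
DATASET_ROOTS = {
    "test": "/usr/local/tomcat/content/thredds/public/testdata/",
}


def access_url(service_base, url_path):
    """Append a dataset urlPath to a service base URL."""
    return service_base.rstrip("/") + "/" + url_path.lstrip("/")


def resolve_location(url_path, roots=DATASET_ROOTS):
    """Map a dataset urlPath to a file location via the matching datasetRoot."""
    for root_path, location in roots.items():
        # Assumption: the root path must match whole leading path segments.
        if url_path == root_path or url_path.startswith(root_path + "/"):
            remainder = url_path[len(root_path):].lstrip("/")
            return posixpath.join(location, remainder)
    return None  # no datasetRoot covers this urlPath


print(access_url("/thredds/dodsC/", "test/testData.nc"))
print(resolve_location("test/testData.nc"))
```

With the example catalog, the first call gives the access URL quoted above and the second gives the file under the testdata directory.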

Looking at the client-side view of the catalog (http://localhost:8080/thredds/catalog.xml), notice that the datasetRoot element is not included.

Let's try an example

  1. Take a look in the /data directory to find a single file to serve:
  2. Change the datasetRoot to point to the /data/idv/trajectory directory
  3. Change the dataset to reference the trajectory data file.
  4. Reinit the TDS
  5. Browse the catalog again

Note: Remember the value of the location attribute must be an absolute path (except for the special case of the "content" shortcut).


datasetScan Element

The datasetScan element provides a mapping between a base request URL and a data location that must reference an entire collection of datasets (i.e., for a local disk, the location must reference a directory). In the client view of the catalog, a datasetScan element is shown as a catalogRef element. The generation of the catalog for the collection is deferred until a request is made for that catalog. When the catalog is requested, the location directory is scanned: directories are represented as catalogRef elements and files are represented as dataset elements. The scanning of each subdirectory is likewise deferred until a request is made for the corresponding catalog.

Again, back to our current TDS configuration catalog:

     <service name="thisDODS" serviceType="OpenDAP" base="/thredds/dodsC/" />
     ...
     <datasetScan name="NCEP GFS models" ID="model/NCEP/GFS"
(1)               path="model/NCEP/GFS" location="/data/ldm/pub/native/grid/NCEP/GFS">
       <serviceName>thisDODS</serviceName>
       ...
     </datasetScan>

The path attribute on the datasetScan element is the part of the URL that identifies this datasetScan and is used to map data access URLs to a location on local disk. The location attribute on the datasetScan element provides the location of the dataset collection on the local file system (it must be a directory and should be an absolute path).
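
To make the deferred, one-level-at-a-time scan concrete, here is a rough Python sketch. It is not how the TDS is implemented; the function names and the tuple layout are invented for illustration, and the temporary directory stands in for the real data location.

```python
import os
import tempfile


def scan_one_level(location, path):
    """List a single directory level without descending into subdirectories.

    Directories become catalogRef-like entries and files become dataset-like
    entries; each entry carries the URL path built from the datasetScan path.
    """
    entries = []
    for entry in sorted(os.scandir(location), key=lambda e: e.name):
        kind = "catalogRef" if entry.is_dir() else "dataset"
        entries.append((kind, entry.name, path + "/" + entry.name))
    return entries


def demo_scan():
    """Build a tiny throwaway directory tree and scan its top level."""
    with tempfile.TemporaryDirectory() as top:
        os.mkdir(os.path.join(top, "Alaska_191km"))
        open(os.path.join(top, "notes.txt"), "w").close()
        return scan_one_level(top, "model/NCEP/GFS")


for kind, name, url_path in demo_scan():
    print(kind, name, url_path)
```

The subdirectory is listed but not opened; in this sketch it would only be scanned by a later call to scan_one_level with that subdirectory as the location.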

Let's inspect the resulting catalogs:

  1. Look at the client view of the catalog to see the catalogRef element that represents the data collection given by the location.
  2. Follow the catalogRef link to the catalog for that collection

Now that we've seen the details of the resulting XML, let's look at the catalog structure generated:

  1. Check out what is on disk
  2. Browse the catalogs

Note: Data root paths must be unique across a TDS. Because the TDS uses the set of all given path values to map URLs to datasets, each path value MUST be unique across all config catalogs on a given TDS installation. Duplicates will cause warning messages in the catalogErrors.log file.
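
If you keep the path values of your datasetRoot and datasetScan elements in a list, a quick pre-flight check for duplicates might look like the following Python sketch. The function name is hypothetical, and treating "test" and "test/" as the same path is an assumption made here, not something the TDS documentation states.

```python
from collections import Counter


def duplicate_paths(paths):
    """Return the data root path values that occur more than once, sorted."""
    # Assumption: leading/trailing slashes are not significant.
    counts = Counter(p.strip("/") for p in paths)
    return sorted(p for p, n in counts.items() if n > 1)


print(duplicate_paths(["test", "model/NCEP/GFS", "test/"]))
```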


service Element

The TDS provides several data services, including an OPeNDAP server, an HTTP bulk file download service, and a WCS service.

The URLs to access these services start with the TDS context name ("thredds") and the appropriate servlet name (e.g., "dodsC"). Because of this, the base attribute of the corresponding service elements must be exactly as follows:

OPeNDAP server:

  <service name="ncdods" serviceType="OPeNDAP" base="/thredds/dodsC/" />

HTTP bulk file server:

  <service name="fileServer" serviceType="HTTPServer" base="/thredds/fileServer/" />

WCS server:

  <service name="wcsServer" serviceType="WCS" base="/thredds/wcs/" />

You can use whatever name you choose for a service; it only needs to match the name used in the serviceName of the datasets that reference it. Note that the base URLs are relative, so your catalogs are independent of your server hostname and port.

Serving Datasets with Multiple Methods

Datasets can be made available through more than one access method by defining and then referencing a compound service element. For instance:

     <service name="multiService" serviceType="Compound" base="" >
       <service name="thisDODS" serviceType="OpenDAP" base="/thredds/dodsC/" />
       <service name="wcsServer" serviceType="WCS" base="/thredds/wcs/" />
     </service>

defines a compound service named "multiService" which contains two nested services. Any dataset that references the compound service will have two access methods. So the dataset:

    <dataset name="cool data" serviceName="multiService" urlPath="so/cool/data.nc" />

would have these two access URLs:

    /thredds/dodsC/so/cool/data.nc
    /thredds/wcs/so/cool/data.nc
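
The way a compound service fans out into one access URL per nested service can be sketched in Python as follows. The dictionary layout and function name are illustrative only; they mirror the XML above rather than any actual TDS data structure.

```python
# Illustrative stand-in for the compound service element shown above.
MULTI_SERVICE = {
    "name": "multiService",
    "serviceType": "Compound",
    "base": "",
    "services": [
        {"name": "thisDODS", "serviceType": "OpenDAP", "base": "/thredds/dodsC/"},
        {"name": "wcsServer", "serviceType": "WCS", "base": "/thredds/wcs/"},
    ],
}


def access_urls(service, url_path):
    """Return one access URL per non-compound service, in document order."""
    if service["serviceType"] == "Compound":
        urls = []
        for nested in service["services"]:
            urls.extend(access_urls(nested, url_path))
        return urls
    return [service["base"] + url_path]


for url in access_urls(MULTI_SERVICE, "so/cool/data.nc"):
    print(url)
```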

Exercise: Add WCS service to GFS model data

  1. Add a WCS service to our current OPeNDAP only GFS model data
  2. Change the existing OPeNDAP service to a compound service similar to the one above
  3. Reinit the TDS
  4. Check catalogErrors.log
  5. Browse the catalog again

Note: In a given catalog, the names of service elements must be unique.


More datasetScan Element

Including Only the Desired Files

A datasetScan element can specify which files and directories it will include with a filter element (see spec for more details). When no filter element is given, all files and directories are included in the generated catalog(s). Adding a filter element to your datasetScan element allows you to include (and/or exclude) the files and directories as desired. We saw a simple example earlier when we configured our TDS to serve the GFS model data:

      <filter>
        <include wildcard="GFS*.grib1" />
      </filter>

The include and exclude elements determine which datasets they match based on whether their wildcard pattern (given by the wildcard attribute) or regular expression (given by the regExp attribute) matches the dataset name. By default, includes and excludes apply only to regular files (atomic datasets). You can specify that they apply to directories (collection datasets) as well by using the atomic and collection attributes. For example, we can exclude the "Ensemble_1p25deg" collection by adding the following exclude element to the above filter:

        <exclude wildcard="Ensemble_1p25deg" atomic="false" collection="true" />
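
A minimal Python sketch of such a filter is shown below. It is a guess at reasonable semantics, not the TDS algorithm: it assumes that an exclude match always rejects, that a dataset is accepted when no include applies to its kind, and that Python's fnmatch wildcards behave closely enough to TDS wildcards for these patterns.

```python
from fnmatch import fnmatchcase


def applies(selector, is_collection):
    """Selectors default to atomic="true", collection="false"."""
    if is_collection:
        return selector.get("collection", False)
    return selector.get("atomic", True)


def accept(name, is_collection, includes, excludes):
    """Decide whether a file or directory name passes the filter."""
    inc = [s for s in includes if applies(s, is_collection)]
    exc = [s for s in excludes if applies(s, is_collection)]
    if any(fnmatchcase(name, s["wildcard"]) for s in exc):
        return False  # assumed: excludes win over includes
    if not inc:
        return True   # assumed: nothing to satisfy for this kind of dataset
    return any(fnmatchcase(name, s["wildcard"]) for s in inc)


INCLUDES = [{"wildcard": "GFS*.grib1"}]
EXCLUDES = [{"wildcard": "Ensemble_1p25deg", "atomic": False, "collection": True}]

print(accept("GFS_Alaska_191km_20051011_0000.grib1", False, INCLUDES, EXCLUDES))
print(accept("Ensemble_1p25deg", True, INCLUDES, EXCLUDES))
```

Under these assumptions the GRIB file is accepted and the "Ensemble_1p25deg" directory is rejected, while other directories pass because no include applies to collections.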

Let's try:

  1. Edit the main config catalog:
  2. Add the above exclude element to the existing filter
  3. Reinit the TDS
  4. Check catalogErrors.log
  5. Browse the catalog again

Generating IDs

All generated datasets are given an ID. The IDs are simply the path of the dataset appended to the datasetScan path value or, if one exists, the ID of the datasetScan element. So, for the GFS/Alaska_191km directory and our current configuration:

     <datasetScan name="NCEP GFS models" ID="model/NCEP/GFS"
                  path="model/NCEP/GFS" location="/data/ldm/pub/native/grid/NCEP/GFS">

the value of the dataset ID would be "model/NCEP/GFS/Alaska_191km".
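
The ID rule can be written down as a short Python sketch. The function name is hypothetical, and joining with a single "/" is an assumption based on the example value above.

```python
def generated_id(relative_path, scan_path, scan_id=None):
    """Build a generated dataset ID from the datasetScan ID, or its path if no ID is set."""
    base = scan_id if scan_id is not None else scan_path
    return base.rstrip("/") + "/" + relative_path.lstrip("/")


print(generated_id("Alaska_191km", "model/NCEP/GFS", scan_id="model/NCEP/GFS"))
```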

Let's try changing the ID for this dataset:

  1. Edit the main config catalog:
  2. Change the ID value
  3. Reinit the TDS
  4. Check catalogErrors.log
  5. Browse the catalog again

Naming Datasets

By default, all datasets are named with the corresponding file name. By adding a namer element, you can specify more human-readable dataset names. The following namer looks for the dataset named "Alaska_191km" and renames it with the replace string:

  <namer>
    <regExpOnName regExp="Alaska_191km" replaceString="NCEP GFS Alaska 191km model data" />
  </namer>

More complex renaming is possible as well. The namer uses a regular expression match on the dataset name. If the match succeeds, any regular expression capturing groups are used in the replacement string.

A capturing group is a part of a regular expression enclosed in parentheses. When a regular expression with a capturing group is applied to a string, the substring that matches the capturing group is saved for later use. The captured strings can then be substituted into another string in place of capturing group references, "$n", where "n" is an integer indicating a particular capturing group. (The capturing groups are numbered according to the order in which they appear in the match string.) For example, the regular expression "Hi (.*), how are (.*)\?" when applied to the string "Hi Fred, how are you?" would capture the strings "Fred" and "you". (The final question mark is escaped so that it matches a literal "?" rather than acting as a quantifier.) Following that with a capturing group replacement in the string "$2 are $1." would result in the string "you are Fred."
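
The same example can be tried out in Python's re module. Note that Python writes group references as "\n" (or "\g<n>") rather than "$n", and that the trailing question mark is escaped here so it is matched literally; otherwise this is just the "Hi Fred" example, not TDS code.

```python
import re

# The escaped "\?" matches the literal question mark at the end of the string.
match = re.search(r"Hi (.*), how are (.*)\?", "Hi Fred, how are you?")

captured = match.groups()             # the two captured substrings
result = match.expand(r"\2 are \1.")  # Python's spelling of "$2 are $1."

print(captured)
print(result)
```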

Here's an example namer:

  <namer>
    <regExpOnName regExp="GFS_Alaska_191km_([0-9]{4})([0-9]{2})([0-9]{2})_([0-9]{2})([0-9]{2})"
                  replaceString="NCEP GFS 191km Alaska $1-$2-$3 $4:$5:00 GMT"/>
  </namer>

The regular expression has five capturing groups:

  1. The first capturing group, "([0-9]{4})", captures four digits, in this case the year.
  2. The second capturing group, "([0-9]{2})", captures two digits, in this case the month.
  3. The third capturing group, "([0-9]{2})", captures two digits, in this case the day of the month.
  4. The fourth capturing group, "([0-9]{2})", captures two digits, in this case the hour of the day.
  5. The fifth capturing group, "([0-9]{2})", captures two digits, in this case the minutes of the hour.

When applied to the dataset name "GFS_Alaska_191km_20051011_0000.grib1", the strings "2005", "10", "11", "00", and "00" are captured. After replacing the capturing group references in the replaceString attribute value, we get the name "NCEP GFS 191km Alaska 2005-10-11 00:00:00 GMT".
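
To check a namer's regExp and replaceString before editing the catalog, a small Python stand-in can help. This is only a sketch: the rename function is hypothetical, it converts "$n" references to Python's "\g<n>" form, and it assumes (as the example result above suggests) that the new name consists of the expanded replace string alone.

```python
import re

REG_EXP = r"GFS_Alaska_191km_([0-9]{4})([0-9]{2})([0-9]{2})_([0-9]{2})([0-9]{2})"
REPLACE_STRING = "NCEP GFS 191km Alaska $1-$2-$3 $4:$5:00 GMT"


def rename(name, reg_exp=REG_EXP, replace_string=REPLACE_STRING):
    """Return the new dataset name, or the original name if there is no match."""
    match = re.search(reg_exp, name)
    if match is None:
        return name
    # Translate "$n" group references into Python's "\g<n>" syntax.
    template = re.sub(r"\$(\d+)", r"\\g<\1>", replace_string)
    return match.expand(template)


print(rename("GFS_Alaska_191km_20051011_0000.grib1"))
```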

To try it on our datasetScan element:

  1. Edit the main config catalog:
  2. Add the above namer element to the datasetScan element.
  3. Reinit the TDS
  4. Check catalogErrors.log
  5. Browse the catalog again

You could add namer elements for each subdirectory under GFS. However, that setup would cause every namer to be tried on every dataset under the GFS directory. One way to get around this problem would be to split the datasets out with a datasetScan element for each subdirectory.

So, we can:

  1. Edit the main config catalog:
  2. Copy the existing datasetScan element.
  3. Change one of the datasetScan elements to serve the "Alaska_191km" dataset.
  4. Add a filter to the existing datasetScan element to exclude the "Alaska_191km" data.
  5. Reinit the TDS
  6. Check catalogErrors.log
  7. Browse the catalog again

Note: Though the data root paths must be unique, they can be extensions of an existing path. The TDS looks for the path that has the longest match in a request URL.
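
Longest-match lookup over a set of data root paths can be sketched in Python like this. The function name is made up, and matching on whole path segments is an assumption; the TDS may compare paths differently.

```python
def best_data_root(request_path, roots):
    """Pick the data root whose path is the longest prefix of the request path."""
    candidates = [
        r for r in roots
        if request_path == r or request_path.startswith(r + "/")
    ]
    return max(candidates, key=len, default=None)


ROOTS = ["model/NCEP/GFS", "model/NCEP/GFS/Alaska_191km"]

print(best_data_root("model/NCEP/GFS/Alaska_191km/GFS_Alaska_191km_20051011_0000.grib1", ROOTS))
print(best_data_root("model/NCEP/GFS/Ensemble_1p25deg", ROOTS))
```

The first request is served by the more specific "Alaska_191km" root, and the second falls back to the shorter "model/NCEP/GFS" root.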


Sorting Datasets

A sort element can be added to a datasetScan to specify the order in which a collection of datasets is listed. Without a sort element, datasets at each collection level are listed in their "natural" order. Currently, the only supported sort algorithm sorts datasets lexicographically by name in either increasing or decreasing order. Here's what a sort element looks like:

  <sort>
    <lexigraphicByName increasing="false" />
  </sort>

Exercise:

  1. Edit the main config catalog:
  2. Add the above sort element to the "Alaska_191km" datasetScan element.
  3. Reinit the TDS
  4. Check catalogErrors.log
  5. Browse the catalog again

Notes (some underlying implementation details):
  1. The "natural" order of the datasets is determined by the order returned by the listDatasets() method in CrawlableDataset.
  2. The sort is done on the CrawlableDataset list. The naming discussed in the previous section is done to the resulting InvDataset. Therefore, the naming discussed above does not affect the sort order.
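
The second note, that sorting happens on the original names before any renaming, can be illustrated with a short Python sketch. The function and the sample names are hypothetical; the point is only the order of the two steps.

```python
def list_datasets(file_names, namer, increasing=True):
    """Sort on the original file names first, then apply the namer."""
    ordered = sorted(file_names, reverse=not increasing)
    return [namer(name) for name in ordered]


# Made-up labels chosen so that sorting by label would give a different order.
LABELS = {"a.grib1": "Zulu run", "b.grib1": "Alpha run"}

print(list_datasets(["b.grib1", "a.grib1"], LABELS.get, increasing=True))
print(list_datasets(["b.grib1", "a.grib1"], LABELS.get, increasing=False))
```

In the increasing case "Zulu run" is listed before "Alpha run", because the order was fixed by the file names "a.grib1" and "b.grib1".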

Adding "Latest" Proxy Datasets

With a real-time archive, it is convenient to define a "proxy" dataset that always points to the most recent dataset in a collection. Other types of proxy datasets may be useful as well, and the addProxies element provides a place for describing proxy datasets. Currently, only two addProxies child elements are defined. They are both "Latest" proxy elements. The simpleLatest element adds a proxy dataset which proxies the existing dataset whose name is lexicographically greatest (which finds the latest dataset, assuming a timestamp is part of the dataset name). The latestComplete element behaves similarly to simpleLatest, except that it does not consider any datasets that have been modified more recently than a given time limit; e.g., you could specify that you want the most recent (lexicographically) dataset that hasn't been modified for 60 minutes. Both the simpleLatest and latestComplete elements must point to an existing service element.

To add a "Latest" dataset to our "Alaska_191km" dataset, we could add:

  <service name="latest" serviceType="Resolver" base="" />

to our catalog and

    <addProxies>
      <latestComplete name="latestComplete.xml" top="true" serviceName="latest" lastModifiedLimit="60" />
    </addProxies>

to our "Alaska_191km" datasetScan element. This would result in the following dataset being at the top of the "Alaska_191km" collection of datasets:

    <dataset name="latestComplete.xml" serviceName="latest" urlPath="latestComplete.xml" />

The latestComplete element includes a name attribute, which provides the name of the proxy dataset; the serviceName attribute, which references the service used by the proxy dataset; the top attribute, which indicates whether the proxy dataset should appear at the top or bottom of the list of datasets in this collection; and the lastModifiedLimit attribute, which gives the time limit (60 minutes in this example) used to decide which dataset is proxied.

The simpleLatest element allows for the same attributes as the latestComplete element minus the lastModifiedLimit attribute. In this case, all the attributes have default values: the name attribute defaults to "latest.xml", the top attribute defaults to "true", and the serviceName attribute defaults to "latest".
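
The difference between the two "Latest" selections can be sketched in Python. These functions are illustrative stand-ins, not TDS code; modification times are plain epoch seconds, the limit is taken to be in minutes as in the example above, and treating a dataset modified exactly at the cutoff as eligible is an assumption.

```python
def simple_latest(names):
    """Pick the lexicographically greatest dataset name."""
    return max(names, default=None)


def latest_complete(datasets, now, last_modified_limit_minutes):
    """Pick the greatest name among datasets not modified within the limit.

    datasets is an iterable of (name, last_modified_epoch_seconds) pairs.
    """
    cutoff = now - last_modified_limit_minutes * 60
    settled = [name for name, mtime in datasets if mtime <= cutoff]
    return max(settled, default=None)


NOW = 1_000_000  # arbitrary "current time" for the example
DATASETS = [
    ("GFS_Alaska_191km_20051011_0000.grib1", NOW - 2 * 3600),  # modified 2 hours ago
    ("GFS_Alaska_191km_20051011_0600.grib1", NOW - 10 * 60),   # modified 10 minutes ago
]

print(simple_latest([name for name, _ in DATASETS]))
print(latest_complete(DATASETS, NOW, 60))
```

Here simple_latest picks the 0600 file, while latest_complete with a 60-minute limit skips it as possibly still being written and picks the 0000 file.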

Adding Dataset Size Information

The addDatasetSize element indicates that file size metadata should be added to all atomic datasets. Adding

  <addDatasetSize />

to a datasetScan element results in the addition of a dataSize element to each atomic dataset:


<dataSize units="Kbytes">6.08</dataSize>

Adding timeCoverage Elements

A datasetScan element may contain an addTimeCoverage element. The addTimeCoverage element indicates that a timeCoverage metadata element should be added to each dataset in the collection and describes how to determine the time coverage for each dataset in the collection.

Currently, the addTimeCoverage element can only construct start/duration timeCoverage elements and uses the dataset name to determine the start time. As described in the "Naming Datasets" section above, the addTimeCoverage element applies a regular expression match to the dataset name. If the match succeeds, any regular expression capturing groups are used in the start time replacement string to build the start time string. These attribute values are used to determine the time coverage:

  1. The datasetNameMatchPattern attribute value is used for a regular expression match on the dataset name. If a match is found, a timeCoverage element is added to the dataset. The match pattern should include capturing groups which allow the match to save substrings from the dataset name.
  2. The startTimeSubstitutionPattern attribute value has all capture group references ("$n") replaced by the corresponding substring that was captured during the match. The resulting string is used as the start value of the resulting timeCoverage element.
  3. The duration attribute value is used as the duration value of the resulting timeCoverage element.

Adding

  <addTimeCoverage datasetNameMatchPattern="([0-9]{4})([0-9]{2})([0-9]{2})_([0-9]{2})([0-9]{2}).grib1$"
                   startTimeSubstitutionPattern="$1-$2-$3T$4:00:00"
                   duration="60 hours" />

to a datasetScan element results in the following timeCoverage element:

  <timeCoverage>
    <start>2005-07-18T12:00:00</start>
    <duration>60 hours</duration>
  </timeCoverage>
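
The three attributes can be exercised together in a small Python sketch. It is not the TDS implementation: the function name and the returned dictionary are invented, the "$n" references are translated to Python's "\g<n>" form, and the dataset name used below is a hypothetical one chosen to reproduce the start time shown above.

```python
import re

MATCH_PATTERN = r"([0-9]{4})([0-9]{2})([0-9]{2})_([0-9]{2})([0-9]{2}).grib1$"
START_PATTERN = "$1-$2-$3T$4:00:00"


def time_coverage(name, duration="60 hours"):
    """Return start/duration values for a dataset name, or None if it does not match."""
    match = re.search(MATCH_PATTERN, name)
    if match is None:
        return None  # no timeCoverage element would be added
    # Translate "$n" group references into Python's "\g<n>" syntax.
    template = re.sub(r"\$(\d+)", r"\\g<\1>", START_PATTERN)
    return {"start": match.expand(template), "duration": duration}


# Hypothetical dataset name following the naming scheme used earlier.
print(time_coverage("GFS_Alaska_191km_20050718_1200.grib1"))
```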

This document is maintained by Unidata and was last updated on July 20, 2007. Send comments to THREDDS support.