
Dataset Inventory Catalog Primer

last update: June 1, 2004


Introduction

Here's an example of a very simple catalog:

 1 <?xml version="1.0" ?> 
 2 <catalog xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0" >
 3   <service name="aggServer" serviceType="DODS"  base="http://acd.ucar.edu/dodsC/" />
 4   <dataset name="SAGE III Ozone Loss" serviceName="aggServer" urlPath="sage.nc"/>
 5 </catalog>

with this line-by-line explanation:

  1. The first line indicates that it's an XML document.
  2. This is the root element of the XML, the catalog element. It must declare the THREDDS catalog namespace with the xmlns attribute exactly as shown.
  3. This declares a service with name aggServer. It is a DODS (OpenDAP) server whose dataset URLs all start with http://acd.ucar.edu/dodsC/.
  4. This declares a dataset whose name is SAGE III Ozone Loss. It references the aggServer service, and so its full URL will be http://acd.ucar.edu/dodsC/sage.nc.
  5. This closes the catalog element.
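
The URL construction described for line 4 can be sketched in Python using the standard library's xml.etree.ElementTree. This is only an illustration, not the THREDDS library's own code: the helper name resolve_urls is hypothetical, and it assumes the full URL is simply the service base followed by the dataset's urlPath, as in the explanation above.

```python
import xml.etree.ElementTree as ET

# The THREDDS catalog namespace, copied exactly from the catalog element.
NS = {"t": "http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"}

CATALOG = """<?xml version="1.0" ?>
<catalog xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0" >
  <service name="aggServer" serviceType="DODS"  base="http://acd.ucar.edu/dodsC/" />
  <dataset name="SAGE III Ozone Loss" serviceName="aggServer" urlPath="sage.nc"/>
</catalog>
"""


def resolve_urls(catalog_xml):
    """Return {dataset name: full URL} for the top-level datasets.

    Hypothetical helper: assumes full URL = service base + urlPath.
    """
    root = ET.fromstring(catalog_xml)
    # Map each service name to its base URL.
    bases = {s.get("name"): s.get("base") for s in root.findall("t:service", NS)}
    urls = {}
    for ds in root.findall("t:dataset", NS):
        urls[ds.get("name")] = bases[ds.get("serviceName")] + ds.get("urlPath")
    return urls


print(resolve_urls(CATALOG))
```

Note that the element lookups are namespace-qualified; without the namespace mapping, findall would not match the service and dataset elements.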

Nested datasets

Usually you have many datasets to declare in each catalog, which you do using nested datasets:

 <?xml version="1.0" ?> 
 <catalog name="Example" xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0" >
   <service name="aggServer" base="http://acd.ucar.edu/dodsC/" serviceType="DODS" />

1  <dataset name="SAGE III Ozone Loss Experiment" >
2     <dataset name="January Averages" serviceName="aggServer" urlPath="sage/avg/jan.nc"/>
2     <dataset name="February Averages" serviceName="aggServer" urlPath="sage/avg/feb.nc"/>
2     <dataset name="March Averages" serviceName="aggServer" urlPath="sage/avg/mar.nc"/>
3  </dataset>

 </catalog>

  1. This now declares a collection dataset, which just acts as a container for the other datasets. Note that it ends in a > instead of />.
  2. These are the datasets that directly point to data, called direct datasets.
  3. This closes the collection dataset element on line 1.

You can add any level of nesting you want, e.g.:

<?xml version="1.0" ?> 
<catalog name="Example" xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0" >
 <service name="aggServer" base="http://acd.ucar.edu/dodsC/" serviceType="DODS" />

 <dataset name="SAGE III Ozone Loss Experiment" >

  <dataset name="Monthly Averages" >
   <dataset name="January Averages" serviceName="aggServer" urlPath="sage/avg/jan.nc"/>
   <dataset name="February Averages" serviceName="aggServer" urlPath="sage/avg/feb.nc"/>
   <dataset name="March Averages" serviceName="aggServer" urlPath="sage/avg/mar.nc"/>
  </dataset>

  <dataset name="Daily Flight Data" >
   <dataset name="January">
     <dataset name="Jan 1, 2001" serviceName="aggServer" urlPath="sage/daily/20010101.nc"/>
     <dataset name="Jan 2, 2001" serviceName="aggServer" urlPath="sage/daily/20010102.nc"/>
   </dataset>
  </dataset>

 </dataset>
</catalog>
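
A client reading a nested catalog typically walks the dataset tree recursively. The following Python sketch shows one way to do that with xml.etree.ElementTree; the function name walk is hypothetical, the catalog is a shortened copy of the example above, and a dataset is treated as direct whenever it carries a urlPath attribute, which is an assumption made for this illustration.

```python
import xml.etree.ElementTree as ET

NS = {"t": "http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"}

# Shortened copy of the nested example above.
CATALOG = """<catalog name="Example" xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0">
 <service name="aggServer" base="http://acd.ucar.edu/dodsC/" serviceType="DODS" />
 <dataset name="SAGE III Ozone Loss Experiment">
  <dataset name="Monthly Averages">
   <dataset name="January Averages" serviceName="aggServer" urlPath="sage/avg/jan.nc"/>
  </dataset>
  <dataset name="Daily Flight Data">
   <dataset name="January">
    <dataset name="Jan 1, 2001" serviceName="aggServer" urlPath="sage/daily/20010101.nc"/>
   </dataset>
  </dataset>
 </dataset>
</catalog>
"""


def walk(element, path=()):
    """Yield (tuple of dataset names, urlPath) for every direct dataset below element."""
    for ds in element.findall("t:dataset", NS):
        here = path + (ds.get("name"),)
        # Assumption: a dataset with a urlPath points directly to data.
        if ds.get("urlPath") is not None:
            yield here, ds.get("urlPath")
        # Recurse into any nested datasets, whatever the depth.
        yield from walk(ds, here)


direct = list(walk(ET.fromstring(CATALOG)))
for names, url_path in direct:
    print(" / ".join(names), "->", url_path)
```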

More dataset information

There's a lot of other information that can optionally be added to help applications and digital libraries know how to "do the right thing" with the dataset. The collectionType attribute is used on collection datasets. The dataType is a simple classification (e.g. Image, Grid, Point data, etc.). The dataFormatType describes what format the data is stored in (e.g. NetCDF, HDF5, etc.), which matters when the data is accessed through a file transfer protocol like FTP. The combination of the naming authority and the ID attribute should form a globally unique identifier for a dataset.

<dataset name="SAGE III Ozone Loss Experiment" collectionType="TimeSeries">
  <dataset name="January Averages" serviceName="aggServer" urlPath="sage/avg/jan.nc" authority="unidata.ucar.edu" ID="sage-20938483">
    <dataType>Trajectory</dataType>
    <dataFormatType>NetCDF</dataFormatType>
  </dataset>
</dataset>

The harvest attribute indicates that the dataset is at the right level of granularity to be exported to search systems like Digital Libraries. Elements such as summary, rights, and publisher are needed in order to create valid entries for these services. For more details, see Exporting THREDDS Datasets to Digital Libraries. For a complete reference, see the Catalog Specification.

<dataset name="SAGE III Ozone Loss Experiment" harvest="true">
  <contributor role="data manager">John Smith</contributor>
  <keyword>Atmospheric Chemistry</keyword>
  <publisher>
    <name vocabulary="DIF">Community Data Portal, National Center for Atmospheric Research, University Corporation for Atmospheric Research</name>
    <contact url="http://dataportal.ucar.edu" email="cdp@ucar.edu"/>
  </publisher>
</dataset>

Factoring out information

Rather than declare the same information on each dataset, you can use the metadata element to factor out common information:

<dataset name="SAGE III Ozone Loss Experiment" >

1 <metadata inherit="true">
2   <serviceName>aggServer</serviceName>
2   <dataType>Trajectory</dataType>
2   <dataFormatType>NetCDF</dataFormatType>
2   <authority>unidata.ucar.edu</authority>
  </metadata>

3 <dataset name="January Averages" urlPath="sage/avg/jan.nc" ID="sage-23487382"/>
3 <dataset name="February Averages" urlPath="sage/avg/feb.nc" ID="sage-63656446"/>
4 <dataset name="Global Averages" urlPath="sage/global.nc" ID="sage-7869700g" dataType="Grid"/>

</dataset>

  1. The metadata element with inherit="true" implies that all the information inside the metadata element applies to the current dataset and all nested datasets.
  2. The serviceName, dataType, dataFormatType and authority are declared as elements.
  3. These datasets now use the serviceName, dataType, dataFormatType and authority values declared in the parent dataset.
  4. This dataset uses the serviceName, dataFormatType and authority values and overrides the dataType.
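
The inheritance rules in the list above can be sketched in Python. This is a simplified illustration, not the THREDDS library's implementation: the function name effective and the FIELDS tuple are hypothetical, only metadata elements with inherit="true" are considered, and an attribute on a dataset is assumed to override the inherited value for that dataset only.

```python
import xml.etree.ElementTree as ET

NS = {"t": "http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"}
# Hypothetical list of the fields tracked in this sketch.
FIELDS = ("serviceName", "dataType", "dataFormatType", "authority")

FRAGMENT = """<dataset name="SAGE III Ozone Loss Experiment"
    xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0">
  <metadata inherit="true">
    <serviceName>aggServer</serviceName>
    <dataType>Trajectory</dataType>
    <dataFormatType>NetCDF</dataFormatType>
    <authority>unidata.ucar.edu</authority>
  </metadata>
  <dataset name="January Averages" urlPath="sage/avg/jan.nc" ID="sage-23487382"/>
  <dataset name="Global Averages" urlPath="sage/global.nc" ID="sage-7869700g" dataType="Grid"/>
</dataset>
"""


def effective(ds, inherited=None):
    """Return {dataset name: resolved field values} for ds and all nested datasets."""
    current = dict(inherited or {})
    # Values inside <metadata inherit="true"> apply here and to nested datasets.
    for md in ds.findall("t:metadata", NS):
        if md.get("inherit") == "true":
            for field in FIELDS:
                child = md.find("t:" + field, NS)
                if child is not None:
                    current[field] = child.text
    # Assumption: attributes on the dataset override inherited values
    # for this dataset only.
    own = dict(current)
    for field in FIELDS:
        if ds.get(field) is not None:
            own[field] = ds.get(field)
    result = {ds.get("name"): own}
    for nested in ds.findall("t:dataset", NS):
        result.update(effective(nested, current))
    return result


resolved = effective(ET.fromstring(FRAGMENT))
for name, values in resolved.items():
    print(name, values)
```

In this sketch, "January Averages" picks up all four inherited values, while "Global Averages" keeps the inherited serviceName, dataFormatType and authority but reports Grid as its dataType.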

More Advanced Topics

XML Namespaces and Validation

If you use elements from other namespaces, you must declare those namespaces in the catalog element. Currently THREDDS libraries recognize two other namespaces, Dublin Core and XLink, which are declared like this:

<catalog name="MyName"
    xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0" 
    xmlns:dc="http://purl.org/dc/elements/1.1/"  
    xmlns:xlink="http://www.w3.org/1999/xlink" >

It's not obvious, but namespaces are not web addresses; they are just strings that need to be copied exactly as you see them here.

As catalogs get more complicated, you should check that you haven't made any errors. There are three parts to checking:

  1. Is the XML well-formed?
  2. Is it valid against the catalog schema?
  3. Does it have everything it needs to be read by a THREDDS client?

You can use any THREDDS validation service to check all three of these.

You can check well-formedness using an XML tool like XMLSpy; in order to check validity in those tools you will need to declare the catalog schema location like this:

<catalog name="MyName"
  xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0" 
1 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
2 xsi:schemaLocation="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0 http://www.unidata.ucar.edu/schemas/thredds/InvCatalog.1.0.xsd"> 
...
</catalog>

  1. This line declares the schema-instance namespace. Just copy it exactly as you see it here.
  2. This line tells your XML validation tool where to find the THREDDS schema. Just copy it exactly as you see it here.

The THREDDS validation service, as well as the catalog library, knows where the schemas are located, so you only need to add these 2 lines if you want to do your own validation.
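
If you only want a quick check of the first point, well-formedness, a few lines of Python are enough. The sketch below uses the standard library's xml.etree.ElementTree; the helper name is_well_formed is hypothetical. It does not check validity against the catalog schema or whether a THREDDS client can read the catalog, so those two checks still need a schema-aware tool or a validation service.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return (True, None) if xml_text parses as XML, else (False, error message)."""
    try:
        ET.fromstring(xml_text)
        return True, None
    except ET.ParseError as err:
        # The parser reports the line and column of the first problem it finds.
        return False, str(err)


GOOD = """<catalog xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0">
  <dataset name="SAGE III Ozone Loss Experiment"/>
</catalog>"""

# Same catalog, but the dataset element ends in > instead of /> and is never closed.
BAD = """<catalog xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0">
  <dataset name="SAGE III Ozone Loss Experiment">
</catalog>"""

print(is_well_formed(GOOD))
print(is_well_formed(BAD))
```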

Catalog References

It can be useful to break up large catalogs into pieces in order to maintain each piece separately. One way to do this is to build each piece as a separate and logically complete catalog, then create a master catalog using catalog references:

<catalog name="master" 
    xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0" 
    xmlns:xlink="http://www.w3.org/1999/xlink" >

 <dataset  name="List of THREDDS catalogs">
   <catalogRef xlink:title="IRI/LDEO Climate Data Library" xlink:href="http://iridl.ldeo.columbia.edu/SOURCES/thredds.xml"/>
   <catalogRef xlink:title="NCAR Data Portal" xlink:href="http://dataportal.ucar.edu/metadata/ucar.thredds"/>
   <catalogRef xlink:title="NOAA-CIRES Climate Diagnostics Center" xlink:href="http://www.cdc.noaa.gov/THREDDS/catalog.xml"/>
   <catalogRef xlink:title="Unidata THREDDS-IDD Server" xlink:href="http://motherlode.ucar.edu:8080/thredds/catalog.xml"/>
   <catalogRef xlink:title="University of Alabama Huntsville POND server" xlink:href="http://pond.itsc.uah.edu/catalog/thredds/pond_cat.xml"/>
  </dataset>
</catalog>

In this example we have several catalogRef elements, each with a link to an external catalog, using the xlink:href attribute. The catalogRef should be thought of as a dataset, whose contents are the contents of the external catalog. The xlink:title is used as the name of the dataset. Notice that we must declare the xlink namespace in the catalog element.
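
As an illustration of how a client might read these references, the Python sketch below lists the title and link of each catalogRef. It is a minimal example using xml.etree.ElementTree, with a hypothetical helper name catalog_refs and a shortened copy of the master catalog; it only extracts the links and does not fetch the external catalogs. The xlink attributes have to be looked up by their namespace-qualified names.

```python
import xml.etree.ElementTree as ET

THREDDS = "{http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0}"
XLINK = "{http://www.w3.org/1999/xlink}"

# Shortened copy of the master catalog above.
MASTER = """<catalog name="master"
    xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"
    xmlns:xlink="http://www.w3.org/1999/xlink" >
 <dataset name="List of THREDDS catalogs">
   <catalogRef xlink:title="IRI/LDEO Climate Data Library" xlink:href="http://iridl.ldeo.columbia.edu/SOURCES/thredds.xml"/>
   <catalogRef xlink:title="NCAR Data Portal" xlink:href="http://dataportal.ucar.edu/metadata/ucar.thredds"/>
 </dataset>
</catalog>
"""


def catalog_refs(catalog_xml):
    """Return a list of (title, href) pairs, one per catalogRef element."""
    root = ET.fromstring(catalog_xml)
    return [
        (ref.get(XLINK + "title"), ref.get(XLINK + "href"))
        for ref in root.iter(THREDDS + "catalogRef")
    ]


refs = catalog_refs(MASTER)
for title, href in refs:
    print(title, "->", href)
```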


CVS date: $Date: 2003/12/24 00:00:04 $

 

 

 

 

 
 