THREDDS Technical Summary

Overview

THREDDS fundamentally provides middleware services to bridge the gap between data providers and data consumers. We are also involved in developing and enhancing some of the underlying data access software tools, libraries and protocols themselves, as well as influencing how data providers and clients use them.

THREDDS is a key element in support of the Unidata 2008 proposal's "Distributed, organized collections of digital material" (endeavor 5) and "Improved data access infrastructure" (endeavor 6).

Accomplishments

Dataset Inventory Catalogs are XML documents that allow a data provider to simply list available on-line datasets. The catalog creator can group datasets into a simple hierarchical classification scheme, which makes a catalog into a “logical data directory”. At a minimum, the catalog specifies the “human readable” dataset name, and how to access it. The catalog also provides a place to add arbitrary metadata about the dataset. We are focusing on enhancing selected datasets by adding space and time bounding boxes, standard names, and data type information. Catalogs can be static XML files, or dynamically generated by Web servers to track continuously changing datasets.
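As a rough illustration of the "logical data directory" idea, the following Python sketch walks a small nested catalog and lists each accessible dataset with its place in the hierarchy. The catalog content, the element and attribute names (dataset, name, urlPath, metadata, dataType) and the helper list_datasets are assumptions made for this example, not a statement of the catalog schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog; element and attribute names are illustrative assumptions.
CATALOG_XML = """
<catalog name="Example catalog">
  <dataset name="Model data">
    <dataset name="Global forecast, latest run" urlPath="models/global/latest.nc">
      <metadata>
        <dataType>Grid</dataType>
      </metadata>
    </dataset>
    <dataset name="Regional forecast, latest run" urlPath="models/regional/latest.nc"/>
  </dataset>
  <dataset name="Station observations" urlPath="obs/stations.nc"/>
</catalog>
"""


def list_datasets(element, path=()):
    """Yield (hierarchy path, access path) for every dataset that has an access path."""
    for child in element.findall("dataset"):
        here = path + (child.get("name"),)
        if child.get("urlPath") is not None:
            yield here, child.get("urlPath")
        # Recurse so that nested groups form the logical directory.
        yield from list_datasets(child, here)


root = ET.fromstring(CATALOG_XML)
entries = list(list_datasets(root))
for names, url_path in entries:
    print(" / ".join(names), "->", url_path)
```

Grouping datasets without an access path (here "Model data") act purely as directory levels, while leaf datasets carry the human-readable name and the access information.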

Simple THREDDS Servers are data servers that have Dataset Inventory Catalogs associated with them. The primary focus of THREDDS has been developing these servers in collaboration with our data provider partners. Current servers include ones at IRI/LDEO (Columbia), SSEC (Madison), NOAA-CIRES Climate Diagnostics Center, Fleet Numerical Meteorology and Oceanography Center, and NCAR.

The THREDDS/IDD Server makes much of the real-time data coming in on the Unidata IDD available on a THREDDS server. This includes the NCEP model data, satellite data from NOAAPORT and the Unidata/Wisconsin data streams, NEXRAD Radar, Profiler data from NOAA/FSL, as well as METAR, upper air, buoy, SAO and SHEF hydrology station data. The THREDDS/IDD Server will become part of an enhanced LDM that will be available to the Unidata community of 150 IDD users.

We have worked extensively with OpenDAP/DODS developers, and the next version of OpenDAP servers will have integrated THREDDS Catalogs, so the next generation of OpenDAP servers will automatically be THREDDS servers. We have also developed the THREDDS OpenDAP Aggregation Server, an OpenDAP data server that aggregates OpenDAP datasets, serves netCDF datasets, and has THREDDS catalogs already integrated. The Live Access Server from NOAA/PMEL is a Web server that provides access to and visualization of scientific data. It is currently being modified to provide THREDDS catalogs for its data.

Another key THREDDS component for data providers is the Catalog Generator, which scans file directories and generates THREDDS catalogs automatically. This is a highly configurable tool that gives users control over the arrangement and naming of their datasets, adding metadata, extracting information from the datasets, etc. The Catalog Validator provides XML and semantic validation of Catalogs, as well as verification of the datasets themselves.
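The directory-scanning idea can be sketched in a few lines of Python. This is not the Catalog Generator itself: the function names, the ".nc" filter and the dataset/urlPath element names are assumptions chosen for illustration, and none of the tool's configuration options are modeled.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def build_catalog(root_dir, suffix=".nc"):
    """Return a <catalog> element that mirrors the directory tree under root_dir."""
    catalog = ET.Element("catalog", name=os.path.basename(root_dir))
    _add_directory(catalog, root_dir, root_dir, suffix)
    return catalog


def _add_directory(parent, directory, root_dir, suffix):
    # Sorted listing keeps the generated catalog stable between runs.
    for entry in sorted(os.listdir(directory)):
        full = os.path.join(directory, entry)
        if os.path.isdir(full):
            # A subdirectory becomes a grouping dataset with no access path.
            group = ET.SubElement(parent, "dataset", name=entry)
            _add_directory(group, full, root_dir, suffix)
        elif entry.endswith(suffix):
            # A matching file becomes a leaf dataset with a relative access path.
            rel = os.path.relpath(full, root_dir).replace(os.sep, "/")
            ET.SubElement(parent, "dataset", name=entry, urlPath=rel)


# Usage: catalog a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "models"))
    for rel in ("models/run_00.nc", "models/run_12.nc", "notes.txt"):
        open(os.path.join(tmp, *rel.split("/")), "w").close()
    demo_catalog = build_catalog(tmp)
    demo_paths = [d.get("urlPath") for d in demo_catalog.iter("dataset") if d.get("urlPath")]
    print(ET.tostring(demo_catalog, encoding="unicode"))
```

A real generator would also let the provider rename and rearrange datasets and attach metadata; the sketch covers only the scan-and-mirror step.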

The ADDE Cataloger is a middleware service that constructs Catalogs for ADDE/McIDAS data servers. It provides “virtual dataset” services, for example a dataset named “latest” or “last 3 hours”, along with a resolver service that translates a virtual dataset into a list of actual datasets available on the ADDE server. This level of indirection is important for real-time and very large datasets, because it lets users choose datasets of the right granularity.
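The resolver idea can be sketched as a function that maps a virtual name onto whatever the server currently holds. The inventory, the dataset names and the resolve function below are hypothetical and do not use the ADDE protocol; they only show the indirection.

```python
from datetime import datetime, timedelta

# Hypothetical inventory: (dataset name, observation time) pairs.
INVENTORY = [
    ("sat_20060815_0900", datetime(2006, 8, 15, 9, 0)),
    ("sat_20060815_1000", datetime(2006, 8, 15, 10, 0)),
    ("sat_20060815_1100", datetime(2006, 8, 15, 11, 0)),
    ("sat_20060815_1200", datetime(2006, 8, 15, 12, 0)),
]


def resolve(virtual_name, inventory, now):
    """Translate a virtual dataset name into a list of actual dataset names."""
    ordered = sorted(inventory, key=lambda item: item[1])
    if virtual_name == "latest":
        return [ordered[-1][0]] if ordered else []
    if virtual_name == "last 3 hours":
        cutoff = now - timedelta(hours=3)
        return [name for name, when in ordered if cutoff <= when <= now]
    raise KeyError(f"unknown virtual dataset: {virtual_name!r}")


now = datetime(2006, 8, 15, 12, 30)
print(resolve("latest", INVENTORY, now))
print(resolve("last 3 hours", INVENTORY, now))
```

Because the client only ever asks for "latest", it does not need to know file naming schemes or how often new data arrive.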

Dataset Query Capability XML documents are used by middleware services such as the ADDE Cataloger and the THREDDS/IDD Server to specify compactly what datasets are available from a data server. They allow data providers to specify the set of orthogonal choices (for example: station, field, time) that an end user should make to select from a large and/or real-time collection of datasets. They also tell data clients how to present appropriate choices in a user interface, without the client knowing anything specific about the server.
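A minimal sketch of the "orthogonal choices" idea follows, with the capability description held as a Python dictionary rather than XML. The base URL, the selector values and the function names are all assumptions for illustration; the actual document format is not reproduced here.

```python
from itertools import product
from urllib.parse import urlencode

# Hypothetical description of the independent choices a server offers.
QUERY_CAPABILITY = {
    "base": "http://server.example/query",
    "selectors": {
        "station": ["DEN", "BOU"],
        "field": ["temperature", "wind"],
        "time": ["latest", "last 3 hours"],
    },
}


def build_query(capability, **choices):
    """Validate one choice per selector and return the resulting query URL."""
    params = []
    for selector, allowed in capability["selectors"].items():
        if selector not in choices:
            raise ValueError(f"missing choice for {selector!r}")
        if choices[selector] not in allowed:
            raise ValueError(f"{choices[selector]!r} is not offered for {selector!r}")
        params.append((selector, choices[selector]))
    return capability["base"] + "?" + urlencode(params)


def count_combinations(capability):
    """Number of distinct datasets the selectors describe."""
    return len(list(product(*capability["selectors"].values())))


print(build_query(QUERY_CAPABILITY, station="DEN", field="wind", time="latest"))
print(count_combinations(QUERY_CAPABILITY))
```

Three short lists describe eight datasets here; the saving grows quickly as selectors gain values, which is why this form is more compact than listing every dataset.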

Catalogs are read by the Dataset Searcher, which provides a programmatic interface for searching by space and time bounding boxes, standard names, data type and server type. People can also search for datasets through a web interface. This is a prototype system that will be developed further in the future.
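The kind of search described above can be sketched as a filter over dataset records. The record fields, the sample records and the search function are hypothetical and are not the Dataset Searcher's interface; the bounding-box test also assumes boxes that do not cross the date line.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class DatasetRecord:
    name: str
    west: float
    south: float
    east: float
    north: float
    start: datetime
    end: datetime
    data_type: str


def search(records, bbox=None, time_range=None, data_type=None):
    """Return names of records matching every criterion that was supplied."""
    hits = []
    for rec in records:
        if bbox is not None:
            west, south, east, north = bbox
            # Two boxes overlap when they overlap in both longitude and latitude.
            if not (rec.west <= east and west <= rec.east
                    and rec.south <= north and south <= rec.north):
                continue
        if time_range is not None:
            start, end = time_range
            if not (rec.start <= end and start <= rec.end):
                continue
        if data_type is not None and rec.data_type != data_type:
            continue
        hits.append(rec.name)
    return hits


RECORDS = [
    DatasetRecord("conus_grid", -130, 20, -60, 55,
                  datetime(2006, 8, 1), datetime(2006, 8, 15), "Grid"),
    DatasetRecord("europe_grid", -10, 35, 30, 70,
                  datetime(2006, 8, 1), datetime(2006, 8, 15), "Grid"),
    DatasetRecord("conus_stations", -125, 25, -65, 50,
                  datetime(2006, 7, 1), datetime(2006, 7, 31), "Station"),
]

box = (-110, 30, -100, 40)
print(search(RECORDS, bbox=box))
print(search(RECORDS, bbox=box, time_range=(datetime(2006, 8, 10), datetime(2006, 8, 12))))
```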

The THREDDS Dataset Exporter creates “resource records” appropriate to add to Digital Libraries such as DLESE, NSDL and GCMD. This prototype system uses special metadata records that are added to the datasets in a catalog, which specify the additional information needed by the DL, such as Dublin Core or DIF formats. The Dataset Exporter uses the Open Archives Initiative (OAI) protocol to send these records into the DLESE and NSDL databases.
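To make the idea of a resource record concrete, the sketch below builds a small Dublin Core record from a dataset description. The dataset dictionary and the to_dublin_core function are hypothetical, the mapping from catalog metadata to Dublin Core fields is an assumption, and no OAI exchange is performed; only the two XML namespaces are the standard Dublin Core and oai_dc ones.

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"
OAI_DC = "http://www.openarchives.org/OAI/2.0/oai_dc/"
ET.register_namespace("dc", DC)
ET.register_namespace("oai_dc", OAI_DC)


def to_dublin_core(dataset):
    """Build an oai_dc record holding whichever Dublin Core fields are present."""
    record = ET.Element(f"{{{OAI_DC}}}dc")
    for field in ("title", "description", "creator", "identifier"):
        value = dataset.get(field)
        if value:
            ET.SubElement(record, f"{{{DC}}}{field}").text = value
    return record


# Hypothetical dataset description taken from a catalog entry.
dataset = {
    "title": "Global forecast, latest run",
    "description": "Gridded model output served through a catalog.",
    "identifier": "models/global/latest.nc",
}
record = to_dublin_core(dataset)
print(ET.tostring(record, encoding="unicode"))
```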

THREDDS clients are application programs that know how to read THREDDS Catalogs and how to read data using some or all of the THREDDS data server types, such as OpenDAP, ADDE, netCDF, etc. The Integrated Data Viewer (IDV), also developed at Unidata, is a full-featured analysis program capable of advanced 3D visualization based on the VisAD library. VGEE is an educational content development system built on top of the IDV. New Media Studios is another educational content development framework, which uses Macromedia Director and IDL and is now in the process of being made THREDDS capable. The THREDDS Data Viewer is a tool for debugging data servers and prototyping client software, using the Java client library's user interface components and its catalog and data access APIs.

A key to successful use of scientific datasets is providing use metadata, especially georeferencing metadata, which allows client software to manipulate and visualize datasets, and to overlay and compare data from different sources. We have helped develop and promulgate georeferencing metadata conventions for netCDF datasets, such as the CF Conventions for model data. We have also developed extensions to the netCDF data model and implemented libraries which automatically recognize and extract georeferencing information in many of the important netCDF and OpenDAP datasets.

We have also developed extensions to the NetCDF Markup Language (NcML) that allow metadata to be added, deleted or changed in netCDF and OpenDAP datasets, and that allow netCDF files to be subset or aggregated. This capability has been added to the OpenDAP aggregation server, providing a powerful tool for third-party metadata augmentation, in addition to the ability to add metadata in the Inventory Catalogs.
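The effect of third-party metadata augmentation can be sketched as an overlay applied to a dataset's attributes without touching the original. The overlay format and the apply_overlay function below are hypothetical and are not NcML syntax; they only illustrate adding, changing and deleting metadata.

```python
def apply_overlay(attributes, overlay):
    """Return a modified copy of attributes; the original mapping is left unchanged."""
    result = dict(attributes)
    for op in overlay:
        if op["action"] == "set":
            # Adds the attribute if it is missing, changes it otherwise.
            result[op["name"]] = op["value"]
        elif op["action"] == "remove":
            result.pop(op["name"], None)
        else:
            raise ValueError(f"unknown action: {op['action']!r}")
    return result


# Hypothetical attributes as found in the original dataset.
original = {"title": "run 12", "units": "degK", "history": "converted from GRIB"}

# Hypothetical overlay supplied by someone other than the data provider.
overlay = [
    {"action": "set", "name": "units", "value": "K"},
    {"action": "set", "name": "Conventions", "value": "CF-1.0"},
    {"action": "remove", "name": "history"},
]

augmented = apply_overlay(original, overlay)
print(augmented)
```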

Status update 08/15/2006

THREDDS Data Server (TDS): We are at stable release 3.12.
1. NetcdfServer: subset NCEP GRIB models and return NetCDF/CF files (Example). This is being tested and used by CUAHSI and others.
2. A major rewrite (CrawlableDataset) allows parts of the TDS code to be used in the new OPeNDAP Server 4 to generate THREDDS catalogs as an integral part of OPeNDAP servers.
3. OAI harvesting added, both DIF and ADN records improved. Motherlode records exported to GCMD, DLESE.
4. Improvements on InvDatasetScan:
  1. Generates last modified dates in catalogs which allows HTML view to display the date.
  2. "Latest" dataset now determined by file name and last modified time to give incoming files time to finish arriving.
5. Continue to test and improve TDS/NcML Aggregation along with partners such as Pacific Fisheries Environmental Laboratory (Roy Mendelssohn).
6. WCS Server was successfully used in the GALEON experiment for "WCS gateways to netCDF datasets".
7. HTML catalog view now correctly sets the Last Modified field in dataset listings.
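The "Latest" rule in item 4 above can be sketched as follows: ignore files that were modified too recently, then take the last remaining file by name. The settle time of 300 seconds, the file names and the pick_latest function are assumptions for illustration, not the values or code used by the TDS.

```python
def pick_latest(files, now, settle_seconds=300):
    """files is a list of (file name, last-modified time in epoch seconds).

    Files modified within the last settle_seconds are skipped, since they may
    still be arriving; among the rest, the greatest file name wins.
    """
    settled = [name for name, mtime in files if now - mtime >= settle_seconds]
    return max(settled) if settled else None


# Hypothetical directory listing; the newest file is still being written.
files = [
    ("model_20060815_0000.grib", 1000),
    ("model_20060815_0600.grib", 2000),
    ("model_20060815_1200.grib", 2950),
]
print(pick_latest(files, now=3000))
```

This relies on file names that sort in time order, as names with embedded timestamps do.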
Common Data Model: We are at stable release 2.2.16.
1. BUFR files are being decoded into the CDM, ongoing work to improve and add tables.
2. New Radar Datatype interface and implementations for DORADE, NEXRAD 2 and NEXRAD 3 integrated into IDV release 2.0.
3. Users can now plug in their own coordinate transforms (CoordTransBuilder).
Ongoing work:
1. New kind of NcML Aggregation: Forecast Model Run Collection, for gridded data. GeoGrid will be extended to handle 2 time dimensions, and possibly also an ensemble dimension.
2. GRIB files:
  1. Handle ensemble and GRIB2 error variables.
  2. Standardize coordinates across runs.
3. Working to standardize "Dapper Conventions" for OPeNDAP sequences and nested Structures.
4. The Unidata/LEAD project is integrating the TDR (THREDDS Data Repository) with the TDS.
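The two time dimensions mentioned for Forecast Model Run Collections in item 1 above can be sketched as a table indexed by model run time and forecast hour, where each cell holds the valid time. The function names and sample runs are hypothetical, and this is not the GeoGrid or NcML design.

```python
from datetime import datetime, timedelta


def valid_times(run_times, forecast_hours):
    """Two-dimensional time coordinate: table[run][forecast] is the valid time."""
    return [[run + timedelta(hours=h) for h in forecast_hours] for run in run_times]


def forecasts_for(valid_time, run_times, forecast_hours):
    """List every (run time, forecast hour) pair that predicts valid_time."""
    return [(run, h)
            for run in run_times
            for h in forecast_hours
            if run + timedelta(hours=h) == valid_time]


# Hypothetical collection: two runs, each with 0, 12 and 24 hour forecasts.
run_times = [datetime(2006, 8, 15, 0), datetime(2006, 8, 15, 12)]
forecast_hours = [0, 12, 24]

table = valid_times(run_times, forecast_hours)
pairs = forecasts_for(datetime(2006, 8, 15, 12), run_times, forecast_hours)
print(table[1][2])
print(pairs)
```

The same valid time appears in more than one run, which is why a single time dimension is not enough to describe such a collection.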