THREDDS is needed because of the differences between data providers and data consumers

THREDDS Technical status report

May 12, 2003

Overview

THREDDS fundamentally provides middleware services to bridge the gap between data providers and data consumers. We do this by developing tools and services for both data providers and consumers, as well as services that sit between them. We are also involved in developing and enhancing some of the underlying data access software tools, libraries and protocols, and in influencing how data providers and clients use them.

Accomplishments for Current Funding Cycle

Dataset Inventory Catalogs are XML documents that allow a data provider to simply list available on-line datasets. The catalog creator can group datasets into a simple hierarchical classification scheme, which makes a catalog into a “logical data directory”. At a minimum, the catalog specifies the “human readable” dataset name, and how to access it. The catalog also provides a place to add arbitrary metadata about the dataset. We are focusing on enhancing selected datasets by adding space and time bounding boxes, standard names, and data type information. Catalogs can be static XML files, or dynamically generated by Web servers to track continuously changing datasets.
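As a rough illustration of the "logical data directory" idea, the following sketch walks a small hierarchical catalog and lists each accessible dataset with its position in the hierarchy, its access URL, and any attached metadata. The element names, attribute names, and URLs are hypothetical and simplified for illustration; they are not taken from the actual Dataset Inventory Catalog schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified catalog. Element and attribute names are
# illustrative only and do not follow the real catalog schema.
CATALOG_XML = """
<catalog name="Example catalog">
  <dataset name="Model output">
    <dataset name="Global forecast" url="http://example.org/data/global.nc"/>
    <dataset name="Regional forecast" url="http://example.org/data/regional.nc">
      <metadata key="dataType" value="Grid"/>
    </dataset>
  </dataset>
  <dataset name="Station observations" url="http://example.org/data/stations.nc"/>
</catalog>
"""

def list_datasets(element, path=()):
    """Yield (hierarchy path, access URL, metadata dict) for each accessible dataset."""
    for child in element.findall("dataset"):
        child_path = path + (child.get("name"),)
        url = child.get("url")
        if url is not None:
            metadata = {m.get("key"): m.get("value") for m in child.findall("metadata")}
            yield child_path, url, metadata
        # A dataset may also act as a container for nested datasets.
        yield from list_datasets(child, child_path)

entries = list(list_datasets(ET.fromstring(CATALOG_XML)))
for path, url, metadata in entries:
    print(" / ".join(path), "->", url, metadata)
```

The hierarchy is carried purely by nesting, so a client needs no knowledge of the provider's file layout to present the datasets as a browsable tree.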

THREDDS Servers are data servers that have Dataset Inventory Catalogs associated with them. The primary focus of THREDDS has been developing these servers in collaboration with our data provider partners. Current servers include ones at IRI/LDEO (Columbia), SSEC (Madison), NOAA-CIRES Climate Diagnostics Center, and NCAR. Currently there are over a dozen inventory catalogs available on four THREDDS servers with more than 10,000 datasets listed in the catalogs. We expect those numbers to double by the end of the funding period.

The THREDDS/IDD Server makes much of the real-time data coming in on the Unidata IDD available on a THREDDS server. This includes the NCEP model data, satellite data from NOAAPORT and the Unidata/Wisconsin data streams, NEXRAD Radar, Profiler data from NOAA/FSL, as well as METAR, upper air, buoy, SAO and SHEF hydrology station data. The THREDDS/IDD Server will become part of an enhanced LDM that will be available to the Unidata community of 150 IDD users.

We have worked extensively with OpenDAP/DODS developers, and the next version of OpenDAP servers will have integrated THREDDS Catalogs, so the next generation of OpenDAP servers will automatically be THREDDS servers. We have also developed the THREDDS OpenDAP Aggregation Server, an OpenDAP data server that aggregates OpenDAP datasets, serves netCDF datasets, and already has THREDDS catalogs integrated. The Live Access Server from NOAA/PMEL is a Web server that provides access to and visualization of scientific data. It is currently being modified to provide THREDDS catalogs for its data.

Another key THREDDS component for data providers is the Catalog Generator, which scans file directories and generates THREDDS catalogs automatically. This is a highly configurable tool that gives users control over the arrangement and naming of their datasets, adding metadata, extracting information from the datasets, etc. The Catalog Validator provides XML and semantic validation of Catalogs, as well as verification of the datasets themselves.
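The directory-scanning step can be sketched as follows. This is a minimal illustration under stated assumptions, not the Catalog Generator itself: the function name build_catalog, the file suffix filter, the output element names, and the example.org base URL are all hypothetical.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

def build_catalog(root_dir, base_url, suffix=".nc"):
    """Mirror a directory tree as nested <dataset> elements (hypothetical layout)."""
    catalog = ET.Element("catalog", name="Generated catalog")

    def scan(directory, parent):
        for name in sorted(os.listdir(directory)):
            full_path = os.path.join(directory, name)
            if os.path.isdir(full_path):
                # A subdirectory becomes a grouping dataset with no URL.
                scan(full_path, ET.SubElement(parent, "dataset", name=name))
            elif name.endswith(suffix):
                relative = os.path.relpath(full_path, root_dir).replace(os.sep, "/")
                ET.SubElement(parent, "dataset", name=name, url=f"{base_url}/{relative}")

    scan(root_dir, catalog)
    return catalog

# Demonstration on a throwaway directory tree.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "model"))
    for relative in ("model/run1.nc", "model/run2.nc", "notes.txt"):
        open(os.path.join(root, *relative.split("/")), "w").close()
    generated = build_catalog(root, "http://example.org/data")

catalog_text = ET.tostring(generated, encoding="unicode")
print(catalog_text)
```

A configurable generator would add hooks at the two branch points above, for example to rename groups, skip directories, or open each file and extract metadata from it.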

The ADDE Cataloger is a middleware service that constructs Catalogs for ADDE/McIDAS data servers. It provides “virtual dataset” services, for example, a dataset named “latest” or “last 3 hours”, along with a resolver service to translate a virtual dataset into a list of actual datasets available on the ADDE server. This level of indirection is important for real-time and very large datasets, in order to provide users with the ability to choose datasets of the right granularity.
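The resolver's level of indirection might look like the sketch below. The inventory contents, the dataset names, and the resolve function are hypothetical; the actual resolver service and its interface to an ADDE server are not described here.

```python
from datetime import datetime, timedelta

# Hypothetical inventory: (dataset name, observation time) pairs held by a server.
INVENTORY = [
    ("sat_20030512_0900", datetime(2003, 5, 12, 9, 0)),
    ("sat_20030512_1000", datetime(2003, 5, 12, 10, 0)),
    ("sat_20030512_1100", datetime(2003, 5, 12, 11, 0)),
    ("sat_20030512_1200", datetime(2003, 5, 12, 12, 0)),
]

def resolve(virtual_name, inventory, now):
    """Translate a virtual dataset name into a list of actual dataset names."""
    ordered = sorted(inventory, key=lambda item: item[1])
    if virtual_name == "latest":
        return [ordered[-1][0]] if ordered else []
    if virtual_name == "last 3 hours":
        cutoff = now - timedelta(hours=3)
        return [name for name, time in ordered if cutoff <= time <= now]
    raise KeyError(f"unknown virtual dataset: {virtual_name}")

now = datetime(2003, 5, 12, 12, 30)
latest = resolve("latest", INVENTORY, now)
recent = resolve("last 3 hours", INVENTORY, now)
print(latest, recent)
```

Because the virtual name is resolved at request time, a catalog entry such as "latest" stays valid while the underlying inventory keeps changing.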

Dataset Query Capability XML documents are used by middleware services such as the ADDE Cataloger and the THREDDS/IDD Server to specify succinctly what datasets are available from a data server. They allow data providers to specify the set of orthogonal choices (for example: station, field, time) that an end-user should make to select from a large and/or real-time collection of datasets. They also allow data clients to present appropriate choices to their users in a user interface without knowing anything specific about the server.
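To illustrate how a client could use such a description without server-specific knowledge, the sketch below represents the orthogonal choices as a plain Python dictionary and builds a request from one choice per selector. The dictionary layout, the selector values, the query URL, and the build_query function are hypothetical stand-ins for the XML document and the real request format.

```python
from urllib.parse import urlencode

# Hypothetical description of what a server offers: one list of allowed
# values per orthogonal selector.
QUERY_CAPABILITY = {
    "base": "http://example.org/query",
    "selectors": {
        "station": ["DEN", "BOU", "SFO"],
        "field": ["temperature", "pressure"],
        "time": ["latest", "last 3 hours"],
    },
}

def build_query(capability, **choices):
    """Validate one choice per selector and return the resulting query URL."""
    selectors = capability["selectors"]
    unknown = set(choices) - set(selectors)
    if unknown:
        raise ValueError(f"unknown selectors: {sorted(unknown)}")
    for name, allowed in selectors.items():
        if name not in choices:
            raise ValueError(f"missing choice for selector: {name}")
        if choices[name] not in allowed:
            raise ValueError(f"{choices[name]!r} is not offered for {name}")
    # Emit parameters in the order the capability document lists them.
    params = [(name, choices[name]) for name in selectors]
    return capability["base"] + "?" + urlencode(params)

url = build_query(QUERY_CAPABILITY, station="DEN", field="temperature", time="latest")
print(url)
```

A user interface can be generated from the same structure by rendering one selection widget per selector, populated with the allowed values.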

Catalogs are read by the Dataset Searcher, which provides a programmatic interface for searching by space and time bounding boxes, standard names, data type and server type. People can also search for datasets through a web interface. This is a prototype system that will be developed further in the future.
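A search by space and time bounding boxes reduces to interval-overlap tests, as in the sketch below. The Entry class, its field names, and the sample records are hypothetical and do not reflect the Dataset Searcher's actual programmatic interface. For simplicity the sketch also assumes that no bounding box crosses the date line.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class Entry:
    """Hypothetical search record built from one catalog dataset."""
    name: str
    data_type: str
    west: float
    east: float
    south: float
    north: float
    start: datetime
    end: datetime

def search(entries, west, east, south, north, start, end, data_type: Optional[str] = None):
    """Return names of entries whose space and time extents overlap the query."""
    hits = []
    for e in entries:
        overlaps_space = (e.west <= east and e.east >= west
                          and e.south <= north and e.north >= south)
        overlaps_time = e.start <= end and e.end >= start
        matches_type = data_type is None or e.data_type == data_type
        if overlaps_space and overlaps_time and matches_type:
            hits.append(e.name)
    return hits

ENTRIES = [
    Entry("North Atlantic SST", "Grid", -80, 0, 0, 60,
          datetime(2003, 1, 1), datetime(2003, 3, 31)),
    Entry("Pacific winds", "Grid", 120, 180, -30, 30,
          datetime(2003, 4, 1), datetime(2003, 5, 12)),
    Entry("Global station reports", "Station", -180, 180, -90, 90,
          datetime(2003, 5, 10), datetime(2003, 5, 12)),
]

hits = search(ENTRIES, west=-30, east=-10, south=30, north=50,
              start=datetime(2003, 3, 1), end=datetime(2003, 5, 11))
grid_hits = search(ENTRIES, -30, -10, 30, 50,
                   datetime(2003, 3, 1), datetime(2003, 5, 11), data_type="Grid")
print(hits, grid_hits)
```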

The THREDDS Dataset Exporter creates “resource records” appropriate to add to Digital Libraries such as DLESE, NSDL and GCMD. This prototype system uses special metadata records that are added to the datasets in a catalog, which specify the additional information needed by the DL, such as Dublin Core or DIF formats. The Dataset Exporter uses the Open Archives Initiative (OAI) protocol to send these records into the DLESE and NSDL databases. (Q: How does it send into GCMD?)

THREDDS clients are application programs that know how to read THREDDS Catalogs and know how to read data using some or all of the THREDDS data server types, such as OpenDAP, ADDE, netCDF, etc. The Integrated Data Viewer (IDV), also developed at Unidata, is a full-featured analysis program capable of advanced 3D visualization based on the VisAD library. VGEE is an educational content development system built on top of the IDV. New Media Studios is another educational content development framework which uses Macromedia Director and IDL, and is now in the process of being made THREDDS capable. The THREDDS Data Viewer is a tool for debugging data servers and prototyping client software, using the Java client library user interface components and catalog and data access APIs. We expect to use this library to THREDDS-enable the OpenDAP Data Connector software and other Java clients.

A key to successful use of scientific datasets is providing use metadata, especially georeferencing metadata, which allows client software to manipulate and visualize datasets, and to overlay and compare data from different sources. We have helped develop and promulgate georeferencing metadata conventions for netCDF datasets, such as the CF Conventions for model data. We have also developed extensions to the netCDF data model and implemented libraries which automatically recognize and extract georeferencing information in many of the important netCDF and OpenDAP datasets.
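As one simple example of recognizing georeferencing information, the CF Conventions identify latitude and longitude coordinates by their units strings, such as degrees_north and degrees_east. The sketch below applies that rule to a dictionary of variable attributes. It is only an illustration of the idea: the variable names and the find_horizontal_coordinates function are hypothetical, and the recognition logic in the libraries described above is not specified here and is likely more extensive.

```python
# Units strings that the CF Conventions accept for latitude and longitude.
LAT_UNITS = {"degrees_north", "degree_north", "degree_N",
             "degrees_N", "degreeN", "degreesN"}
LON_UNITS = {"degrees_east", "degree_east", "degree_E",
             "degrees_E", "degreeE", "degreesE"}

def find_horizontal_coordinates(variables):
    """Return (latitude names, longitude names) found by inspecting units."""
    lats = [name for name, attrs in variables.items()
            if attrs.get("units") in LAT_UNITS]
    lons = [name for name, attrs in variables.items()
            if attrs.get("units") in LON_UNITS]
    return lats, lons

# Hypothetical description of the variables in one dataset.
VARIABLES = {
    "lat": {"units": "degrees_north"},
    "lon": {"units": "degrees_east"},
    "time": {"units": "hours since 2003-05-12 00:00:00"},
    "temperature": {"units": "K"},
}

lat_names, lon_names = find_horizontal_coordinates(VARIABLES)
print(lat_names, lon_names)
```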

We have also developed extensions to the NetCDF Markup Language (NcML) that allow metadata to be added, deleted or changed in netCDF and OpenDAP datasets, and allow netCDF files to be subset or aggregated. This capability has been added to the OpenDAP aggregation server, providing a powerful tool for third-party metadata augmentation, in addition to the ability to add metadata into the Inventory Catalogs.
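The effect of third-party metadata augmentation can be sketched as a list of edits applied on top of a dataset's original attributes, leaving the source data untouched. The edit tuples, the attribute names, and the apply_edits function below are hypothetical and are not NcML syntax; they only illustrate the add, change and delete operations described above.

```python
def apply_edits(attributes, edits):
    """Return a modified copy of the attributes; the original dict is not changed."""
    result = dict(attributes)
    for edit in edits:
        operation, name = edit[0], edit[1]
        if operation == "set":
            # "set" covers both adding a new attribute and changing an existing one.
            result[name] = edit[2]
        elif operation == "remove":
            result.pop(name, None)
        else:
            raise ValueError(f"unknown operation: {operation}")
    return result

# Hypothetical attributes as stored in the original dataset.
ORIGINAL = {"title": "run 17", "history": "created by model v2", "units": "degK"}

# Hypothetical edits supplied by a third party, not by the data provider.
EDITS = [
    ("set", "title", "Global forecast, 12 May 2003"),  # change
    ("set", "Conventions", "CF"),                      # add
    ("remove", "history"),                             # delete
]

augmented = apply_edits(ORIGINAL, EDITS)
print(augmented)
```

Keeping the edits separate from the data is what allows someone other than the original provider to improve a dataset's metadata.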