THREDDS Relationship to Other Unidata Projects


Draft by Ben Domenico
Last Modified: April 26, 2007

(Note that this is a work in progress, but it is being made available in its current form with the idea that some readers will find parts of it useful)

THREDDS (THematic Real-time Environmental Distributed Data Services):

What's in it for me?

Among the motivations for developing THREDDS are:

 

 

While Unidata's primary focus has always been on providing real-time data and tools for analyzing that data on computer systems in university departments, there has long been a community desire for "seamless" access to retrospective data as well. The idea is that researchers, educators, and students in the Unidata community would benefit from being able to access archived data using the same local tools they use for analysis and display of real-time data. As early as 1989, Unidata had articulated a goal to "maximize the effectiveness of Unidata systems for analyzing historical data from the major national archive centers." As detailed in later sections, hundreds of data provider sites (NCAR, NCDC, and FNMOC among them) have implemented THREDDS technologies to make their data available to users with THREDDS-enabled application programs such as the Unidata Integrated Data Viewer (IDV).

 

In addition to the collections in data centers, many Unidata sites were establishing their own collections and sought to share their own data in a more convenient fashion than the "needdata" email list. The Unidata 2003 proposal, which was actually written in 1997, includes a goal of "utilizing the aggregate data holdings of all Unidata sites as a common resource, accessed via the Internet." This not only allows other sites to gain access to data they may have missed, but also allows some institutions to access the data without having their own IDD systems, as noted in the following paragraph.

 

While Unidata's IDD serves a large number and wide range of departments, it was also recognized that there were departments and institutions that simply did not need and could not support the full power of the IDD "firehose." In some cases, a professor might only teach the meteorology course every other semester. In other cases, the need to administer a Unix computer in order to become a full-blown node on the Unidata IDD was a substantial barrier. Client/server technologies provided a simpler "buy in" that allows participation by a wider range of institutions. As an example, California University of Pennsylvania doesn't have an LDM/IDD feed, but uses the Unidata IDV to access data served on remote sites. The IDV display below renders one type of data available via THREDDS services: GFS numerical weather prediction model output of mean sea level pressure (as color-shaded image and contour lines) and 50 m/s wind speed isosurfaces showing the jet streams.

 

IDV Display of Global Forecast

IDV Globe Display of GFS numerical weather prediction model output

 

Finally, there was a realization that the geosciences were moving toward more interdisciplinary research and education programs where an Earth system approach emphasized the difficult research topics at the boundaries of the traditional disciplines. It quickly became apparent that these disciplines had widely different data systems and, in fact, had fundamentally different ways of thinking about their datasets. In particular, the solid Earth and hydrology communities made extensive use of GIS (Geographic Information Systems), which are built upon relational database technology. In order to facilitate data sharing with the hydrology, coastal oceans, and human impacts communities, it would be necessary to establish interoperability with their data systems. To give a sense of the power afforded by integrating the tools of both communities, the ESRI arcGIS screen shot below shows demographic information as an overlay on real-time, high resolution, local forecast models being run in regions of high precipitation probability and served via THREDDS technologies.

 

Schools and Hospitals in High Precip Region
Figure 4: ArcMap Rendering Schools and Hospitals in Region of High Forecast Precipitation

 

The forecast was generated as part of the LEAD (Linked Environments for Atmospheric Discovery) Large ITR project by the WRF (Weather Research and Forecasting) model on a LEAD node at Unidata. ArcMap tools were used to determine which schools and hospitals were in the areas where the high resolution WRF forecast called for storm total precipitation in excess of 7 inches. This illustrates the power of combining real-time (hourly) weather forecast data with the database tools of the GIS community. Similar overlays could be done with GIS data representing watershed and drainage basins to combine the atmospheric predictions with hydrological measurements.

 

All these motivations have come into play during the evolution of THREDDS.

What is THREDDS?

THREDDS (Thematic Real-time Environmental Distributed Data Services) is middleware to bridge the gap between data users and remote data providers. The goal is to simplify the discovery and use of scientific data and to allow scientific publications and educational materials to reference scientific data.

 

THREDDS’ initial focus was to allow data users to find datasets that are pertinent to their specific education and research needs, access the data, and use them without necessarily downloading the entire file to their local system. To achieve this, we needed a way for data providers to publish lists of what data are available and to describe their data to enable discovery and use.

 

Catalogs are the heart of the THREDDS concept. They are XML documents that describe on-line datasets. Catalogs can contain arbitrary metadata, and we have also defined a standard set of metadata to bridge to discovery centers like GCMD, DLESE and NSDL.
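To make the catalog idea concrete, here is a minimal sketch of how a client might read one. It assumes the InvCatalog 1.0 XML layout (a service element with a base path, and dataset elements with a urlPath) as the author of this example understands it. The sample catalog, the host name server.example.org, and the helper list_datasets are all invented for illustration and are not taken from any real server or THREDDS library.

```python
import xml.etree.ElementTree as ET

# Namespace assumed for THREDDS InvCatalog 1.0 documents.
NS = {"t": "http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"}

# Hypothetical catalog: every name and path below is invented for illustration.
SAMPLE_CATALOG = """<catalog name="Example catalog"
    xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0">
  <service name="dap" serviceType="OPENDAP" base="/thredds/dodsC/"/>
  <dataset name="Example model output">
    <dataset name="Run 00Z" urlPath="model/example_00.nc" serviceName="dap"/>
    <dataset name="Run 12Z" urlPath="model/example_12.nc" serviceName="dap"/>
  </dataset>
</catalog>
"""


def list_datasets(catalog_xml, host):
    """Return (name, access URL) pairs for every dataset that has a urlPath."""
    root = ET.fromstring(catalog_xml)
    # Map each service name to the base path it advertises.
    bases = {s.get("name"): s.get("base") for s in root.findall("t:service", NS)}
    results = []
    for ds in root.iter("{%s}dataset" % NS["t"]):
        url_path = ds.get("urlPath")
        if url_path is None:
            continue  # container dataset: groups others, has no data of its own
        base = bases.get(ds.get("serviceName"), "")
        results.append((ds.get("name"), host + base + url_path))
    return results


if __name__ == "__main__":
    for name, url in list_datasets(SAMPLE_CATALOG, "http://server.example.org:8080"):
        print(name, "->", url)
```

The point of the sketch is that a catalog is ordinary XML: a client only needs to combine the service base with each dataset's path to obtain something it can open.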

 

The THREDDS Catalog Generator produces THREDDS catalogs by scanning or crawling one or more local or remote dataset collections. Catalogs can be generated periodically or on demand, using configuration files that control what directories get scanned, and how the catalogs are created.
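The scanning step can be sketched in a few lines. This is not the THREDDS Catalog Generator itself and does not use its configuration format; it is a simplified stand-in that walks a local directory tree and emits catalog-style XML. The function names build_catalog and _scan, the ".nc" suffix filter, and the catalog name are assumptions made for this example.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Namespace assumed for THREDDS InvCatalog 1.0 documents.
CATALOG_NS = "http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"


def _scan(parent, directory, root_dir, suffix):
    """Add one dataset element per matching file and one container per folder."""
    for entry in sorted(os.listdir(directory)):
        path = os.path.join(directory, entry)
        if os.path.isdir(path):
            child = ET.SubElement(parent, "dataset", {"name": entry})
            _scan(child, path, root_dir, suffix)
        elif entry.endswith(suffix):
            # urlPath is the file location relative to the scanned root.
            rel = os.path.relpath(path, root_dir).replace(os.sep, "/")
            ET.SubElement(parent, "dataset", {"name": entry, "urlPath": rel})


def build_catalog(root_dir, suffix=".nc"):
    """Return catalog XML (as text) describing the files under root_dir."""
    catalog = ET.Element("catalog", {"xmlns": CATALOG_NS, "name": "Scanned collection"})
    _scan(catalog, root_dir, root_dir, suffix)
    return ET.tostring(catalog, encoding="unicode")


if __name__ == "__main__":
    # Build a throwaway collection so the example runs anywhere.
    with tempfile.TemporaryDirectory() as collection:
        os.makedirs(os.path.join(collection, "model"))
        open(os.path.join(collection, "model", "example_00.nc"), "w").close()
        print(build_catalog(collection))
```

Running such a scan periodically, or on each request, corresponds to the "periodically or on demand" generation described above.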

Brief History: Relationship to NSDL, DODS/OPeNDAP, and McIDAS ADDE

THREDDS was initially funded as a National Science Digital Library (NSDL) Collections project.  The idea was to develop a technology that would complement the client/server approach to data access that had been developed by Distributed Oceanographic Data System (DODS at that time, now OPeNDAP).  OPeNDAP is an internet client/server protocol that allows client application programs (like IDL, Matlab and the IDV) to access datasets on remote servers as if the datasets were on the local disk of the workstation.  Where a client application program would normally take the name of a local file, the user simply has to supply a URL pointing to a dataset on a remote OPeNDAP server.  The client program then operates on the remote dataset as if it were a local file. 
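As a rough illustration of "a URL in place of a file name", the sketch below builds request URLs in the style of the DAP2 protocol, where a suffix selects the kind of response and an optional constraint selects part of one variable. The suffixes (dds, das, dods, ascii) and the [start:stride:stop] notation reflect the example author's understanding of DAP2 and are not described in this article. The helper dap_url, the server address, and the variable name are hypothetical.

```python
def dap_url(dataset_url, response="dods", variable=None, slices=()):
    """Build a DAP2-style request URL for a remote dataset.

    response: suffix naming the desired response (assumed: dds, das, dods, ascii).
    variable: optional variable name to request on its own.
    slices:   optional (start, stride, stop) triple per dimension of the variable.
    """
    url = dataset_url + "." + response
    if variable is not None:
        hyperslab = "".join(
            "[%d:%d:%d]" % (start, stride, stop) for start, stride, stop in slices
        )
        url += "?" + variable + hyperslab
    return url


if __name__ == "__main__":
    # Hypothetical dataset location on an OPeNDAP server.
    base = "http://server.example.org/thredds/dodsC/model/example.nc"
    print(dap_url(base, "dds"))
    print(dap_url(base, "ascii", "temperature", [(0, 1, 0), (10, 2, 20)]))
```

An OPeNDAP-enabled client issues requests of this kind behind the scenes, which is why the user only has to supply the dataset URL.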

 

At about the same time, the McIDAS development team at the University of Wisconsin-Madison's SSEC (Space Science and Engineering Center) was augmenting the McIDAS system with a client/server interface called ADDE (Abstract Data Distribution Environment), a remote data access protocol originally developed for geolocated data that communicates requests from client applications to servers, which then return data objects back to the client. Since both the client and server interfaces were part of the McIDAS distribution, it was easy for Unidata sites to simply turn on the McIDAS ADDE server and make their datasets available to sites with the McIDAS client software. Shortly after this version of McIDAS was distributed by Unidata, several dozen Unidata sites became data servers via McIDAS ADDE.

 

Thus remote access protocols such as OPeNDAP and ADDE made it possible for data provider sites to make their datasets available via the internet to users at other sites whose client software was instrumented to use the protocol for accessing the data from remote machines. The general idea was that, instead of specifying the name of a file containing a dataset on the local network, the user of the client software could specify an internet URL that represented a dataset on a remote server machine somewhere on the internet. The advent of these protocols was a significant step toward giving Unidata users "seamless access to retrospective data" as well as the ability to "share their own data."

 

This worked out great for people who happened to know the URL names of datasets on remote servers. However, there was still a problem in that there was no internet equivalent of the local network file system of folders and directories that enables users to find locally stored datasets. These client/server systems needed some form of cataloging system that allowed users to browse data collections on remote servers.


The initial role of THREDDS was to supply tools that would create catalogs of, and provide client access to, the collections of data on remote servers. These catalogs are machine readable lists of datasets available on OPeNDAP servers with enough user-readable metadata to allow users of THREDDS-enabled clients to browse catalogs of remote datasets just as they browse the file system on their own workstations. The inventory level catalogs also supplied the use metadata required to enable client software to do reasonable things with the data once it was accessed. Early on in the project, it was recognized that the simple inventory list catalogs were not sufficient. In fact, a hierarchy of catalogs is needed so that groups of inventories could be catalogued at a higher level. For example, all the inventory catalogs for output from NCEP forecast models can be grouped into an NCEP model catalog. LEAD model output can be grouped into a LEAD model output catalog, etc. These catalogs of catalogs can be grouped at higher levels. To get a sense of how this works in practice, you can browse the THREDDS Top Level Catalog on Motherlode: http://motherlode.ucar.edu:8080/thredds/catalog.html
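The "catalogs of catalogs" idea can be sketched as a recursive walk. The example assumes that a catalog points at another catalog through a catalogRef element carrying an xlink:href attribute, which is the example author's understanding of the InvCatalog 1.0 layout. The three catalogs are held in memory rather than fetched from a server, and their URLs and dataset names are invented; only the NCEP and LEAD groupings come from the text above.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

THREDDS = "{http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0}"
XLINK = "{http://www.w3.org/1999/xlink}"

_HEAD = ('<catalog xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0" '
         'xmlns:xlink="http://www.w3.org/1999/xlink" name="%s">')

# Hypothetical catalogs keyed by URL, standing in for documents on a server.
CATALOGS = {
    "http://server.example.org/thredds/catalog.xml": (_HEAD % "Top level") +
        '<catalogRef xlink:title="NCEP model output" xlink:href="ncep/catalog.xml"/>'
        '<catalogRef xlink:title="LEAD model output" xlink:href="lead/catalog.xml"/>'
        '</catalog>',
    "http://server.example.org/thredds/ncep/catalog.xml": (_HEAD % "NCEP model output") +
        '<dataset name="GFS example" urlPath="ncep/gfs_example.nc"/>'
        '<dataset name="NAM example" urlPath="ncep/nam_example.nc"/>'
        '</catalog>',
    "http://server.example.org/thredds/lead/catalog.xml": (_HEAD % "LEAD model output") +
        '<dataset name="WRF example" urlPath="lead/wrf_example.nc"/>'
        '</catalog>',
}


def walk(url, fetch, depth=0):
    """Return an indented outline of a catalog and every catalog it references."""
    root = ET.fromstring(fetch(url))
    outline = ["  " * depth + root.get("name", url)]
    for element in root.iter():
        if element.tag == THREDDS + "dataset" and element.get("urlPath"):
            outline.append("  " * (depth + 1) + element.get("name"))
        elif element.tag == THREDDS + "catalogRef":
            # Relative references are resolved against the referring catalog's URL.
            child_url = urljoin(url, element.get(XLINK + "href"))
            outline.extend(walk(child_url, fetch, depth + 1))
    return outline


if __name__ == "__main__":
    top = "http://server.example.org/thredds/catalog.xml"
    print("\n".join(walk(top, CATALOGS.__getitem__)))
```

Passing the fetch function as a parameter keeps the walk independent of how catalogs are retrieved; a real client would substitute an HTTP request for the dictionary lookup.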

Below is a screenshot of the top level THREDDS catalog on the Unidata motherlode server.


 

Similar THREDDS catalogs are available at many other sites, including the NCDC NOMADS (NOAA National Operational Model Archive & Distribution System) site

 

http://nomads.ncdc.noaa.gov:8085/thredds/catalog.html

 

Fleet Numerical Meteorology and Oceanography Center's US GODAE (Global Ocean Data Assimilation Experiment) site

 

http://motherlode.ucar.edu:8080/thredds/catalogServices?catalog=http://usgodae1.usgodae.org/argo_catalog.xml

 

and the NCAR Community Data Portal

 

http://motherlode.ucar.edu:8080/thredds/catalogServices?catalog=http://dataportal.ucar.edu/metadata/ucar.thredds

Search Systems and Third Party Catalogs

 

Another way to look at these catalogs is that they are textual documents that have special pointers to binary datasets that can be accessed via client application software using special protocols like OPeNDAP.  Two important THREDDS capabilities relate to this characteristic.
 


This use of the THREDDS technology can take the form of a web publication describing a scientific phenomenon with embedded pointers that initiate THREDDS-enabled client software such as the IDV and have it bring in data from remote THREDDS-enabled servers. At present, one needs a properly configured workstation to take advantage of these "data interactive" or "compound" publications, but it can be done at least for Java-based THREDDS clients. Some examples of such data interactive documents are listed at http://www.unidata.ucar.edu/projects/THREDDS/DataPublications/

netCDF (Network Common Data Form)

From early on, OPeNDAP and THREDDS were closely tied to netCDF, which is an interface for array-oriented data access and a library that provides an implementation of the interface. The netCDF library also defines a machine-independent format for representing scientific data. Together, the interface, library, and format support the creation, access, and sharing of scientific data. It is by far the most widely used Unidata technology.

In its original implementation, OPeNDAP provided a special version of the netCDF library interface for applications like IDL and Matlab to link to. Because those desktop applications were already set up to access local netCDF files, the special OPeNDAP-enabled libraries allowed them to access remote files on OPeNDAP servers without any changes to the IDL or Matlab code. From the beginning, OPeNDAP leveraged the netCDF interface.

One thing to be aware of is that the Java implementation of the netCDF interface is a separate implementation that is used for experimenting with new features and facilities. As a result, some capabilities are available in Java netCDF that are not yet incorporated into the others.

NetCDF-Java 2.2 is a 100% Java library which includes a prototype implementation of the Common Data Model (CDM). This netCDF API supports access to several file formats:

and provides access to THREDDS catalogs.

HDF (Hierarchical Data Format)

The Hierarchical Data Format (HDF) came into being shortly after the netCDF.  Curiously, part of the reason for developing the HDF was the mistaken impression that netCDF was going to be marketed as a commercial product to recover development costs -- as was the case with many software packages developed in the 80s.  According to the web site: The HDF software includes I/O libraries and tools for analyzing, visualizing, and converting scientific data. There are two HDF formats, HDF (4.x, generally known as HDF4, and previous releases) and HDF5. These formats are completely different and NOT compatible. (NOTE: There are no plans to drop support for HDF 4.x.)

HDF provides many functions similar to netCDF, but in broad-brush terms, it has many more features than netCDF and, as a consequence, it is a more complicated interface. For many years, there were many pleas from users (but no funding) to bring the two technologies together. In the end, NASA did fund a joint project between Unidata and the HDF Group to develop a netCDF4 that would enable access to data stored in HDF5 files. The netCDF4 components are complete but await the HDF5 read/write components.

CDM (Common Data Model)

Experiences with netCDF development and support within Unidata, OPeNDAP as the support center and in conjunction with THREDDS, as well as HDF as part of the netCDF4 project, led Unidata to consider the advantages of various characteristics of the data models associated with each technology. According to John Caron, the primary CDM architect, at the data access level, the CDM maintains as much as possible of the elegance of the netCDF-3 interface, but adds important features from OPeNDAP and HDF, most notably:


The CDM is implemented in Java netCDF. For those acquainted with UML diagram representations of data models:

Common Data Model (data access layer) UML Diagram




Standards-based Interfaces

Interoperability with the GIS (Geographic Information Systems) community has been a primary focus of the second generation THREDDS. The avenue THREDDS has taken is that of open standards web services protocols -- namely those developed by the Open Geospatial Consortium (OGC). Because of the need to make image and gridded forecast datasets available, the main thrust of the initial effort was on the Web Coverage Service (WCS) specification. Unidata spearheaded a specific OGC Interoperability Experiment, called GALEON (Geo-interface for Air, Land, Environmental, Oceans NetCDF). Most of the activity of the first phase of GALEON has focused on testing the interactions between WCS clients and servers for netCDF datasets, modifying those client and server implementations based on the testing, and recommending modifications and augmentations of the relevant OGC interfaces where appropriate. The status of these implementations is described on the GALEON wiki Implementation and Progress Page: http://galeon-wcs.jot.com/WikiHome/Implementation%20Progress%20Page.

The overall GALEON target is this general goal of interoperability via standards-based web services interfaces. But there is one rather more specific objective that involves using these interfaces as the basis for a gateway between traditional GIS applications and datasets available in existing servers in the FES community. These servers, which number in the hundreds, are based on a set of client-server protocols that have evolved in the FES community over the last decade. The basic building blocks are NetCDF, OPeNDAP, ADDE, and THREDDS technologies, but there are other services built on these, for example LAS, GDS, and INGRID. Together they make a wide variety and large volume of data available to existing client applications. So a key aim of GALEON is to expand the usefulness of these servers by adding a standards-based interface to provide a gateway so that WCS clients can access the datasets.


The diagram below is a schematic of this gateway implementation as it was envisioned at the time GALEON was initiated. Since that time, development work has integrated the underlying THREDDS/OPeNDAP services into a package called the THREDDS Data Server (TDS), which has a rudimentary WCS interface built in.

 


Initial Concept of WCS-interface as a Gateway to Existing FES Services


The GALEON experiments have resulted in many recommended changes to the OGC WCS specification. Among the most important to the users of atmospheric and oceanographic netCDF datasets are:



TDS (THREDDS Data Server)

The THREDDS Data Server (TDS) integrates many of the technologies described in the above sections into a distributable, supported software package. As the web page indicates, TDS is a web server that provides metadata and data access for scientific datasets, building on and extending a number of existing technologies:
  1. THREDDS Dataset Inventory Catalogs are used to provide virtual directories of available data and their associated metadata. These catalogs can be generated dynamically or statically.
  2. The Netcdf-Java library reads NetCDF, OpenDAP, and HDF5 datasets, as well as other binary formats such as GRIB and NEXRAD into a "Common Data Model" (CDM). This is an abstract data model that the netCDF (Unidata), HDF5 (NCSA) and OPeNDAP (University of Rhode Island) developers are using to converge their respective data models. The CDM also adds "Georeferencing Coordinate Systems" and specialized "Scientific Data Type" layers, which provides the semantics needed to convert datasets to other protocols and formats such as those required by GIS systems. The library adds this information by parsing well known "attribute conventions", and by using THREDDS metadata to add missing coordinate system information and other metadata.
  3. An integrated server provides OpenDAP access to any datasets that can be read through the Netcdf-Java library. OpenDAP is a widely used, subsetting data access method built on the HTTP (web) protocol.
  4. An integrated server provides bulk file access through the HTTP protocol.
  5. An integrated server provides data access through the OpenGIS Consortium (OGC) Web Coverage Service (WCS) protocol for any "gridded" dataset whose coordinate system information is complete. Users can add missing information to a dataset where needed, in order to make this work.
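To show what a request through the WCS interface in item 5 might look like, the sketch below assembles a GetCoverage URL. The parameter names (service, version, request, coverage, bbox, time, format) follow WCS 1.0.0 as the example author understands it; a real server may require further parameters such as a coordinate reference system. The endpoint path, the coverage name, the "NetCDF3" format label, and the helper wcs_get_coverage_url are assumptions for illustration only.

```python
from urllib.parse import urlencode


def wcs_get_coverage_url(endpoint, coverage, bbox, time=None, fmt="NetCDF3"):
    """Build a WCS 1.0.0-style GetCoverage request URL.

    endpoint: WCS service address for one dataset (hypothetical in this example).
    coverage: name of the gridded field to retrieve.
    bbox:     (min_lon, min_lat, max_lon, max_lat) of the requested region.
    time:     optional time position, given as text.
    fmt:      output format label; the default is an assumed name.
    """
    params = [
        ("service", "WCS"),
        ("version", "1.0.0"),
        ("request", "GetCoverage"),
        ("coverage", coverage),
        ("bbox", ",".join(str(value) for value in bbox)),
        ("format", fmt),
    ]
    if time is not None:
        params.append(("time", time))
    # urlencode percent-escapes reserved characters such as commas and colons.
    return endpoint + "?" + urlencode(params)


if __name__ == "__main__":
    print(wcs_get_coverage_url(
        "http://server.example.org/thredds/wcs/model/example.nc",
        "Pressure",
        (-130, 20, -60, 55),
    ))
```

Because the request is a plain HTTP URL with standard parameters, any WCS-capable client, including GIS tools, can issue it without knowing how the data are stored on the server.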


The THREDDS Data Server is implemented in 100% Java, and is contained in a single war file, which allows very easy installation into the open-source Tomcat web server. This means that users can implement the entire package on nearly any computing system. The OAI Harvester refers to a system which can gather metadata using a standard interface defined by the Open Archives Initiative (OAI) -- another example of open standards that enable different groups to collaborate with others by implementing standard web interfaces. In this case the OAI protocol allows data discovery sites to gather metadata from TDS sites.



THREDDS Data Server
THREDDS Data Server Schematic

Because the TDS enables data access via most formal and de facto standard interfaces, it allows users to access a wide variety of data using the tools with which they are familiar.  These range from browser based tools such as the Live Access Server (LAS) to powerful desktop applications such as the Unidata IDV described in a subsequent section.  One set of client applications that are of particular interest in the interoperability context are the ESRI arcGIS products.  In release 9.2 of arcGIS, the tools have the ability to read and write netCDF files that conform to the CF (Climate and Forecast) conventions.  Slated for the 9.3 release is remote data access via the WCS protocol.  That means that traditional GIS users can access netCDF weather and oceanographic datasets right now if the data are local and they will be able to access them remotely via WCS in the next release.

Integrated Data Viewer (IDV)

Unidata's Integrated Data Viewer (IDV) is a Java(TM)-based software framework for analyzing and visualizing geoscience data. The IDV brings together the ability to display and work with satellite imagery, gridded data, surface observations, balloon soundings, NWS WSR-88D Level II and Level III radar data, and NOAA National Profiler Network data, all within a unified interface. For the Unidata community (and many others as it turns out), it provides a powerful desktop analysis and display application that can access datasets that reside on remote THREDDS Data Servers.

THREDDS-related Projects

Another article lists a number of other projects related in some way to THREDDS.

 http://www.unidata.ucar.edu/projects/THREDDS/GALEON/Reports/RelatedTechnologies.html

An overview of GALEON (Geo-interface for Air, Land, Environmental, Oceans NetCDF) is also available

 

http://www.unidata.ucar.edu/projects/THREDDS/GALEON/Reports/GALEONoverview.htm