
THematic Real-time Environmental Distributed Data Services
(THREDDS)

Submitted to NSF NSDL Collections Track: April 11, 2001

(The format of this version has been adapted for Web publication but the content remains the same as the one submitted)

A. Project Summary

 

We propose the construction of a prototype system for Thematic Real-time Environmental Distributed Data Services (THREDDS) that will make it possible for educators and researchers to publish, locate, analyze, and visualize a wide variety of environmental data in both their classrooms and laboratories.    Just as the World Wide Web and digital-library technologies have simplified the process of publishing and accessing multimedia documents, THREDDS will provide needed infrastructure for publishing and accessing scientific data in a similarly convenient fashion.

 

THREDDS will establish both an organizational infrastructure and a software infrastructure.  A team of data providers, software tool developers, and metadata experts will work together to develop a software framework that allows users to publish, find, analyze, and display data residing on remote servers. The software framework, based on a concept of publishable data inventories and catalogs, will tie together a set of technologies already in use in existing, extensive collections of environmental data: client/server data-access protocols from the University of Rhode Island and the University of Wisconsin-Madison, Unidata’s real-time Internet Data Distribution system, the discovery system at the Digital Library for Earth System Education (DLESE) and an extensive set of client visualization tools.

 

The heart of THREDDS, however, is metadata contained in the publishable inventories and catalogs (PICats).  Based on the eXtensible Markup Language (XML), PICats can be created in many different ways.  Sites receiving real-time environmental data will instrument decoders to create PICats describing data products as they arrive.  Crawlers will be implemented to create PICats by traversing existing retrospective data collections.  Since the PICats do not have to reside on the server with the data, researchers will be able to create PICats for research publications that point to datasets residing on several data servers.  Educators will incorporate PICats of illustrative datasets into educational modules that also include the tools for data analysis and visualization.  Indeed, students will eventually be able to use PICats to point to datasets related to their research projects, just as they now use URLs to point to relevant documents.  Since they are text-based, PICats can be “harvested” and indexed in digital libraries, both by specialized tools that make use of their internal structure and semantic content and by tools similar to those used by current document search engines.

 

We have a large set of collaborators committed to working together on the development and implementation of this technology into their data servers, client analysis and display applications, and, ultimately, into the NSDL through DLESE.

 

In brief, THREDDS represents a broadly based community effort, managed by the Unidata Program, to enable learners, educators, and researchers, regardless of their institution’s size, in-house computer expertise, or academic level, to publish, find, and use current and retrospective environmental data.  THREDDS moves data publication, discovery, and usage from the arcane (where location, formats, and filename conventions must be known) to the mundane (where the underlying complexities are transparent to the users).

 

B. Table of Contents

A. Project Summary

B. Table of Contents

C. Project Description and Results from Prior NSF Support:

THematic Real-time Environmental Distributed Data Services (THREDDS)

C.1. Vision

C.2. Existing Data Systems and Services

C.2.1. Traditional Data Servers (Pull Sites)

C.2.2. Subscription, Event-driven Push Systems

C.2.3. Client/Server Data Access

C.2.4. The Missing Link: Discovery and Usage Metadata

C.3. A THREDDS Strategy

C.4. THREDDS Components

C.4.1. Data Providers

C.4.2. Common server tools

C.4.3. Client Analysis and Display Tools

C.4.4. Metadata Expertise

C.5. Technical Underpinnings

C.5.1. Dataset Inventory

C.5.2. Dataset Description

C.5.3. PICat Servers

C.5.4. Digital Library Services

C.5.5. The Known Challenges

C.6. Technical Statement of Work and Milestones

C.7. Management, Sustainability and Collections Content

C.8. Encouraging Diversity

C.9. Why Should Unidata Do This?

D. References Cited

 

C. Project Description and Results from Prior NSF Support:

THematic Real-time Environmental Distributed Data Services (THREDDS)

C.1. Vision

Data collections are a cornerstone of scientific research and education. New levels of accessing and using data are now achievable because of evolving technologies, even as the amount and variety of Earth system data are increasing daily. Recent parallel progress in the worlds of scientific data management and education-oriented digital libraries is highlighting a common need to discover widely distributed data sets, and to use unfamiliar data meaningfully with a comprehensive set of analysis tools for:

 

·        Visualizing complex, multidimensional data,

·        Integrating and overlaying data from multiple sources, and

·        Gracefully handling coordinate systems, measurable quantities, units of measure, and sampling variations.

 

To address this issue, we envision a prototype scientific data web that will facilitate the publication, discovery, and use of environmental data, just as the World Wide Web has made the publication of and access to textual and multimedia documents simple and straightforward.   

On the publication side, scientists who generate specialized data sets, including data created minute to minute by automated observing platforms, should be able to add to the web with minimal effort by contributing the data to servers using tools that generate the appropriate metadata for cataloging and data-access facilities.

On the discovery and usage side, broad access to data and analysis tools in this scientific data web will enable scientists to publish data sets and create online publications that point directly to them. It will enable educators to work with data in classrooms, faculty to examine and incorporate data from other disciplines, and students to explore and test their ideas using the yardstick of data.  It will provide rich discovery mechanisms, designed to reflect concepts of importance to the education community being served, with data cross-indexed and cross-referenced by multiple themes. We are proposing to build a prototype of this scientific data web, which we are calling THREDDS (Thematic Real-time Environmental Distributed Data Services), as a first step toward achieving this vision.

C.2. Existing Data Systems and Services

Our intention is to build THREDDS on existing technological and organizational components wherever possible.  For example, using Unidata systems, undergraduates at over 100 universities have been working with real-time environmental data in their classrooms and laboratories for over a decade.  Other centers are making huge collections of retrospective data available on Web-accessible servers at the same time.  Still other groups are working on metadata systems in a variety of fields — several of them in the environmental data arena.  Considerable progress has been made in digital libraries, especially for facilitating access to online educational materials. 

In recent years, three basic models for providing users with access to data have evolved: traditional servers where users pull data from the server, push models like Unidata's IDD (which pushes data to participants), and specialized client/server systems.

C.2.1. Traditional Data Servers (Pull Sites)

Pulling data across the Internet from a data server is the most widely implemented model. Pull-based data-server sites range from large NASA Distributed Active Archive Centers and NOAA collections, to discipline-oriented centers such as the Incorporated Research Institutions for Seismology Data Management Center (IRIS DMC), and the National Center for Atmospheric Research (NCAR). These servers operate independently (the data holdings are not cross-referenced except in a few cases) and, although some sites provide data-analysis tools, they generally require users to download data sets and convert them to appropriate formats before in-depth analysis can be done. Most server sites now have some form of catalog browse and search facility that allows users to peruse the site’s holdings via the Web. A few sites are incorporating analysis tools into the server, so users can manipulate the data on the server and have the resulting visualization displayed locally via a Web browser.  Figure 1 illustrates the data server model. 


Figure 1: Typical Data Center Approach

Figure 2: Unidata IDD Receiving Site


C.2.2. Subscription, Event-driven Push Systems

The NSF-funded Unidata Program (Fulker et al., 1997) has implemented a data dissemination model (Domenico et al., 1994) that assumes users know which real-time data products they want; users perform analysis and display on their local systems.  In Unidata's Internet Data Distribution (Unidata:IDD URL) system, users subscribe to certain data streams from a variety of sources. The IDD then delivers data products as soon as they are available from each source. The Unidata Local Data Manager (Unidata:LDM URL), the workhorse behind the IDD, is responsible for relaying the data and can be configured to run decoders locally on receipt of user-selected data products. The decoders convert the data products into formats suitable for analysis and display packages, which Unidata also supplies and supports. In this model, depicted in Figure 2, the data are delivered automatically to the receiving site, converted as they are captured, stored on the users' computers, and analyzed by their applications.
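The subscription-and-decode flow described above can be illustrated with a minimal sketch. It is not the LDM's configuration format or API; the class name, the product identifiers, and the decoder are all hypothetical and exist only to show the idea of matching arriving products against user-selected patterns and decoding them on receipt.

```python
import re


class PushReceiver:
    """Toy receiving site: products arrive unrequested and are decoded on receipt."""

    def __init__(self):
        self._subscriptions = []  # list of (compiled pattern, decoder function)
        self.store = {}           # decoded products, keyed by product identifier

    def subscribe(self, pattern, decoder):
        """Register interest in every product whose identifier matches pattern."""
        self._subscriptions.append((re.compile(pattern), decoder))

    def on_product(self, product_id, payload):
        """Called when a product is delivered; returns how many decoders ran."""
        handled = 0
        for pattern, decoder in self._subscriptions:
            if pattern.search(product_id):
                self.store[product_id] = decoder(payload)
                handled += 1
        return handled


def decode_surface_obs(payload):
    """Hypothetical decoder: turn 'STATION,TEMPERATURE' text into a record."""
    station, temperature = payload.split(",")
    return {"station": station, "temperature_c": float(temperature)}


if __name__ == "__main__":
    receiver = PushReceiver()
    receiver.subscribe(r"^SURFACE/", decode_surface_obs)
    receiver.on_product("SURFACE/KDEN", "KDEN,12.5")   # matches, decoded and stored
    receiver.on_product("RADAR/KFTG", "ignored")       # no subscription, dropped
    print(receiver.store)
```

In this sketch the site never asks for a product; it only declares patterns in advance, which is the essential difference from the pull model.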

C.2.3. Client/Server Data Access

Several data centers are implementing servers that can be accessed over the Internet by appropriately enabled clients. This client/server approach gives users the control and processing power of a local client application while analyzing data on one or more remote servers. In particular, the Distributed Oceanographic Data System (DODS: URL), from the University of Rhode Island (URI), and the Abstract Data Distribution Environment (ADDE: URL), from the Space Science and Engineering Center (SSEC) at the University of Wisconsin-Madison, allow suitably adapted applications to perform data analyses on data holdings at remote sites. Client/server data access is shown schematically in Figure 3. Both DODS and ADDE permit clients to directly access sub-portions of very large remote data sets that, under the pull model, would be difficult to acquire due to bandwidth or storage constraints.
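A short worked calculation shows why access to sub-portions matters. The grid dimensions below are hypothetical (one year of six-hourly fields on a one-degree global grid, stored as 4-byte values) and are not taken from any partner's holdings; the sketch only compares the volume of a whole-file download with that of a regional subset.

```python
def grid_bytes(n_times, n_lats, n_lons, bytes_per_value=4):
    """Size in bytes of a regular (time, lat, lon) grid of fixed-width values."""
    return n_times * n_lats * n_lons * bytes_per_value


# Hypothetical global data set: 1460 time steps, 181 latitudes, 360 longitudes.
FULL_DATASET = grid_bytes(1460, 181, 360)

# Hypothetical request: the same 1460 time steps over a 21 x 41 point region.
REGIONAL_SUBSET = grid_bytes(1460, 21, 41)


if __name__ == "__main__":
    print("whole data set :", FULL_DATASET, "bytes")
    print("regional subset:", REGIONAL_SUBSET, "bytes")
    print("reduction      : %.0fx" % (FULL_DATASET / REGIONAL_SUBSET))
```

Under these assumptions the subset is a small fraction of the full file, which is the saving a client gains when the server, rather than the user, does the extraction.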

C.2.4. The Missing Link: Discovery and Usage Metadata

In spite of progress in establishing large numbers of data centers, real-time delivery systems, and tools for analysis and display, we are still a long way from an environmental data web that is as easy to use as the Web.  The problem is complicated by the difficulty in finding relevant data sets, even for people in the research community.  Indeed, data queries are sometimes hard to pose correctly.   For example, what data are pertinent to understanding El Nino?   Furthermore, the people looking for data often have different points of departure: some may wish to use a search engine via a Web browser, while others may search for data from within the context of a data-analysis tool or even from within an educational module on a particular scientific topic.

 

Even after locating the appropriate data, significant problems often arise: the data may be hard to understand—especially for those unfamiliar with the discipline—and the data may be excruciating to use if they are not in a form compatible with familiar software tools.  Furthermore, visualization and other software tools require data that have machine-readable semantics, such as spatial and temporal referencing, units of measure, parameter names, and so forth.  The “meaning” of a data set depends upon the circumstances of its creation or collection and the perspective and knowledge of the user.  While important historical data are now somewhat easier to access, they remain difficult to find, understand, and use. In addition, they may be poorly organized for some problems.

 

These difficulties are largely the result of inadequate metadata. To find specific data, there has to be enough information describing the data collection to enable a successful search. This discovery metadata is a description of the data in terms a human might use in searching. There also needs to be usage metadata: the machine-readable information that enables an application to learn enough about the format and semantic content of the data set to transform the data into a form that the application can use. This, then, is the technical heart of the proposal: to build the metadata infrastructure into a number of data servers and analysis-and-display tools.

 

Creating metadata is a non-trivial challenge. To be useful, there must be agreement on meaning across a wide spectrum of data sites. Several ongoing efforts are in place to achieve this, such as MARC (a Library of Congress standard; MARC: URL), Dublin Core (Dublin Core: URL), and GILS (an ISO 23950 standard; see GILS: URL). Standardizing metadata for scientific data sets has been a continuing effort among data providers and users as well (Content Standard for Digital Geospatial Metadata [CSDGM: URL], ISO 19115 [URL], Global Change Master Directory [GCMD: URL]), but progress has been slow. We are particularly aware of the work within some data-format and data-provider communities (see NetCDF, HDF, and DODS URLs, for example) that specify metadata conventions for their systems, of work by several groups in the data discovery area (see GCMD, MEL, and NOAAServer URLs, for example), and of the work being done in the digital library community [Digital Library for Earth System Education (DLESE) and Alexandria Digital Library (ADL), for example].  Recently, much attention has been given to Extensible Markup Language (XML) representations of metadata (Sullivan 2001, Frew 2000).  Representatives of the following primary Earth system metadata initiatives will be active participants in the proposed work: Earth System Markup Language (ESML: URL), DIstributed MEtadata System (Yang, n.d.), and the OpenGIS Web Map Server Interfaces Implementation Specification (WebMap: URL).


Figure 3: Client/Server Data Access

Figure 4: THREDDS Schematic


C.3. A THREDDS Strategy

The THREDDS approach builds on the strengths of the community of data providers, visualization tool builders, digital libraries, and metadata experts.  It also provides a mechanism for extending the DLESE discovery system (Sumner et al., 2001) to embrace the metadata in what we refer to as Publishable Inventories and Catalogs (PICats).

 

To accomplish this, we will build two new essential components: a formal definition for PICats and software to facilitate their use.  As described later, PICats will be built using XML transported via HTTP (i.e., on the Web), and will refer to data sets that are usable via DODS, ADDE, or other direct-access methods. Needed software includes tools to create standards-compliant PICats, plug-ins or server-side visualizers to enable the use of PICats in browsers, and components that help developers incorporate PICats into applications.

 

On the discovery side, we will propose and help construct mechanisms for extending the DLESE discovery system to embrace PICats as standard resources.  This will be done as an experiment guided by the DLESE Data Access Working Group (DAWG).  The experiment will add document types and other elements to effectively characterize PICats in the DLESE metadata framework, refine the PICat-creation tools for DLESE compatibility, implement automatic “harvesting” of PICat metadata, test the system from applications and browsers, and, after an iterative refinement, adopt the methodology.  Approval by the DAWG and DLESE Steering Committees will be needed before a recommendation can be made to the National SMETE Digital Library (NSDL) for adoption at that level.

 

 

The THREDDS strategy will facilitate the publication of data sets in a variety of forms since metadata descriptors can be built anywhere to create “virtual aggregations” of data sets and to characterize data sets in meaningful ways (including metadata needed by visualization tools).  This reduces the demand for constructing user-specified files of data, and for disseminating metadata to new users.  Data users would gain multiple views of the same data collections, each tailored to specific education/research contexts in addition to automated access via analysis/visualization tools.

 

The THREDDS approach ties in with DLESE and the tentative NSDL information architecture as shown in the following schematics.  These are based on the work of the NSDL Technical Working Group:

 


 

Figure 5: Tentative NSDL Information Architecture

Figure 6: THREDDS PICat Elements within NSDL Architecture


The publishable inventories and catalogs become part of the NSDL collections.  Indirectly, then, the referenced data sets at the data provider sites are part of NSDL as well.  Likewise, since the PICats and data are accessible through the THREDDS-instrumented analysis and visualization applications and applets, these tools become an alternative interface to NSDL.

 

Our strategy also ties in with expanding flows of real-time data and with the growing Unidata community of users.  The strengths of the LDM software and the IDD system will be employed to populate DODS and ADDE servers with current data and to implement automatic PICat generation.  Unidata’s data-analysis and visualization tools will be enhanced to use data acquired by both push and pull methods, which in turn will increase the scope of available data and, for some users, will reduce the pressures on local storage and networking systems.  We also anticipate that THREDDS will foster an expansion of the Unidata community to encompass a broader disciplinary scope.  Specifically, support for remote data access, and the power of PICats to make data easier to understand and use, will lower the barriers to the use of environmental data.

In addition to Unidata’s IDD system, we have engaged a number of different communities. These are:  data providers who offer a set of diverse holdings and are willing to provide access via the DODS and ADDE client/server protocols; builders of data analysis and display applications who are willing to incorporate THREDDS data-access components into their applications; metadata experts to guide that aspect of the development; and, through DLESE, a connection to the digital library and Earth-system education community.

C.4. THREDDS Components

C.4.1. Data Providers

The following institutions have agreed to be data-server partners (themes and contacts noted parenthetically):

·         NCDC, the National Climatic Data Center (climate, Ben Watkins);

·         NGDC, National Geophysical Data Center (geophysical, Ted Habermann);

·         SSEC, the Space Science and Engineering Center at the University of Wisconsin-Madison (GOES satellite data, Steve Ackerman and Tom Whittaker);

·         IRI/LDEO, International Research Institute/Lamont-Doherty Earth Observatory (climate, oceanographic, Benno Blumenthal);

·         PMEL, the Pacific Marine Environmental Laboratory (oceanographic, marine, Steve Hankin);

·         NCAR, the National Center for Atmospheric Research (atmospheric, oceanographic, Don Middleton);

·         CDC, the Climate Diagnostics Center (climate, Roland Schweitzer);

·         FNMOC, Fleet Numerical Meteorology and Oceanography Center (oceanographic, Dave Dimitriou);

·         GMU/COLA, George Mason University/Center for Ocean-Land-Atmosphere Studies (hydrologic, Menas Kafatos and Ruixin Yang);

·         University of Alabama Huntsville (satellite and hydrology, Sara Graves and Rahul Ramachandran); and

·         The ADDE servers in the Unidata community (real-time atmospheric data, Tom Yoksas).

Note that NCAR and SSEC will serve as testbed sites for server-side software.  As the project progresses and the common underpinnings are tested at the initial sites, additional sites will be added. Sites under consideration are:

·         IRIS DMC, Incorporated Research Institutions for Seismology Data Management Center (seismic, Tim Ahern);

·         University of Oklahoma (radar, Kelvin Droegemeier);

·         ARM (Atmospheric Radiation Measurement, Chris Klaus); and

·         University of Florence (European satellite data, Stefano Nativi).

Letters of commitment from data providers who have agreed to be partners in this project can be found in Section I: Supplementary Materials of this proposal. The data being offered range from climate and weather data to oceanographic, marine, and satellite data. These collections are already in place but there is no methodology for a unified search across all the independent servers. As partners, these data providers have agreed to:

·         Supply the hardware and system administration required to run the THREDDS servers;

·         Provide access to their characteristic data sets via DODS and/or ADDE client/server protocols in addition to more traditional methods (e.g., FTP, tapes);

·         Provide browser/server access to certain data sets via Web-based thin clients such as PMEL’s Live Access Server (LAS) where appropriate;

·         Using LDM/IDD technology (where appropriate), make real-time data available on the server; and

·         Work with Unidata to incorporate systems for expanded metadata to make it easier for users to find data sets and to use them once found.  This is the key component that will tie the server systems together, enable remote clients to find and access the data, and connect the servers with the DLESE discovery system.

C.4.2. Common server tools

THREDDS will build on a number of technological components that are already in place or are under development, independent of this proposal:

·         DODS and ADDE protocols for client/server data access. These have been developed at URI and SSEC, respectively. These data-transport protocols have already been implemented on a large number of data servers nationwide. We propose to build on the current deployment, add to it, and enhance it with a coordinated discovery system.

·         To provide "thin client" access to data via a Web browser, some servers implement part of the data-access system on the data server itself. LAS and INGRID are examples of such thin client data-access systems. In this case THREDDS will work with the developers at PMEL and LDEO/IRI to incorporate coherent metadata systems into LAS and INGRID.

·         Even more data-processing power can be achieved on the server by implementing a comprehensive set of analysis tools such as those found in GrADS (Grid Analysis and Display System). George Mason University and the Center for Ocean-Land-Atmosphere Studies (COLA) have built an integrated GrADS/DODS (GDS) server that provides comprehensive data-analysis capabilities as part of the data server. As with DODS itself, it will be important to integrate THREDDS metadata systems into servers that use GDS.

·         Unidata’s IDD system, based on the LDM, for automated real-time data delivery to the distributed servers.

·         Analysis and display tools as discussed in the following subsection.

C.4.3. Client Analysis and Display Tools

The THREDDS prototype will provide examples of a wide variety of working applications that use our metadata framework to find, analyze, and display data from server sites.  This will demonstrate an end-to-end system for data access and visualization. The following developers will incorporate our client-side data-access components (class libraries and metadata access) into their own data manipulation tools:

·         Live Access Server (LAS, PMEL, Steve Hankin). LAS illustrates the use of a Web-based (thin) client with the bulk of the analysis and display generation done on the server side.

·         INGRID (IRI/LDEO, Benno Blumenthal). This is another example of a system enabling analysis and display of data via a Web browser.  As with LAS and GDS, INGRID provides substantial data-analysis capabilities.

·         WXWise applets (the University of Wisconsin-Madison, Tom Whittaker). These applets illustrate the use of Java to embed data-analysis and display tools directly into educational modules on a Web site.

·         The Virtual Exploratorium (the University of Illinois, West Chester State, DLESE, and NCAR, Don Middleton). This application incorporates the educational functions directly into the data analysis and display tool itself.

·         Data Discovery Toolkit and Foundry based on EDMI (Earth Data Multimedia Instrument, New Media Studio, Bruce Caron). These are a set of data-analysis and display tools based on IDL and Macromedia Director. They can be used to generate very elaborate educational modules.

·         MetApps (Unidata Program Center, Don Murray). A set of pure Java, platform-independent, two- and three-dimensional data-analysis and display tools—based on the VISAD infrastructure.

·         VISAD infrastructure from SSEC (Bill Hibbard of the University of Wisconsin-Madison in conjunction with the Unidata Program Center).

·         Others: Some software packages (MatLab, IDL, McIDAS, etc.) already have been adapted to acquire remote data via DODS or ADDE.  Even if these systems are not adapted to take direct advantage of PICats or other THREDDS advances, their users will benefit from data available on THREDDS servers.

C.4.4. Metadata Expertise

As noted elsewhere, the technological core of this proposal, the crucial component that has yet to be developed, is a system for adding the semantic description of scientific data sets necessary for data manipulation and discovery. It must interoperate with data providers, data servers, data clients, catalog servers, discovery systems, and other middleware components. Investigators will select key scientific data sets and develop semantic descriptions for them, for use in an end-to-end demonstration of the utility of this approach. Unidata staff will work closely with DLESE to ensure that the resulting metadata system will interoperate effectively with NSDL.

Partners with whom we will consult on matters of metadata and interoperability are:

·         The Earth System Markup Language (ESML, University of Alabama-Huntsville);

·         The DIstributed MEtadata System (DIMES, George Mason University);

·         The aggregation data catalog that is part of DODS (URI, Unidata);

·         DLESE;

·         The University of Florence (Italy), which will act as a liaison with the international metadata standards community.

C.5. Technical Underpinnings

The THREDDS system focuses on providing access to PICats through data visualization and analysis clients. This provides a searchable, distributed scientific data collection, as well as a data-analysis and display system that can use this collection. On top of this system, other components, such as educational resources, can be built.

 

The THREDDS system has four main components. First, data-access protocols such as DODS, HTTP, and FTP provide Internet access to scientific data sets, using URLs to name the data sets. These existing protocols are already in wide use by data servers in the scientific community. Second, the proposed PICats provide lists of available data sets and a framework for specifying the semantics of data sets, sometimes called “use metadata.” The PICats will contain Dataset Inventory and Dataset Description components. Third, the proposed PICat Servers are distributed processes that monitor a set of PICats and provide integrated discovery services. Fourth, existing visualization and analysis clients will be extended to connect to the PICats, PICat servers, and the data servers themselves.

C.5.1. Dataset Inventory

Each server will maintain an inventory of available data sets, called a Provider Dataset Inventory. At a minimum, this is a listing of available data sets, and an association of a human-readable name with the data set URL. Optionally, the data sets can be collected into meaningful hierarchies, and descriptive information and certain other semantics (e.g., time) can also be specified here. (See Dataset Inventory URL for a prototype example of a Dataset Inventory XML file.)

A Dataset Inventory will have a URL and will be delivered as an XML document via HTTP, so that it can be viewed in a browser. More importantly, an inventory can be referenced by other inventories, and can be read by PICat Servers and other distributed processes to construct special purpose inventories and specialized search and discovery services. Inventory generation and maintenance will be done by scripts that scan directories and generate XML files; data ingesters and decoders (e.g., in the LDM) that update inventories dynamically as data are received; and special purpose HTTP servers for more complex processing. These data inventories will range from very large collections spanning the entire contents of one or more server sites to small collections of data sets related to a specific educational topic. Furthermore, the inventories may reside at a server site, be built at a central digital-library site, or be integrated into educational materials such as a course description.
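The simplest of the generation methods named above, a script that scans a directory and writes an XML file, might look like the following minimal sketch. The element and attribute names (inventory, dataset, name, url) and the server address are hypothetical; the proposal does not fix a schema here, so this shows only the shape of such a script.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def build_inventory(data_dir, base_url, inventory_name):
    """Scan data_dir and return an XML inventory element listing its files.

    Each file becomes one <dataset> whose url is base_url plus the file's
    path relative to data_dir. The schema is an assumption for illustration.
    """
    root = ET.Element("inventory", {"name": inventory_name})
    for dirpath, dirnames, filenames in os.walk(data_dir):
        dirnames.sort()  # make the traversal order deterministic
        for filename in sorted(filenames):
            full_path = os.path.join(dirpath, filename)
            rel_path = os.path.relpath(full_path, data_dir).replace(os.sep, "/")
            ET.SubElement(root, "dataset", {
                "name": os.path.splitext(filename)[0],
                "url": base_url.rstrip("/") + "/" + rel_path,
            })
    return root


def demo_inventory():
    """Build an inventory for a small throwaway directory tree."""
    with tempfile.TemporaryDirectory() as tmp:
        os.makedirs(os.path.join(tmp, "sst"))
        for rel in ("sst/2001-03.nc", "sst/2001-04.nc", "winds.nc"):
            with open(os.path.join(tmp, *rel.split("/")), "w") as handle:
                handle.write("placeholder")
        return build_inventory(tmp, "http://data.example.edu/dods", "Example holdings")


if __name__ == "__main__":
    print(ET.tostring(demo_inventory(), encoding="unicode"))
```

Because the output is an ordinary XML document, the same file can be placed behind an HTTP server and read by a browser, another inventory, or a PICat Server.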

Since data sets are referenced by URLs (rather than filenames), third parties can construct inventories that reference data sets from any server. These new inventories can be built around particular themes (for educational purposes, for example) and use data set names and descriptions appropriate for those themes. These Themed Dataset Inventories use the same XML format as data provider inventories and so are equivalent from a user or client software perspective.
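As a minimal sketch of this idea, the code below builds a themed inventory that references data sets on two different servers and gives them names suited to the theme. The XML vocabulary, server addresses, and data set names are hypothetical; the point is only that the themed inventory is the same kind of document as the provider inventories it draws on.

```python
import xml.etree.ElementTree as ET

# Two hypothetical provider inventories, as they might be fetched over HTTP.
PROVIDER_A = """<inventory name="Provider A">
  <dataset name="sst_monthly_v2" url="http://a.example.edu/dods/sst_monthly_v2"/>
  <dataset name="slp_daily" url="http://a.example.edu/dods/slp_daily"/>
</inventory>"""

PROVIDER_B = """<inventory name="Provider B">
  <dataset name="goes_ir_4km" url="http://b.example.edu/adde/goes_ir_4km"/>
</inventory>"""


def themed_inventory(theme, provider_docs, selections):
    """Build a themed inventory from provider inventories.

    selections maps a provider data set URL to the theme-appropriate name
    to show for it. URLs that no provider publishes are silently skipped.
    """
    published = set()
    for doc in provider_docs:
        for dataset in ET.fromstring(doc).iter("dataset"):
            published.add(dataset.get("url"))
    root = ET.Element("inventory", {"name": theme})
    for url, friendly_name in selections.items():
        if url in published:
            ET.SubElement(root, "dataset", {"name": friendly_name, "url": url})
    return root


if __name__ == "__main__":
    theme = themed_inventory(
        "Classroom module: tropical Pacific",
        [PROVIDER_A, PROVIDER_B],
        {
            "http://a.example.edu/dods/sst_monthly_v2": "Sea surface temperature",
            "http://b.example.edu/adde/goes_ir_4km": "Satellite infrared imagery",
        },
    )
    print(ET.tostring(theme, encoding="unicode"))
```

The themed document stores only names and URLs, so it can live on a third party's server without copying any data.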

Themed Dataset Inventories can be thought of as logical “views” of Provider Dataset Inventories. We will support logical views of the data sets themselves whenever the underlying data-access protocol supports them. We currently are working with DODS developers to provide aggregation services that allow groups of DODS data sets to be viewed as a single “aggregated” set. This capability can greatly simplify the inventory cataloging for many important data holdings.

C.5.2. Dataset Description

Dataset Descriptions are referenced from Dataset Inventories and contain field names, coordinate systems, units, and other metadata. Since predicting the metadata needs of all users is impossible, we will concentrate on commonly used data sets that comprise a large percentage of the THREDDS data holdings: (1) gridded data, (2) satellite data, and (3) point data. In collaboration with our partners, we will construct an extensible framework in which other data sets can be included. (See Dataset Description:URL for a prototype example of an XML file for gridded data.) 

Clients cannot rely on guessing the semantics of a field based on its name. We will prototype sets of Standard Quantities that create a controlled vocabulary of physical quantities and describe their semantics as concisely as possible. By providing mappings from these standard quantities to the actual data-set field names, data can be presented or searched using standardized names. Again we will concentrate on commonly used data sets, and on creating an extensible framework for other data sets.
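The mapping from standard quantities to data-set field names can be pictured with a small sketch. The quantity names, units, data set URLs, and field names below are all hypothetical; no controlled vocabulary is defined in this proposal, so the tables stand in for whatever the prototype would adopt.

```python
# Hypothetical controlled vocabulary: standard quantity -> agreed semantics.
STANDARD_QUANTITIES = {
    "sea_surface_temperature": {"units": "kelvin"},
    "air_pressure_at_sea_level": {"units": "pascal"},
}

# Hypothetical per-data-set mappings: local field name -> standard quantity.
FIELD_MAPPINGS = {
    "http://a.example.edu/dods/sst_monthly_v2": {"sst": "sea_surface_temperature"},
    "http://a.example.edu/dods/slp_daily": {"SLP": "air_pressure_at_sea_level"},
    "http://c.example.edu/dods/ocean_model": {"TEMP_SFC": "sea_surface_temperature"},
}


def standard_name(dataset_url, field_name):
    """Return the standard quantity for a local field, or None if unmapped."""
    return FIELD_MAPPINGS.get(dataset_url, {}).get(field_name)


def find_fields(quantity):
    """Return sorted (dataset_url, field_name) pairs that hold the given quantity."""
    if quantity not in STANDARD_QUANTITIES:
        raise KeyError("unknown standard quantity: " + quantity)
    matches = []
    for url, fields in FIELD_MAPPINGS.items():
        for field_name, mapped in fields.items():
            if mapped == quantity:
                matches.append((url, field_name))
    return sorted(matches)


if __name__ == "__main__":
    for url, field in find_fields("sea_surface_temperature"):
        print(url, field)
```

Here a search for one standard quantity finds fields named "sst" and "TEMP_SFC" alike, which a client could not have inferred from the names alone.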

Dataset Descriptions will also be used to provide additional metadata not in the original data set. This will be a powerful tool in the creation of specialized views of data sets and the creation of themed inventories. We will explore the creation of data servers that transparently incorporate Dataset Description metadata.
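
A minimal sketch of such augmentation follows, assuming metadata are held as per-field attribute dictionaries and that the Dataset Description takes precedence where both sources define a value. Both the representation and the precedence rule are assumptions made for the example, not design decisions.

```python
def augment(native_metadata, description_metadata):
    """Overlay Dataset Description metadata on a data set's native metadata.

    Attributes from the description win where both define a value
    (an assumed precedence rule); neither input is modified.
    """
    merged = {field: dict(attrs) for field, attrs in native_metadata.items()}
    for field, attrs in description_metadata.items():
        merged.setdefault(field, {}).update(attrs)
    return merged


# Hypothetical metadata as found in the original data set.
native = {"T": {"units": "degK"}, "u": {"units": "m/s"}}

# Hypothetical additions and corrections supplied by a Dataset Description.
description = {
    "T": {"units": "K", "standard_quantity": "air_temperature"},
    "lat": {"units": "degrees_north"},
}

augmented = augment(native, description)
```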

C.5.3. PICat Servers

A PICat is an Inventory with associated Descriptors and Standard Quantities, all of which are XML documents available through URLs. The XML files must be readable and simple to generate, allowing small data collections to be included in THREDDS.

A PICat Server is an application that monitors a specified list of PICats. It is itself a PICat constructed from the union of its catalogs, and also provides other views of the catalog through grouping and subsetting. These services are provided through a simple API that is used by application GUIs or Web servers (via servlets) to present sophisticated data-set selection to users. PICat Servers will also provide a standard HTML representation of the catalog information using XSLT transformations on the XML files. This will provide a low-level, but key, browser-based interface to the catalog with clickable links to the data sets.
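
The union, grouping, and subsetting services might look roughly like the sketch below. It assumes PICats have already been parsed into lists of records and that two records with the same URL describe the same data set; the record fields, the sample entries, and the function names are hypothetical and do not represent the proposed API.

```python
from collections import defaultdict

# Hypothetical, already-parsed PICats: each is a list of data set records.
PICAT_A = [
    {"name": "Eta 12 km grid", "category": "gridded",
     "url": "http://a.example.edu/eta"},
    {"name": "GOES-8 visible", "category": "satellite",
     "url": "http://a.example.edu/goes8"},
]
PICAT_B = [
    {"name": "Surface observations", "category": "point",
     "url": "http://b.example.edu/metar"},
    {"name": "GOES-8 visible (second listing)", "category": "satellite",
     "url": "http://a.example.edu/goes8"},
]


def union_catalog(picats):
    """Merge PICats, keeping the first record seen for each data set URL."""
    merged = {}
    for picat in picats:
        for record in picat:
            merged.setdefault(record["url"], record)
    return list(merged.values())


def group_by(records, key):
    """Group data set names by a metadata key, for example "category"."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record["name"])
    return dict(groups)


def subset(records, **criteria):
    """Return the records whose metadata match every given key/value pair."""
    return [r for r in records
            if all(r.get(k) == v for k, v in criteria.items())]


catalog = union_catalog([PICAT_A, PICAT_B])
by_category = group_by(catalog, "category")
satellite_only = subset(catalog, category="satellite")
```

Keying the union on the URL is one simple way to keep a data set that appears in several PICats from being listed twice; a real server might need a more careful notion of identity.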

The THREDDS PICat Server is a distributed system that will run at Unidata and on other THREDDS servers, each of which monitors a list of PICats and propagates changes to the other servers. A client can connect to any THREDDS PICat Server for services and get (approximately) the same information. Sites providing data through a PICat will not have to run a PICat Server to make their holdings available through THREDDS; they will be able to register with an existing server and have it advertise their holdings.

In the initial phase we will concentrate on enabling search and discovery services on three aspects of the data: category, space/time region, and standard quantities. The goal is to allow these searches using just the information in the PICats. More specialized search services can be constructed but may involve querying the data sets themselves. (Constructing these services is beyond the scope of this proposal.) We will explore the use of ADL’s geospatial technology for space/time searching (Frew et al., 1996).
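
To make the three search aspects concrete, the following sketch filters catalog entries by category, standard quantity, and space/time overlap using only metadata that a PICat could carry. The `Entry` fields, the sample entries, and the simple bounding-box test are assumptions for illustration; this is not ADL's geospatial technology, and it ignores regions that cross the date line.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Entry:
    """Hypothetical per-data-set metadata carried in a PICat."""
    name: str
    category: str
    quantities: frozenset
    bbox: tuple  # (west, south, east, north) in degrees
    start: datetime
    end: datetime


def boxes_overlap(a, b):
    """True if two (west, south, east, north) boxes intersect.

    Simplification: boxes that cross the date line are not handled.
    """
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def search(entries, category=None, quantity=None, bbox=None,
           start=None, end=None):
    """Filter entries using catalog metadata only; no data set is opened."""
    results = []
    for e in entries:
        if category is not None and e.category != category:
            continue
        if quantity is not None and quantity not in e.quantities:
            continue
        if bbox is not None and not boxes_overlap(e.bbox, bbox):
            continue
        if start is not None and e.end < start:
            continue
        if end is not None and e.start > end:
            continue
        results.append(e)
    return results


ENTRIES = [
    Entry("North America grid", "gridded", frozenset({"air_temperature"}),
          (-140.0, 10.0, -50.0, 70.0),
          datetime(2001, 4, 1), datetime(2001, 4, 30)),
    Entry("European surface obs", "point", frozenset({"air_temperature"}),
          (-10.0, 35.0, 30.0, 70.0),
          datetime(2001, 4, 1), datetime(2001, 4, 30)),
]

hits = search(ENTRIES, quantity="air_temperature",
              bbox=(-110.0, 30.0, -90.0, 45.0),
              start=datetime(2001, 4, 10), end=datetime(2001, 4, 12))
```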

A client application can connect to a specific PICat, or it can connect to the PICat Server, which provides a comprehensive inventory of THREDDS data sets. A user will be able to view available data by category, select the data set using a meaningful name, and view or download the data set. We will use Unidata’s visualization applications (Unidata:MetApps URL) to prototype this capability, and work with developers to transfer the technology to other clients.

C.5.4. Digital Library Services

The PICat Server is a digital library service that enables the discovery of scientific data-set inventories and allows searching by category, standard quantity, and possibly space and time regions. An online digital library can include a user interface (or portal) to the THREDDS PICat Server, allowing users to search, discover, and download scientific data sets from a browser. A specially configured browser can pop up a "helper application" such as MetApps that downloads the selected data set and allows the user to view it interactively. We will be working with DLESE to prototype these capabilities within its Web interface.

The Dataset Descriptions themselves contain useful information that will likely be included in scientific digital-library catalogs (e.g., DLESE), especially those referring to important, commonly used data sets. THREDDS data sets can be used as building blocks in the construction of educational resources, using THREDDS-enabled clients such as LAS, WXWise, EMDI, and MetApps. The PICat Server will be useful for finding appropriate and available data sets for these resources.

The information in the PICat will enable more elaborate search mechanisms, but these are outside the scope of this proposal. However, Themed Dataset Inventories and other types of third-party organization and augmentation of standard data-set holdings will be very valuable resources.

C.5.5. The Known Challenges

Data-set semantics. It is well known that specifying adequate data-set semantics is difficult, and specifying all metadata needed for all purposes is impossible. We will concentrate on important subsets of this problem (georeferencing, standard units/quantities in gridded data, satellite- and point-data sets) that we are confident are amenable to solution now and that will be an important advance over the current state. We will then be able to include other data sets within that framework and/or generalize the framework.

Data provider / consumer needs.  Closely related to the problem of specifying adequate metadata is the difference between the needs of data providers, who prefer simple, minimal metadata standards, and the needs of data consumers, who may need rather complex metadata, especially for search and discovery. We expect to focus our standardization efforts on data provider metadata that will support distributed data generation, at the same time providing the “middleware” to generate the more complex metadata needed by centralized data catalogs and search facilities that use sophisticated metadata standards.

Data-set persistence. Data sets can be dynamic, and references (URLs) to them are easily broken. Inventory-generating tools will need to know how often inventory catalogs need updating.

Catalog synchronization. The PICat Server will be distributed over several (potentially many) host machines. Creating a consistent view of the data sets is non-trivial, but this problem has been well studied and solutions are known (Birman, 1996).

C.6. Technical Statement of Work and Milestones

 
 

 

Table 1. Statement of Work with Milestones

 

Schedule of Proposed Work

Year One

Year Two

First Half

Second Half

First Half

Second Half

Meetings: THREDDS Technical Task Force (T3F)

v

 

 

v

     Stakeholders

 

v

 

 

Dataset Inventory Component

 

 

 

 

     Define XML specification

 

 

 

 

         Prototype PICat auto-generator, real-time data-stream decoder

 

 

 

 

         Implement on testbed servers (NCAR & SSEC)

 

 

 

 

         Prototype Java API and demo use in MetApps

 

 

 

 

         Prototype data-set views for DODS dataset aggregation

 

 

 

 

Dataset Descriptor Component

 

 

 

 

     Define XML specification for gridded data