John Caron, Ethan Davis, and Dave Fulker
April 2001
The THREDDS system focuses on providing access to PICats through data visualization and analysis clients. This provides a searchable, distributed scientific data collection, as well as a data-analysis and display system that can use this collection. On top of this system, other components, such as educational resources, can be built.
The THREDDS system has four main components. First, data-access protocols such as DODS, HTTP, and FTP provide Internet access to scientific data sets, using URLs to name the data sets. These existing protocols are already in wide use by data servers in the scientific community. Second, the proposed PICats provide lists of available data sets and a framework for specifying the semantics of data sets, sometimes called “use metadata.” The PICats will contain Dataset Inventory and Dataset Description components. Third, the proposed PICat Servers are distributed processes that monitor a set of PICats and provide integrated discovery services. Fourth, existing visualization and analysis clients will be extended to connect to the PICats, PICat servers, and the data servers themselves.
Each server will maintain an inventory of available data sets, called a Provider Dataset Inventory. At a minimum, this is a listing of available data sets, and an association of a human-readable name with the data set URL. Optionally, the data sets can be collected into meaningful hierarchies, and descriptive information and certain other semantics (e.g., time) can also be specified here. (See Dataset Inventory URL for a prototype example of a Dataset Inventory XML file.)
A Dataset Inventory will have a URL and will be delivered as an XML document via HTTP, so that it can be viewed in a browser. More importantly, an inventory can be referenced by other inventories, and can be read by PICat Servers and other distributed processes to construct special-purpose inventories and specialized search and discovery services. Inventory generation and maintenance will be done by scripts that scan directories and generate XML files; data ingesters and decoders (e.g., in the LDM) that update inventories dynamically as data are received; and special-purpose HTTP servers for more complex processing. These themed data inventories will range from very large collections spanning the entire contents of one or more server sites to small collections of data sets related to a specific educational topic. Furthermore, the inventories may reside at a server site, be built at a central digital-library site, or be integrated into educational materials such as a course description.
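The simplest of the generation paths described above, a script that scans a directory and emits an XML inventory, might look roughly like the following Python sketch. The element and attribute names (`inventory`, `dataset`, `name`, `url`) and the base URL are assumptions made for illustration only; they are not the actual Dataset Inventory format.

```python
import os
import xml.etree.ElementTree as ET


def build_inventory(directory, base_url, name):
    """Scan `directory` and return an XML element listing the files found.

    Element and attribute names are hypothetical, not the THREDDS schema.
    """
    root = ET.Element("inventory", {"name": name})
    for filename in sorted(os.listdir(directory)):
        path = os.path.join(directory, filename)
        if not os.path.isfile(path):
            continue  # this flat sketch ignores subdirectories
        ET.SubElement(root, "dataset", {
            # human-readable name: here simply the file name without extension
            "name": os.path.splitext(filename)[0],
            # data sets are named by URL rather than by local file name
            "url": base_url.rstrip("/") + "/" + filename,
        })
    return root


def inventory_to_xml(root):
    """Serialize the inventory element to an XML string."""
    return ET.tostring(root, encoding="unicode")
```

A provider could run such a script periodically and publish the resulting string through its HTTP server; a hierarchical inventory would need a recursive variant of the same idea.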
Since data sets are referenced by URLs (rather than filenames), third parties can construct inventories that reference data sets from any server. These new inventories can be built around particular themes (for educational purposes, for example) and use data set names and descriptions appropriate for those themes. These Themed Dataset Inventories use the same XML format as data provider inventories and so are equivalent from a user or client software perspective.
Themed Dataset Inventories can be thought of as logical “views” of Provider Dataset Inventories. We will support logical views of the data sets themselves whenever the underlying data-access protocol supports them. We currently are working with DODS developers to provide aggregation services that allow groups of DODS data sets to be viewed as a single “aggregated” set. This capability can greatly simplify the inventory cataloging for many important data holdings.
Dataset Descriptions are referenced from Dataset Inventories and contain field names, coordinate systems, units, and other metadata. Since predicting the metadata needs of all users is impossible, we will concentrate on commonly used data sets that comprise a large percentage of the THREDDS data holdings: (1) gridded data, (2) satellite data, and (3) point data. In collaboration with our partners, we will construct an extensible framework in which other data sets can be included. (See Dataset Description:URL for a prototype example of an XML file for gridded data.)
Clients cannot rely on guessing the semantics of a field based on its name. We will prototype sets of Standard Quantities that create a controlled vocabulary of physical quantities and describe their semantics as concisely as possible. By providing mappings from these standard quantities to the actual data-set field names, data can be presented or searched using standardized names. Again we will concentrate on commonly used data sets, and on creating an extensible framework for other data sets.
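As a rough illustration of the mapping idea, the Python sketch below pairs a small controlled vocabulary with per-data-set field names. All quantity names, units, data-set identifiers, and field names here are invented for the example; the actual Standard Quantities remain to be prototyped.

```python
# Hypothetical controlled vocabulary: standard quantity -> canonical units.
STANDARD_QUANTITIES = {
    "air_temperature": "K",
    "air_pressure": "Pa",
}

# Hypothetical mappings: data set -> {standard quantity -> actual field name}.
DATASET_FIELD_MAPS = {
    "model_grid_a": {"air_temperature": "T", "air_pressure": "P"},
    "station_obs_b": {"air_temperature": "tmpk"},
}


def field_for(dataset_id, quantity):
    """Return the data set's own field name for a standard quantity, or None."""
    if quantity not in STANDARD_QUANTITIES:
        raise KeyError("not a standard quantity: " + quantity)
    return DATASET_FIELD_MAPS.get(dataset_id, {}).get(quantity)


def datasets_with(quantity):
    """Return the sorted ids of data sets that provide a standard quantity."""
    return sorted(
        dataset_id
        for dataset_id, mapping in DATASET_FIELD_MAPS.items()
        if quantity in mapping
    )
```

With mappings of this kind, a client can search for `air_temperature` without knowing that one data set calls the field `T` and another calls it `tmpk`.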
Dataset Descriptions will also be used to provide additional metadata not in the original data set. This will be a powerful tool in the creation of specialized views of data sets and the creation of themed inventories. We will explore the creation of data servers that transparently incorporate Dataset Description metadata.
A PICat is an Inventory with associated Descriptors and Standard Quantities, all of which are XML documents available through URLs. The XML files must be readable and simple to generate, allowing small data collections to be included in THREDDS.
A PICat Server is an application that monitors a specified list of PICats. It is itself a PICat constructed from the union of its catalogs, and also provides other views of the catalog through grouping and subsetting. These services are provided through a simple API that is used by application GUIs or Web servers (via servlets) to present sophisticated data-set selection to users. PICat Servers will also provide a standard HTML representation of the catalog information using XSLT transformations on the XML files. This will provide a low-level, but key, browser-based interface to the catalog with clickable links to the data sets.
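The union, grouping, and subsetting views can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: each catalog is represented as a list of dictionaries with a `url` key plus arbitrary descriptive keys, and duplicates are resolved by keeping the first entry seen for a URL. None of the function names below belong to the proposed PICat Server API.

```python
from collections import defaultdict


def union_catalog(catalogs):
    """Merge several catalogs, keeping the first entry seen for each URL."""
    merged = {}
    for catalog in catalogs:
        for entry in catalog:
            merged.setdefault(entry["url"], entry)
    return list(merged.values())


def group_by(entries, key):
    """Group entries by the value of `key`; entries lacking it go to 'unknown'."""
    groups = defaultdict(list)
    for entry in entries:
        groups[entry.get(key, "unknown")].append(entry)
    return dict(groups)


def subset(entries, **criteria):
    """Return the entries whose fields equal every given criterion."""
    return [
        entry for entry in entries
        if all(entry.get(k) == v for k, v in criteria.items())
    ]
```

A GUI or servlet could call `group_by(union_catalog(catalogs), "category")` to build a category browser, then `subset(...)` to narrow the selection.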
The THREDDS PICat Server is a distributed system that will run at Unidata and on other THREDDS servers, each of which monitors a list of PICats and propagates changes to the other servers. A client can connect to any THREDDS PICat Server for services and get (approximately) the same information. Sites providing data through a PICat will not have to run a PICat Server to make their holdings available through THREDDS; they will be able to register with an existing server and have it advertise their holdings.
In the initial phase we will concentrate on enabling search and discovery services on three aspects of the data: category, space/time region, and standard quantities. The goal is to allow these searches using just the information in the PICats. More specialized search services can be constructed but may involve querying the data sets themselves. (Constructing these services is beyond the scope of this proposal.) We will explore the use of ADL’s geospatial technology for space/time searching (Frew et al., 1996).
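A search over these three aspects that uses only catalog-level information could be sketched as follows in Python. The `Entry` fields, the bounding-box convention (west, south, east, north in degrees), and the function names are assumptions for illustration; the sketch also ignores regions that cross the antimeridian and is not based on ADL's technology.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Entry:
    """Hypothetical catalog-level summary of one data set."""
    name: str
    category: str
    bbox: tuple          # (west, south, east, north) in degrees
    start: datetime
    end: datetime
    quantities: frozenset = frozenset()


def boxes_overlap(a, b):
    """True when two (west, south, east, north) boxes intersect."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def search(entries, category=None, bbox=None, time_range=None, quantity=None):
    """Filter entries; any criterion left as None is not applied."""
    results = []
    for e in entries:
        if category is not None and e.category != category:
            continue
        if bbox is not None and not boxes_overlap(e.bbox, bbox):
            continue
        if time_range is not None and not (
            e.start <= time_range[1] and time_range[0] <= e.end
        ):
            continue
        if quantity is not None and quantity not in e.quantities:
            continue
        results.append(e)
    return results
```

Because every test reads only fields stored in the catalog entry, no data set has to be opened to answer the query, which is the goal stated above.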
A client application can connect to a specific PICat, or it can connect to the PICat Server, which provides a comprehensive inventory of THREDDS data sets. A user will be able to view available data by category, select the data set using a meaningful name, and view or download the data set. We will use Unidata’s visualization applications (Unidata:MetApps URL) to prototype this capability, and work with developers to transfer the technology to other clients.
The PICat Server is a digital library service that enables the discovery of scientific data-set inventories, and allows searching by category, standard quantity, and possibly space and time regions. An online digital library can include a user interface (or portal) to the THREDDS PICat Server, allowing users to search, discover, and download scientific data sets from a browser. A specially configured browser can pop up a “helper application” such as MetApps that downloads the selected data set and allows the user to view it interactively. We will be working with the DLESE to prototype these capabilities within its Web interface.
The Dataset Descriptions themselves contain useful information that will likely be included in scientific digital-library catalogs (e.g., DLESE), especially those referring to important commonly used data sets. THREDDS data sets can be used as building blocks in the construction of educational resources, using THREDDS-enabled clients such as LAS, WXWise, EMDI, MetApps, etc. The PICat Server will be useful to find appropriate and available data sets for these resources.
The information in the PICat will enable more elaborate search mechanisms, but these are outside the scope of this proposal. However, Themed Dataset Inventories and other types of third-party organization and augmentation of standard data-set holdings will be very valuable resources.
Data-set semantics. It is well known that specifying adequate data-set semantics is difficult, and specifying all metadata needed for all reasons is impossible. We will concentrate on important subsets of this problem (georeferencing, standard units/quantities in gridded data, satellite- and point-data sets) that we are confident are amenable to solution now and will be an important advance from the current state. We will then be able to include other data sets within that framework and/or generalize the framework.
Data provider / consumer needs. Closely related to the problem of specifying adequate metadata is the difference between the needs of data providers, who prefer simple, minimal metadata standards, and the needs of data consumers, who may need rather complex metadata, especially for search and discovery. We expect to focus our standardization efforts on data provider metadata that will support distributed data generation, at the same time providing the “middleware” to generate the more complex metadata needed by centralized data catalogs and search facilities that use sophisticated metadata standards.
Data-set persistence. Data sets can be dynamic, and references (URLs) to them are easily broken. Inventory-generating tools will need to know how often inventory catalogs need updating.
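One simple way an inventory-generating tool might track how often catalogs need updating is a per-inventory refresh interval, as in the Python sketch below. The inventory identifiers and interval values are invented for illustration, and a real tool would likely also need to detect broken URLs, which this sketch does not attempt.

```python
from datetime import datetime, timedelta

# Hypothetical refresh intervals; real values depend on how dynamic the data are.
UPDATE_INTERVALS = {
    "realtime_model_output": timedelta(hours=1),
    "archived_case_studies": timedelta(days=30),
}


def is_stale(inventory_id, last_generated, now):
    """True when the inventory is at least as old as its refresh interval."""
    return now - last_generated >= UPDATE_INTERVALS[inventory_id]


def stale_inventories(last_generated_by_id, now):
    """Return the sorted ids of inventories that are due for regeneration."""
    return sorted(
        inventory_id
        for inventory_id, generated in last_generated_by_id.items()
        if is_stale(inventory_id, generated, now)
    )
```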
Catalog synchronization. The PICat Server will be distributed over several (potentially many) host machines. Creating a consistent view of the data sets is non-trivial, but this problem has been well studied and solutions are known (Birman, 1996).