John Caron, Ethan Davis, D. R. Murray, and R. K. Rew
Unidata Program Center
University Corporation for Atmospheric Research
Boulder, Colorado
With ever-increasing amounts of data being generated by new observing systems and models, the meteorological community is experimenting with methods for delivering these data to those who need them for research and education. This paper presents an overview of current data distribution systems and identifies some of their limitations, with an eye towards their future evolution. We examine Unidata's Internet Data Distribution (IDD) system in some detail. We then specify what we believe are desirable goals for future data systems and the impediments to achieving those goals. Finally, we summarize some technologies that we have been tracking at Unidata and that should be considered when looking for solutions to data distribution.
Data distribution systems can be placed in two categories. Data push systems deliver data to clients as the server receives it, based on what data each client has subscribed to. These systems work best for real-time data. The servers need only know enough about the data to categorize it before sending it on to their clients. Unidata's IDD is a push system, which we describe in the next section with detailed attention to some of the limitations that need to be addressed soon.
Data pull systems deliver data only upon request, which is the traditional client-server model used, for example, by Web servers. These systems work best for archival data and for very large data sets. The data often needs to be made available to the client in a variety of ways, and so the server's metadata needs are more sophisticated. In section 2.2 we briefly describe a representative data pull system, the Distributed Ocean Data System (DODS).
Unidata's Internet Data Distribution (IDD) supplies near real-time data continuously to over 160 universities and research organizations, currently delivering an aggregate of nearly 200 Gbytes/day.
There is no data center; rather, data is injected from multiple sources into the distribution system and flows through intermediate relay sites to multiple destinations, via cooperating systems running on geographically dispersed workstations. Each IDD site runs Unidata Local Data Management (LDM) software to receive data from upstream sites and to pass data along to downstream sites.
The LDM supports a notion of a feedtype identifying a distinct data stream with its own conventions for naming data. This is used for the coarse classification of data and for subscription requests. The data for any particular feedtype is a sequence of data products, each of which contains a product identifier, a feedtype, a time stamp and the array of bytes that comprises the product data. Each LDM maintains a product queue, which is a local collection of data products that have arrived from upstream sources. Typically, the most recent hour's worth of data is kept in each product queue to provide reliability in redistributing the data to downstream sites in the presence of outages and congestion. The architecture is event-driven, so that each data product flows through the system as soon as it is available from the source, rather than waiting for scheduled delivery or relying on polling.
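The product and product queue just described can be sketched in a few lines of Python. This is a minimal illustration and not the LDM implementation: the class names, the in-memory deque, the regular-expression subscriptions, and the example identifiers are all assumptions made for the sketch. It shows the two properties the text emphasizes: products are handed to subscribers as soon as they arrive, and roughly the last hour of products is retained so that a downstream site can catch up after an outage.

```python
import re
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class DataProduct:
    product_id: str   # product identifier, named per the feedtype's conventions
    feedtype: str     # coarse classification used for subscriptions
    timestamp: float  # arrival time, in seconds
    data: bytes       # opaque payload; the queue never inspects it


class ProductQueue:
    """Hypothetical in-memory queue keeping about one hour of products."""

    def __init__(self, retention_seconds=3600.0):
        self.retention_seconds = retention_seconds
        self._products = deque()
        self._subscribers = []  # (feedtype, compiled id pattern, callback)

    def subscribe(self, feedtype, id_pattern, callback):
        """Register interest in one feedtype, filtered by product identifier."""
        self._subscribers.append((feedtype, re.compile(id_pattern), callback))

    def insert(self, product):
        """Store the product, drop expired ones, and notify subscribers at once."""
        self._products.append(product)
        self._expire(product.timestamp)
        for feedtype, pattern, callback in self._subscribers:
            if feedtype == product.feedtype and pattern.search(product.product_id):
                callback(product)

    def _expire(self, now):
        # Assumes products arrive in roughly increasing timestamp order.
        while self._products and now - self._products[0].timestamp > self.retention_seconds:
            self._products.popleft()

    def since(self, timestamp):
        """Products newer than `timestamp`, e.g. for a site recovering from an outage."""
        return [p for p in self._products if p.timestamp > timestamp]
```

A downstream site that reconnects after a short outage would call `since()` with the time stamp of the last product it received. Anything older than the retention window is no longer available from the queue.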
Although the IDD currently delivers data reliably, and is handling recent increases in the amount of data, there are indications that the LDM is nearing its limit. The LDM will need to be redesigned if it is to handle the many new (and large) data streams soon to become available. Such a redesign provides opportunities to consider new technologies that might be useful in a future IDD. Most of the limitations with the LDM are problems of scaling, which we categorize in the following ways:
1. Number of data products: When the LDM was first deployed, it handled about 8,000 data products per hour. With the future availability of NIDS products and other new data feeds, support is needed for at least an order of magnitude more data products in local product queues. Insertion and deletion of products in the LDM's product queue currently take time that depends on the number of products already in the queue, which slows the LDM unacceptably as the number of products grows.
2. Data volume: The LDM product queue uses 32-bit integer offsets to locate data products within the queue. This means aggregate queue sizes cannot be larger than about 4 Gbytes. With the large sizes of high-resolution imagery and high-resolution model output, as well as the multiplying effect of ensemble model outputs, 4 Gbytes is too small a limit on queue size.
3. Number of data feeds: The current implementation is limited to 32 feedtypes, and uses a bit set for quickly testing membership in unions and intersections of feedtype sets. We are currently using about 21 feedtypes, and with the new data sources coming online, we will soon run out.
4. Number of data sources and sinks: As the number of data feeds and injection locations grows, more routing trees must be maintained for a larger number of IDD participants. Injecting the outputs from regional models and Level II 88-D radar data, along with customized subscriptions to each of these sources, will further complicate IDD routing. In the near future there may be many more IDD sites as high-bandwidth Internet connections become ubiquitous. These trends point to a requirement for dynamic, automatic routing to replace the current static, manual configuration of routing trees for each data feed; how to implement this remains an open problem.
5. Data archival: The LDM was designed to handle outages of up to an hour or two, to make the delivery of near real-time data reliable in the face of network outages, power outages, congestion, and machine crashes. Its current design does not make it suitable for longer-term archival of data.
6. Priority data: The current LDM does not support a notion of product priorities, so small but urgent data products are given no processing priority over larger but less important data products.
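Two of the limits in the list above, the 32-feedtype bit set and the 32-bit queue offsets, follow directly from integer widths. The sketch below shows the arithmetic. The feedtype names and bit assignments are hypothetical and are not the LDM's actual definitions, and the offsets are assumed to be unsigned byte offsets.

```python
FEEDTYPE_BITS = 32  # width of the feedtype bit set described above

# Hypothetical feedtype names, each assigned one bit position.
FEEDTYPES = {"TEXT": 0, "GRID": 1, "IMAGE": 2, "RADAR": 3}


def feed_mask(*names):
    """Return the union of the named feedtypes as a single integer bit mask."""
    mask = 0
    for name in names:
        bit = FEEDTYPES[name]
        if not 0 <= bit < FEEDTYPE_BITS:
            raise ValueError(f"feedtype {name!r} does not fit in {FEEDTYPE_BITS} bits")
        mask |= 1 << bit
    return mask


def matches(product_feed, subscription):
    """True when the product's feedtype set intersects the subscription set."""
    return (product_feed & subscription) != 0


# Largest queue addressable with an unsigned 32-bit byte offset: 4 Gbytes.
MAX_QUEUE_BYTES = 2 ** 32
```

Testing a subscription is a single bitwise AND, which is why the bit set is fast. The cost is that a 33rd feedtype cannot be represented without widening the mask.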
Data pulling means that a data server returns data in response to a client request. When the server returns entire files, we have a distributed file system, which, in the absence of concurrent updates, can be implemented in a straightforward manner. When the server actually understands the contents of the files, and can manipulate them on behalf of the client (for example by extracting subsets), we have a distributed data system. There are a number of such systems important to the meteorological community, for example the ADDE data system used in McIDAS. Here we limit ourselves to describing another data system we have worked closely with, the Distributed Ocean Data System (DODS).
DODS is a framework for developing distributed data servers and clients. It uses HTTP for client/server communications and data transport, with servers implemented as CGI programs. Data files are named by URLs, although there are currently no built-in catalog services. DODS can subset data based on constraint expressions specified by the client.
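As an illustration of subsetting by constraint expression, the sketch below builds a request URL that asks for part of an array. The helper names, the server URL, and the variable names are hypothetical. The bracketed `[start:stride:stop]` form is meant to follow the array projection style used in DODS constraint expressions; it should be checked against the server's documentation before being relied on.

```python
def hyperslab(name, *ranges):
    """Project one array variable, with a (start, stride, stop) triple per dimension."""
    return name + "".join(f"[{start}:{stride}:{stop}]" for start, stride, stop in ranges)


def constrained_url(dataset_url, projections):
    """Append a constraint expression made of comma-separated projections."""
    return f"{dataset_url}?{','.join(projections)}"


# Hypothetical dataset: first time step of "sst", every second column from 10 to 20.
url = constrained_url(
    "http://example.org/dods/sst.nc",
    [hyperslab("sst", (0, 1, 0), (10, 2, 20)), "time"],
)
```

Because the subset is named entirely in the URL, the client needs no knowledge of the file format on the server, and only the requested values cross the network.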
DODS defines a data model that is a superset of the netCDF data model, adding structures, sequences (relational tables), functions (a sequence with dependent and independent variables), and grids (arrays with coordinate variables called map vectors). Attributes can be nested in containers and can be aliased. Metadata can be added transparently in auxiliary files. This general data model allows DODS servers to read data from a variety of data formats (e.g., netCDF, HDF, JGOFS, DSP, Matlab, FreeForm) and translate the data being served into the DODS data model. On the other end, the DODS clients translate from the DODS data model into the data model of the client API (i.e., netCDF, JGOFS, or DODS C++). These translations can be problematic when there are semantic mismatches between the original data and the client API.
As developers of meteorological applications in Java, Unidata has the opportunity to see data distribution from the point of view of data consumers, i.e., users. From the user's perspective, we offer the following as some of the desirable characteristics of a future data distribution system:
Data naming: Data should be named in ways that the user can understand.
Data discovery: It should be possible to find out what is available from characteristics of the desired data, such as geospatial coverage and physical parameter, rather than having to know server names, URLs, or directory and file names.
Data location transparency/optimization: Naïve and casual users should be able to access data without being overly concerned about where it is located. Power users and system administrators should be able to specify background data movement to optimize routine or frequent tasks. Intelligent caching and replication should make moving small amounts of data transparent.
Data mobility: Large data sets should be easily moved by the user from a remote server to local disk. The user should be able to estimate the cost (time/size) of moving.
Data subsetting: It should be possible to access a subset or cross section of a large data set efficiently, without accessing all of the data in the large data set.
File format independence: Data should be transparently converted from the underlying file format. The need to know the details of file formats should become as infrequent as the need to know the IEEE floating-point format.
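The cost estimate mentioned under data mobility can be as simple as dividing size by bandwidth. The function below is a back-of-the-envelope sketch: its name and the assumption of a steady, uncontended link are illustrative, and real transfers will be slower because of protocol overhead and congestion.

```python
def transfer_seconds(size_bytes, bandwidth_bits_per_second):
    """Idealized transfer time: payload bits divided by link speed."""
    if bandwidth_bits_per_second <= 0:
        raise ValueError("bandwidth must be positive")
    return size_bytes * 8 / bandwidth_bits_per_second


# Example: a 1 Gbyte data set over a 10 Mbit/s link.
estimate = transfer_seconds(1_000_000_000, 10_000_000)
```

Under these assumptions the example works out to 800 seconds, a little over 13 minutes. A figure like this is enough for a user to decide between moving the whole data set and requesting a subset.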
These goals can be characterized as the user asking to be isolated in an appropriate way from the low-level details of data distribution and storage. The user, whether research scientist or student, has a mental model of their task and the meaning of the data necessary to it. Successful applications present data options at the user's semantic level, hiding the details of file names, storage location, etc. High-level features such as calculated data fields, image remapping, and coordinate transformations require that applications understand the semantics of the data, sometimes even better than the user.
The LDM is a message-oriented service, and knows nothing about the content of those messages beyond the product identifiers that allow filtering and forwarding. Data servers such as DODS go much further in understanding the semantics of the data. DODS in particular has made great strides in mapping existing files and APIs to the DODS data model, giving applications access to a wide set of data. Unfortunately, much of the meaning embodied in existing data is implicit, and not stored explicitly in the file, the API, a set of conventions, or anywhere else accessible to application software. This implies that DODS, or any data server, will not completely succeed (certainly not in a general way) in providing data at the user's semantic level unless and until the community of data producers and consumers agrees on metadata and the user-level semantics of the metadata.
The netCDF API, for example, provides 1D coordinate variables. But even distinguishing the spatial or temporal coordinates requires metadata conventions external to the file itself. While significant groups of data providers have developed conventions to solve some of their specific needs, these conventions are not generally recognized by application software. Nor do these conventions typically solve common problems in a general way, for example, coordinate systems that cannot be described with coordinate variables.
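To make the dependence on conventions concrete, the sketch below guesses what kind of axis a coordinate variable represents from its `units` attribute alone. The function name and the particular strings it recognizes are assumptions, written in the style of unit-based conventions such as COARDS. A file that follows a different convention, or none, falls through to "unknown", which is exactly the failure mode described above.

```python
LATITUDE_UNITS = {"degrees_north", "degree_north", "degrees_n", "degree_n"}
LONGITUDE_UNITS = {"degrees_east", "degree_east", "degrees_e", "degree_e"}
PRESSURE_UNITS = {"hpa", "mb", "millibar", "pa"}


def classify_axis(units):
    """Guess the axis type of a coordinate variable from its units string."""
    text = units.strip().lower()
    if " since " in text:          # e.g. "hours since 1999-01-01 00:00"
        return "time"
    if text in LATITUDE_UNITS:
        return "latitude"
    if text in LONGITUDE_UNITS:
        return "longitude"
    if text in PRESSURE_UNITS:     # pressure used as a vertical coordinate
        return "vertical"
    return "unknown"               # nothing in the file says what this axis means
```

Nothing in the netCDF API itself enforces or interprets these strings. The meaning lives entirely in an agreement between the data writer and the application.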
Humans are not constrained by the syntactic limitations of file formats and APIs, and so often understand the implicit semantics of the data. These semantics typically are made explicit in data decoding software, or in the visualization/analysis application itself. These applications are either specialized to certain data sets or have sections of code with switch statements that "just know" what the metadata means. The result is that applications depend on the details of data formats and ad hoc conventions.
The central challenge for future data systems is to assign user-level semantics to data in a way that can be used across applications. This requires community agreements and involves social factors as much as technical ones.
One promising technical avenue that Unidata and its collaborators are exploring is the specification of fundamental meteorological data structures. If we can evolve a small but functional set of such data descriptions, with the correct level of semantic content, we expect that software decoders would be written to produce such data structures from the underlying files. In other words, the semantics would be added by the software decoder and put into standard structures that libraries can add value to (e.g., projection transformations, unit conversions, etc.) and that applications can use in a general way. Although our effort uses Java classes and interfaces, language-independent representations such as XML and RDF should also be possible.
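The division of labor proposed here can be sketched as follows: a decoder turns a format-specific input into a standard structure, and library code then works only with that structure. The sketch is in Python for brevity, whereas the effort described above uses Java. The names `GeoGrid`, `GridDecoder`, and `ToyDecoder`, and the dictionary layout standing in for a file, are all hypothetical.

```python
from dataclasses import dataclass
from typing import Protocol, Sequence


@dataclass(frozen=True)
class GeoGrid:
    """A standard structure: a named 2D field on a latitude/longitude grid."""
    name: str
    units: str
    lat: Sequence[float]
    lon: Sequence[float]
    values: Sequence[Sequence[float]]  # values[i][j] is the value at (lat[i], lon[j])


class GridDecoder(Protocol):
    """Anything that can turn a format-specific input into a GeoGrid."""
    def decode(self, raw: dict) -> GeoGrid: ...


class ToyDecoder:
    """Decoder for a made-up layout; a real one would parse an actual file format."""
    def decode(self, raw: dict) -> GeoGrid:
        return GeoGrid(
            name=raw["var"],
            units=raw["unit"],
            lat=tuple(raw["y"]),
            lon=tuple(raw["x"]),
            values=tuple(tuple(row) for row in raw["cells"]),
        )


def to_celsius(grid: GeoGrid) -> GeoGrid:
    """Library-level value added on the standard structure, independent of format."""
    if grid.units != "K":
        raise ValueError(f"expected kelvin, got {grid.units!r}")
    converted = tuple(tuple(v - 273.15 for v in row) for row in grid.values)
    return GeoGrid(grid.name, "degC", grid.lat, grid.lon, converted)
```

The unit conversion never sees the original layout. Supporting a new file format means writing one more decoder, with no change to the libraries or applications built on `GeoGrid`.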
We expect that the goals outlined above will be met by an evolution of current data distribution systems into a set of distributed data services on the Internet. The rapid development of high-speed networking and connections to the Internet will enable a transition from university-wide centralized machines to more distributed access across campuses and even at home. In this section we summarize some of the technologies that will likely be important in those evolutions.
Distributed systems: There is much industry effort in developing distributed systems, especially Web-based systems, as well as distributed objects using CORBA and Java-based technology. One useful model of our data distribution system is that of a wide-area network of distributed objects. Web browsers, especially using Java applets, will obviously be important for visualization and data browsing.
Distributed service locators: LDAP (Lightweight Directory Access Protocol) is a software protocol for locating services on the Internet, modeled after hierarchical file directories. Jini is a resource-discovery technology built on top of Java that enables applications to discover resources in a way that tolerates failures and allows resources to dynamically come and go on the network. Jini is not directly applicable to wide area networks because it uses a broadcast message protocol, but some of its features could be used for a similar Internet-wide service.
Hybrid push/pull data systems: Both push and pull architectures have strengths and weaknesses, and the features required by users need both types of services. While at the implementation level these systems might remain distinct, we expect that future data distribution systems will integrate both technologies.
Metadata standards: Extensible Markup Language (XML) is a widely deployed syntax for putting structured data into ASCII text, allowing users to define their own data tags and attributes. Resource Description Framework (RDF) provides a standard way of using XML to represent metadata in the form of properties and relationships of items on the Web. The Dublin Core is a standard set of metadata elements for categorization and resource discovery. While none of these address problems specific to scientific data sets, they are some of the tools our community might make use of in developing our own metadata standards.
Educational metadata standards: Instructional Management Systems (IMS) is a metadata standard for educational resources. This is important to the digital library community and to educational software developers, including vendors whose software is used in education, such as Microsoft and Adobe.
Reliable Multicast IP: One reason for postponing a solution to the automatic dynamic routing problem for the IDD was the hope that, by this time, multicast IP would be part of the infrastructure of the Internet. Although there has been much research on ways to implement "reliable multicast", so far we have not identified any implementations that are suitable for the LDM's requirement to buffer entire data products for retransmission after outages of minutes or hours, rather than buffering packet-sized units for periods of milliseconds.
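Returning to the metadata standards listed above, the sketch below shows how XML and the Dublin Core could be combined to describe a data set for discovery. The wrapper element `metadata`, the function name, and the field values are hypothetical. The element names and namespace are those of the Dublin Core element set. As the text notes, such a record says nothing specific to scientific data, such as coordinate systems or physical units.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"  # Dublin Core element set namespace


def dublin_core_record(fields):
    """Serialize a mapping of Dublin Core element names to values as XML text."""
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("metadata")  # hypothetical wrapper element
    for element, value in fields.items():
        child = ET.SubElement(root, f"{{{DC_NS}}}{element}")
        child.text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical description of a model output data set.
record = dublin_core_record({
    "title": "Regional model surface temperature",
    "creator": "Example University",
    "format": "netCDF",
    "coverage": "Colorado, USA",
})
```

A discovery service could index records like this by `coverage` and `title`, which would let users search by what the data is instead of by server name or file path.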
We have described two widely used data distribution systems, Unidata's IDD and DODS. We described in detail the problems that the IDD is facing due to increasing data streams, both to highlight for our community the coming changes and updates to the LDM software, and to clarify and motivate our long-term evolution of the IDD design.
We described the DODS system as representative of client/server distributed data systems. These kinds of systems are necessary for retrospective data access and for large data sets that cannot be replicated on everyone's local disk or even on a campus-wide file server.
We described desirable features of future data distribution systems from the user's point of view, and characterized them as requiring the presentation of data at the user's semantic level, appropriately hiding details of data location and format. We claim that assigning these semantics in a general and reusable way is the most challenging problem facing future data systems, one that has both social and technical aspects.
We mentioned some of the technological developments and products that may be useful in developing future data systems, as well as reporting on the lack of success so far of reliable multicast in enabling efficient wide area data distribution.
In the short term, Unidata will upgrade the LDM to handle our growing data streams. In the medium term, we will redesign the LDM to remove fundamental limits and add functionality, as well as explore and prototype new technologies. In the long term, in collaboration with other developers, we will evolve data distribution into Internet-wide distributed data services that will better match users' goals.
This paper is dedicated to the memory of Glenn Davis, whose vision and clarity of thought continue to guide Unidata software development.
Campbell, D. P., and R. K. Rew, 1988. "Design Issues in the Unidata Local Data Management System," Fourth International Conference on Interactive Information and Processing Systems for Meteorology, Oceanography, and Hydrology, Anaheim, California, Am. Meteor. Soc., 208-212.
Rew, R. K., and G. P. Davis, 1990. "Distributed Data Capture and Processing in a Local Area Network," Sixth International Conference on Interactive Information and Processing Systems for Meteorology, Oceanography, and Hydrology, Anaheim, California, Am. Meteor. Soc., 69-72.
Davis, G. P., and R. K. Rew, 1994. "The Unidata LDM: Programs and Protocols for Flexible Processing of Data Products," Tenth International Conference on Interactive Information and Processing Systems for Meteorology, Oceanography, and Hydrology, Nashville, Tennessee, Am. Meteor. Soc., 131-136.
Baltuch, M. S., 1997. "Unidata's Internet Data Distribution (IDD) System: Two years of data delivery," Proceedings, 13th International Conference on Interactive Information and Processing Systems for Meteorology, Oceanography and Hydrology, Long Beach, California, Am. Meteor. Soc., 168-171.
Davis, E. R., and J. Gallagher, "Using DODS to Access and Deliver Remote Data," Proceedings of the 15th International Conference on IIPS, Dallas, Texas, American Meteorological Society, January 1999.
Web sites:
CORBA: http://www.omg.org/
DODS: http://www.unidata.ucar.edu/packages/dods/
Dublin Core: http://purl.org/dc
IDD: http://www.unidata.ucar.edu/projects/idd/
IMS: http://www.imsproject.org
Jini: http://www.sun.com/jini/
LDAP: http://www.critical-angle.com/ldapworld/
LDM: http://www.unidata.ucar.edu/packages/ldm/
Reliable Multicast: http://research.ivv.nasa.gov/RMP/