Unidata Internet Data Distribution (IDD): A 1994 Update

Ben Domenico
Dave Fulker

1. INTRODUCTION¹

One year ago, we described an initiative, Internet Data Distribution or IDD², for disseminating real-time weather data and related information to universities via the Internet. At that time, a few adventurous departments were beginning a field test of the system. This test, which expanded to 32 sites, is now concluding. We summarize here what we have learned, changes we have made to the system, the current state of IDD, and plans for future deployment.

While technological lessons learned have been crucial to making IDD reliable, experience gained by working with many independent organizations to build a truly distributed system has been equally valuable.

2. SYSTEM OVERVIEW

The primary goal of IDD is to deliver data reliably to networks of computers in departments of participating universities. Recipients specify data products of interest and IDD delivers them to the local site as soon as possible after the data are available at the source.

2.1 Any Site Can Inject Data

IDD allows any site to act as a data source. Each source uses the Unidata LDM (Local Data Manager) software to send data to a few "downstream" nodes which, in turn, relay data to others, using the same software.

2.2 Fan Out for Scalability

Viewed from each source, IDD employs a hierarchical distribution. This design provides scalability. Because the number of branches to other relays or to end users (i.e., leaf nodes) supported by each relay is relatively small and independent of the total number of data recipients, the system scales indefinitely, as long as:

there are enough sites willing to serve as relays,
the computers running the LDM at all relay nodes are reasonably dependable,
the LDM software delivers data reliably even with typical problems in underlying networks, and
alternative routes can be employed when network and computer failures occur.
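
The scaling argument can be illustrated with a little arithmetic. The sketch below is not part of the LDM; it assumes an idealized tree in which every node relays to the same number of downstream nodes (the name fan_out and the value 4 are assumptions chosen for illustration) and counts how many relay tiers below the source are needed to reach a given number of sites.

```python
def relay_tiers(total_sites: int, fan_out: int) -> int:
    """Smallest number of tiers below the source that reaches total_sites.

    Assumes an idealized tree: the source feeds fan_out nodes, and every
    node in a tier feeds fan_out nodes in the next tier.
    """
    if total_sites < 0 or fan_out < 2:
        raise ValueError("need total_sites >= 0 and fan_out >= 2")
    tiers = 0
    reached = 0      # sites reached so far, not counting the source
    tier_size = 1    # the source itself
    while reached < total_sites:
        tier_size *= fan_out   # each node in the previous tier feeds fan_out more
        reached += tier_size
        tiers += 1
    return tiers


# With a hypothetical fan-out of 4, 32 sites fit within 3 tiers (4 + 16 + 64 = 84).
tiers_for_field_test = relay_tiers(32, 4)
```

Because the number of sites reached grows geometrically with the number of tiers, the load on any single relay stays fixed at fan_out connections no matter how many recipients are added.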

2.3 Table-Driven Control

Each LDM functions like a switch, receiving data products and routing them to appropriate locations. This behavior is controlled by a local table of pattern-action pairs which are referenced upon the arrival of a new data product. Each IDD product carries an identifier string--defined at the point of injection to be useful for data selection³--which is compared to all patterns in the table.

If the identifier matches a given pattern, the corresponding action defines how the data product is to be handled by the LDM, and multiple matches will yield multiple actions. The patterns are stored in the table as Unix regular expressions, a generalized form which supports a broad range of user options for controlling the data flow. Supported actions include storing the data in local files (whose names are also described as regular expressions) and initiating processes such as decoders. Thus the LDM supports event-driven processing: users can perform display updates, initiate model runs, or do any other task upon the arrival of new data which match a given pattern.
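
A minimal sketch of this table-driven dispatch follows. It is not the LDM's actual table syntax or code: the table entries, action names, file paths, and sample identifiers are all hypothetical, chosen only to show that an identifier is compared against every pattern and that multiple matches yield multiple actions.

```python
import re

# Hypothetical pattern-action table: (pattern, action name, argument).
# Neither the action names nor the arguments are taken from the LDM.
TABLE = [
    (re.compile(r"^SA"), "file", "surface/obs.txt"),
    (re.compile(r"^S[AP]US"), "exec", "decode_surface"),
    (re.compile(r"^FO"), "file", "model/guidance.txt"),
]


def dispatch(identifier, table=TABLE):
    """Return every (action, argument) whose pattern matches the identifier.

    The identifier is compared against all patterns, so a single product
    can trigger several actions.
    """
    return [(action, arg) for pattern, action, arg in table
            if pattern.search(identifier)]


# An illustrative identifier in the style of a WMO header.
actions = dispatch("SAUS70 KWBC 121200")
```

Here the sample identifier matches the first two patterns, so the product would be both filed and handed to a decoder; an identifier that matches nothing produces an empty list and is simply not acted upon.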

Each LDM also allows neighboring LDMs to register requests for data (i.e., subscriptions); this mechanism is implemented with tables to prevent unauthorized access. These requests are manifest as additional pattern-action pairs, where the action is to relay data to the subscriber. This is the primary device for routing data to relays and leaf nodes.

2.4 Queuing for Reliable Performance

The LDM software employs a queuing mechanism for elasticity. When relaying data across the Internet, congestion and (brief) outages can make it impossible to deliver products immediately, so a degree of elasticity is required to avoid product loss. Each product received by an LDM relay is placed in a "product queue" for subsequent delivery to other nodes, and the current implementation of this allows downstream network and machine outages of tens of minutes to occur without causing product losses.

This feature, combined with the use of a reliable underlying transport protocol, namely, TCP, yields an overall level of reliability that we believe will satisfy user requirements.
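
As a rough guide to how large a product queue must be to ride out an outage, the required capacity is simply the incoming data rate multiplied by the outage duration. The sketch below shows that arithmetic; the data rate used in the example is an assumed figure for illustration, not a measured IDD rate.

```python
def queue_bytes_needed(bytes_per_second: float, outage_minutes: float) -> int:
    """Bytes of queue needed to hold all products arriving during an outage."""
    if bytes_per_second < 0 or outage_minutes < 0:
        raise ValueError("rate and duration must be non-negative")
    return int(bytes_per_second * outage_minutes * 60)


# Hypothetical stream of 1200 bytes/s held through a 30-minute outage:
# 1200 * 30 * 60 = 2,160,000 bytes.
example_capacity = queue_bytes_needed(1200, 30)
```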

3. PUSH-PULL ALTERNATIVES FOR DATA DELIVERY

3.1 Data by Subscription

A distinguishing characteristic of the Unidata approach is that IDD allows users to specify, in advance, which data should be delivered to their local systems, and data are delivered as soon as they are available. Thus IDD is a data subscription service, implemented on an event-driven basis. A newspaper subscription is a good analogy: a priori, subscribers know that they will be interested in some portion of the information from a given source, and they want those data delivered to their premises as early as possible, on a regular basis. The analogy fails in some respects: LDM data selection criteria allow discrimination at the level of articles within the paper, and subscriptions to various IDD sources may or may not entail costs to the subscriber.

3.2 Data on Demand

In contrast, typical Internet-based data services are analogous to bookstores and libraries. In these cases, the provider assembles a collection of information that may be of interest, and the client peruses the offerings before deciding what to "bring home" for further study.

Thus the IDD "pushes" data from a source to a number of subscribers whereas a data center typically allows the user to "pull" data on demand. These models for providing information are not mutually exclusive. In the world of print media they coexist and complement one another, and the same can be true for electronic forms of provision. The subscription-style service of IDD complements those many systems already in place that deliver atmospheric and related data on demand.

A data center can use IDD to populate an archive with the latest data from a variety of observing systems. End-users of these data have two alternatives: they may subscribe to the same products (in real time) or they may acquire them (later) from the archive. Of course the data center also can function as an IDD source, possibly creating value-added products from real-time inputs and distributing them immediately upon creation.

In fact, IDD facilitates this. The LDM software can push the data not just into local holdings but also into local processes, thus supporting the creation of transformed or value-added data streams (in real time) or the tailoring of local holdings in unlimited ways.

4. PHILOSOPHY AND ROLES

IDD is a distributed system with interacting parts--data sources, relays and sinks--operated by Unidata universities at various locations across the nation. Thus it is a community endeavor, coordinated by the Unidata Program Center. Responsibilities for running and maintaining IDD are distributed as follows.

4.1 Unidata Program Center (UPC) Roles

LDM Software - The UPC develops, maintains, and supports the LDM software, which serves as linchpin for IDD, on an ongoing basis.
High-Interest Data - The UPC arranges for certain data of very high interest (determined by actual use) to be injected into IDD for use by U.S. universities at no cost or at discounted rates.
Routing - Routes (i.e., relay configurations) for high-interest data are chosen by the UPC.
Coordination - Overall coordination and policy setting for IDD are performed by the UPC in consultation with advisory committees.

4.2 University Roles

LDM Computers - Universities acquire and maintain the computers which, by running LDM software, serve as data relays and data sinks for IDD.
Internet Connections - Universities maintain connections to the Internet with adequate capacities to carry all data of local interest and often to relay these data to other users.
Cooperation - Universities relay data to one another and follow UPC guidelines on routing and other matters of IDD policy.
New Data - New types of data or other data that are of lesser interest to the community generally are injected into IDD by universities. Permissions to use such data are negotiated between the providers and recipients without UPC involvement.
Ad Hoc Routing - Universities (data providers and data recipients) negotiate their own routing arrangements for data of lesser interest, i.e., that are not in the high-interest category.

5. DATA AVAILABLE VIA IDD

IDD supports multiple sources for data, and these may vary in the number of receivers and the degree of responsibility the UPC holds for their acquisition and routing.

5.1 Traditional Data Sources

Using satellite earth stations, Unidata users have had access to Family of Services data (prepared by the National Weather Service and broadcast by Alden, Inc.) since 1985. Since 1988 this same system has also conveyed a collection of Geostationary Operational Environmental Satellite (GOES) images and related data prepared by the Space Science and Engineering Center (SSEC) at the University of Wisconsin-Madison.

Hence these data are in widespread use, and the IDD is intended to be capable of delivering them to every Unidata university. To achieve this, SSEC and Alden (under contract) are providing data and injecting them into the IDD system. That is, SSEC and Alden run LDM ingester and relay systems on their own premises and, through the Internet, transmit data to those top-level relays designated by the UPC.

The remainder of the distribution network for each of these data flows is run and maintained by members of the Unidata community, following guidance from the UPC on route selection. This gives the entire Unidata community no-cost access to all Family-of-Services data (DIFAX excepted) and to certain GOES data, all in real time. This is not really without cost, of course, because all users must maintain network connections adequate to receive the data they desire, and many users employ extra computing and network resources to relay data to other users. Clearly, IDD is a community endeavor requiring a high level of cooperation.

5.2 New Data Sources

In addition to the above, Unidata has contracted with WSI, Inc. to use IDD for giving universities access to data from the NEXRAD Information Dissemination System (NIDS). This is a new data source for universities and, at this writing, a few sites already are receiving images from the newest radars deployed by the National Weather Service. Unlike the Family of Services data, NIDS may be acquired through IDD only by those universities that have paid subscriptions with WSI for specified products.

The Forecast Systems Laboratory of NOAA/ERL is working with Unidata to provide mesoscale model outputs and related data to universities via IDD on an experimental basis. In fact, FSL data were the first to be conveyed regularly over an extended period using IDD methods, and this provided proof of principle. Subsequently, FSL extended Unidata a grant to help fund testing and continued development of IDD technologies.

Two other organizations are using IDD to disseminate experimental datastreams. The State University of New York at Albany, in cooperation with Geomet, Inc., is delivering lightning data (from the National Lightning Detection Network) to a number of sites, and the University of Michigan is using IDD to create "mirror" sites for its BlueSkies server of processed weather products targeted toward K-12 education.

6. LESSONS FROM TESTING

At this writing we are near the conclusion of an IDD field test involving nearly 30 sites. The primary technical problems uncovered and resolved during the test were:

A reliable data delivery system must cope with very high latency in some of the underlying Internet connections. Periods when the network essentially "drops out" for many minutes at a time are not unusual.
Gathering statistics on network performance can degrade the performance of the network itself if not done carefully.

6.1 Network Latency

The unpredictable response time of the Internet between some of the test nodes necessitated a substantial redesign of the Unidata LDM software. The initial test system, LDM 4.0, delivered and received data quickly and effectively over reliable network connections. However, network congestion and outages revealed two undesirable characteristics. First, if delivery to any node was slow, delivery also slowed to other nodes fed by the same relay. Second, when such problems were severe, the relay ultimately would fail to acquire products from the upstream node. That is, one congested Internet link could cause delays and data losses at nodes that were otherwise unaffected by the slow link.

The release of LDM 4.1 incorporates a product queue to store incoming products as discussed in Section 2.4. The queue easily can be made large enough to allow for outages of tens of minutes. The new design employs independent processes to feed downstream nodes, so one slow link does not affect other sites, though delivery obviously is delayed to sites which depend on the link. Even there, products are lost only when the queue nears capacity and the connection must be severed.
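
The behavior described above can be sketched with a small model. This is not the LDM 4.1 implementation: it stands in for the independent feeder processes with an independent read position per downstream node over one shared, bounded queue, and it treats a product as lost once it has been overwritten before a slow node read it. The class and method names are hypothetical.

```python
from collections import deque


class ProductQueue:
    """Bounded queue shared by several downstream feeders (illustrative model)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.products = deque()   # (sequence number, product), oldest first
        self.next_seq = 0
        self.cursors = {}         # downstream name -> next sequence number to send

    def subscribe(self, name):
        """Register a downstream node; it starts at the current end of the queue."""
        self.cursors[name] = self.next_seq

    def insert(self, product):
        """Add a product, discarding the oldest one if the queue is full."""
        self.products.append((self.next_seq, product))
        self.next_seq += 1
        if len(self.products) > self.capacity:
            self.products.popleft()

    def drain(self, name, limit):
        """Send up to `limit` products to one node; return (sent, number lost)."""
        oldest = self.products[0][0] if self.products else self.next_seq
        cursor = self.cursors[name]
        lost = max(0, oldest - cursor)   # overwritten before this node read them
        cursor = max(cursor, oldest)
        sent = [p for seq, p in self.products if seq >= cursor][:limit]
        self.cursors[name] = cursor + len(sent)
        return sent, lost
```

In this model a node on a congested link can fall behind without slowing any other node, since each drain call touches only its own read position; products are lost to that node alone, and only after the queue has wrapped past the point it had reached.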

This may be an indication that the site is requesting more data than its network connection can deliver. In such cases, network bandwidth should be increased or data requests should be limited.

Testing of LDM 4.1 is underway at this writing, and the results are promising. In one case, data products were still being delivered reliably (albeit over an hour late) in spite of network problems that resulted in nearly a third of all IP packets being lost.

6.2 Fine Tuning

Broader deployment of the IDD will require fine tuning of various LDM parameters, including those that relate to full product queues. An up-to-date LDM description is available via the Unidata World Wide Web server.⁴

6.3 Performance Monitoring

Early in the field test, measurements on the reliability of relays were made by moving verbose system logs among LDM systems to permit comparisons. The UPC assembled the results into hourly and daily reports. This approach was cumbersome and proved unreliable when the relay computers and network links were heavily loaded. Of course those are the times when performance statistics are most important.

Newer versions of the statistics system rely on time stamps for each product created at the point of IDD injection. Receiving nodes use this to count the products in hourly bins, and the counts are forwarded to the UPC for comparisons with counts made at the source. This system appears to operate accurately and nonintrusively. It also is described on the Unidata WWW server.⁵
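
The counting scheme can be sketched as follows. This is an illustration of the idea rather than the Unidata statistics code: the function names are hypothetical, and the sketch assumes each product's injection time stamp is available as a datetime value.

```python
from collections import Counter
from datetime import datetime


def hourly_counts(injection_times):
    """Count products per hour, keyed by the hour of their injection time stamp."""
    return Counter(t.replace(minute=0, second=0, microsecond=0)
                   for t in injection_times)


def percent_delivered(source_counts, received_counts):
    """Percentage of the source's products that a site received, per hourly bin."""
    return {hour: 100.0 * received_counts.get(hour, 0) / n
            for hour, n in source_counts.items() if n}


# Illustrative time stamps: four products injected, three received.
injected = [datetime(1994, 10, 1, 12, 5), datetime(1994, 10, 1, 12, 30),
            datetime(1994, 10, 1, 12, 59), datetime(1994, 10, 1, 13, 10)]
received = [injected[0], injected[2], injected[3]]
report = percent_delivered(hourly_counts(injected), hourly_counts(received))
```

Because each product is binned by its injection time rather than its arrival time, a product that arrives late still counts toward the hour in which it entered IDD, and only a handful of integers per hour need to be forwarded for comparison.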

7. FULL DEPLOYMENT

The IDD field test now includes twenty-eight sites, most of which will become relay nodes for full-scale IDD deployment. This means that most relay nodes will have had considerable experience as field test sites. We have now received over 40 applications for participation in a rapid deployment effort, which will add new sites on a first-come/first-served basis upon the production release of LDM 4.1. A more detailed description of the rapid deployment plan is available via WWW⁶ along with lists indicating the planned order of deployment⁷ and the routing topology for Family of Services⁸ and other data streams. Hourly summaries showing the percentage of products delivered to each site also are available.⁹

8. ACKNOWLEDGMENTS

As noted earlier, this is truly a community endeavor. The IDD field test sites have been diligent and patient during a period of rapid change in both the LDM software and the programs for monitoring system performance. In particular, users at the top-level relay nodes have been involved in the test from the outset and have borne the brunt of the problems. These include Harry Edmon at the University of Washington and John Kemp at the University of Illinois (whose systems are serving as data recovery sites as well as top-level relay nodes), Steve Chiswell at North Carolina State University, Ron Henderson at the State University of New York at Albany, and Tracy Mullen at the University of Michigan.

Current Unidata funding from the National Science Foundation Atmospheric Sciences Division supports the UPC, including subcontracts with data providers, and also offers periodic hardware grant opportunities for the participating universities. The Internet Data Distribution initiative was first described in Unidata: 1993-1998, a proposal to NSF/ATM. Besides the ongoing support from NSF/ATM and the continued contributions from our user community, Unidata has received monetary support from the Forecast Systems Laboratory of NOAA/ERL.

ENDNOTES

1. The Unidata Program Center is sponsored by the National Science Foundation and managed by the University Corporation for Atmospheric Research. Mention of a commercial company or product does not imply endorsement.

2. Domenico, Ben, Sally Bates, Dave Fulker, The Unidata Internet Data Distribution (IDD) System, AMS Symposium on Education, January 1994.

3. For conventional weather data the identifier string is the standard World Meteorological Organization (WMO) header.

4. WWW URL: http://www.unidata.ucar.edu/packages/index.html (no longer exists)

5. WWW URL: http://www.unidata.ucar.edu/projects/idd/statsplan.html (no longer exists)

6. WWW URL: http://www.unidata.ucar.edu/projects/idd/plans/deployment.html

7. WWW URL: http://www.unidata.ucar.edu/projects/idd/plans/queues.html

8. WWW URL: gopher://groucho.unidata.ucar.edu/00/systemsidd/test.info/fos.topo (no longer exists)

9. WWW URL: gopher://groucho.unidata.ucar.edu/00/systems/idd/ldmstats.nodes (no longer exists)

This page was Webified by Jennifer Philion.
Questions or comments can be sent to <support@unidata.ucar.edu>.
