Unidata has made available a new version of the LDM (Local Data Manager) that has the ability to handle data at a significantly higher rate than the previous version of the software. Below we describe scalability problems with the previous version and a little of the computer science that went into the design of the new version.
Although the previous LDM has been dealing with the load of distributing data on the IDD (Internet Data Distribution) system, scalability problems loom on the horizon. Before explaining those problems, we first need a little terminology.
A data product is the smallest unit of data dealt with by the LDM, and represents a single named data item, such as the report of an observation, a gridded data field, or an image. Data products can be as small as a few dozen bytes for a short bulletin or as large as many megabytes for a satellite image. The data streams commonly distributed to most IDD sites comprise almost 500,000 products per day, making it impractical to store each product as a single file. Data products from new data sources (radar data, model output, imagery) will increase the daily stream to nearly 1.5 million products per day by next year.
Each LDM system has a local data store (implemented as a single large file) called the product queue, which is shared among the LDM programs that get data from upstream nodes, read and process the data, and send data on to downstream nodes. Old products are expired out of the product queue when they are no longer needed. Product queues have typically held about an hour's worth of data from most data streams, though they can be configured to hold data from different data streams for different intervals.
The amount of time it previously took to insert or delete a product was dependent on the number of products in the product queue. When product queues were smaller than about 10,000 products this time was negligible, because the coefficient of the linear term is small. However, as the rate of data products has increased, this term has become very significant, to the point that the previous LDM system had difficulty keeping up when it had 50,000 or more products in its queue. This resulted in delayed data to downstream sites and even data loss if there were sustained periods during which the LDM could not insert data into the product queue as fast as it became available. At the Unidata Program Center we saw a dramatic demonstration of this problem when we tried running an LDM system to inject all data products, including radar data and some experimental data streams not yet generally available. An LDM program that was supposed to delete old products from the queue could not delete them as fast as new products were added, so data products were lost for lack of space in the queue.
The problem of designing a data structure to support the operations of search, insert, and delete efficiently is a classic computer science problem. Solutions include "balanced-trees" such as the AVL tree, the B-tree, and the red-black tree. In designing the new LDM product queue, we chose a relatively recent computer science discovery, the skip list [1], that provides a simpler and faster way to achieve high performance by relying on "probabilistic balancing".
The results of adding this technology to the LDM are dramatic improvements to the time needed to insert, delete, and find products in a large product queue. For example, it takes only 15 seconds for the new LDM compared with over 12 minutes for the previous LDM to delete the products out of a test queue.
There are two additional benefits of the new LDM design. First, it is no longer
necessary to run the pqexpire program, which is the worst current
bottleneck when there are a lot of products in the queue. Although pqexpire
still works with the new version, we recommend that new LDM systems be configured
to run without pqexpire, deleting old products as needed when new
products arrive. Second, there is no longer a need to specify an arbitrary limit,
such as one hour, for the time that products will reside in the queue. The size
of the queue specified when it is created will determine how long products will
stay in the queue, because they will only be deleted when the space they occupy
is needed for new products. This will provide additional elasticity for the
IDD and will mean that a significant amount of data remains available for processing
even if connectivity is lost to upstream data sources.
An additional LDM improvement is support for much larger product queues. Previously, product queue sizes were limited to about 2 Gbytes because of the signed 32-bit type used to represent a file offset on most systems; however we have eliminated this limit to support much larger product queues by modifying the LDM for use on architectures that support 64-bit file offsets. With larger product queues, a data archive could conceivably be represented and accessed as an LDM product queue, providing a retrospective data access for other LDMs.
References:
[1] W. Pugh. ``Skip Lists: A Probabilistic Alternative to Balanced Trees''. CACM vol. 33, p. 668-676, June 1990.