Missing Data and Latency Problems

Sometimes a site will be running a correctly configured LDM and yet not receive all of the requested data. Or a site may receive data correctly most of the time but experience recurring episodes of missing data. Such symptoms often indicate network problems, which are difficult to address because you frequently have very little control over them. This page will help you diagnose and, where possible, treat these problems.

First, it is important that LDM site managers develop a good relationship with their system administrators, who are among the people best suited to assist with network problems. Furthermore, if your site is a data relay site, make sure they are aware of the role you are playing so that they can consider it when making changes that affect the network.

The LDM is designed to cope with network congestion. In particular, if a product doesn't make it to its destination site, the upstream LDM will try to re-send it. This will continue until the product is successfully sent, it becomes too old (over an hour, by default), or some other disabling event occurs, such as a machine or network connection going down. Because it involves fewer packets, a low-volume stream is more likely to be transmitted successfully than a high-volume one.
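
Relatedly, you can see how much data your own product queue holds, and thus how old a product can be before it is deleted locally, with the pqmon(1) utility. A minimal sketch; the exact output columns vary by LDM version:

      # Print statistics on the local product queue; the report includes
      # the age in seconds of the oldest product still in the queue:
      pqmon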

Missing Data Upon LDM Start-Up

Missing data is not uncommon when stopping and starting an LDM. This is because upon LDM start-up the upstream and downstream sites must reestablish which products should be sent to the downstream site. Then the products must be transferred, including any backlog that accumulated while the site was down.
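
After a restart, you can confirm that the feed has resumed by watching products arrive in the local queue:

      # Display products as they are inserted into the local queue;
      # interrupt with Ctrl-C when satisfied:
      ldmadmin watch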

We can break this situation down into three cases. First, if an LDM is down for only a brief time, say a few minutes, it is possible that no data will be lost. Second, if an LDM is down for between a few minutes and one hour, data may or may not be lost, depending on whether the upstream and downstream sites can reestablish which products should be sent and transfer the backlog of products in a timely manner. Third, if an LDM is down for over one hour, products more than one hour old will be lost, because data at the upstream site is, by default, deleted after one hour; products less than one hour old fare the same as in the second case.

In any of these cases, the system should eventually catch up and the problem of missing data should disappear. If it doesn't, the next two steps are to (1) evaluate your upstream feed and then (2) evaluate the connection to your upstream site.

Evaluating Your Upstream Feed

If you notice missing data, it may be due to a problem at your upstream feed. Here are a few tools to help determine whether your upstream site is actually up, whether the LDM is running on that site, whether it is receiving data, and whether it is configured to send data to your site.

Note that man pages are available for all the commands listed in this section and the following section. Be sure to check the man pages for detailed information about command options and output.
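
For example, the ldmping(1) and notifyme(1) utilities can answer most of these questions. A minimal sketch, using the upstream host from the configuration examples later on this page; substitute your own upstream host and feed types:

      # Is the remote host reachable, and is an LDM running and
      # answering requests on it?
      ldmping rainbow.eas.purdue.edu

      # Is the remote LDM receiving the products you want?  Request
      # notifications of matching products from the last hour (-o 3600),
      # logging to standard error (-l-):
      notifyme -vl- -h rainbow.eas.purdue.edu -f "DDPLUS|IDS|HDS" -o 3600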

Although these tools give you some information about your upstream feed, there is no way for you to determine what 'allow' entries your upstream feed has in its ldmd.conf file for your site. Thus, your upstream feed may be receiving the data you want but not be configured to send it to you. If you have ruled out network congestion (see below) and suspect this is the case, contact your upstream site.

Evaluating the Connection to Your Upstream Site

If your upstream site is up but you are still missing data, then you may have network congestion problems. If you suspect network congestion, take a look at the log; execute the command 'ldmadmin log'. If high latency is occurring, the downstream LDM log will contain messages similar to the following exchange between the upstream site 'rainbow' and the downstream site 'shadow':

      Nov 14 20:53:13 shadow rainbow[15060]:
            RECLASS: 19971114195313.567 TS_ENDT
                     {{DDPLUS,  ".*"},{IDS,  ".*"},{HDS,  ".*"}}
      Nov 14 20:53:13 shadow rainbow[15060]:
            skipped: 19971114195019.062 (174.505 seconds)
      Nov 14 20:58:29 shadow rainbow[15060]:
            RECLASS: 19971114195829.792 TS_ENDT
                     {{DDPLUS,  ".*"},{IDS,  ".*"},{HDS,  ".*"}}
      Nov 14 20:58:29 shadow rainbow[15060]:
            skipped: 19971114195339.392 (290.400 seconds)
      Nov 14 21:04:06 shadow rainbow[15060]:
            RECLASS: 19971114200406.219 TS_ENDT
                     {{DDPLUS,  ".*"},{IDS,  ".*"},{HDS,  ".*"}}
      Nov 14 21:04:06 shadow rainbow[15060]:
            skipped: 19971114195920.680 (285.539 seconds)

In this case, rainbow has sent shadow data older than shadow is willing to accept, and shadow has responded by sending rainbow a RECLASS message. The purpose of the RECLASS message is to reestablish which products should actually be sent, including the desired time range. The 'skipped' entries indicate that shadow is throwing away the old products it received. For example, in the first exchange shadow asks for products newer than 19:53:13.567, approximately one hour (the default maximum latency) before the time of the log entry; the skipped product, created at 19:50:19.062, missed that cutoff by 174.505 seconds.

For more information on the RECLASS and 'skipped' messages, see the LDM documentation.

A few standard tools can help evaluate the quality of your connection to your upstream site.
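
For example, ping(8) measures round-trip time and packet loss, and traceroute(8) shows the path your data takes. A minimal sketch, again using the upstream host from the examples on this page; option syntax varies by operating system:

      # Measure round-trip time and packet loss to the upstream host:
      ping -c 20 rainbow.eas.purdue.edu

      # Trace the route to the upstream host and look for slow or
      # lossy hops along the way:
      traceroute rainbow.eas.purdue.edu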

Options

Once network congestion has been confirmed, only a limited set of options is available. Here at the UPC we can make some changes to the IDD topology based on empirical knowledge of the network, but this approach is labor-intensive and limited by site resources.

If your network connection is bad, you will need to demonstrate the problem to your system administrator or ISP. The report must come from you as the customer; the UPC cannot submit it on your behalf. Since network problems can be intermittent, it is important to get your system administrator or ISP involved at a time when you can actually show them the problem.
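
One way to do that is to capture timestamped evidence while the problem is occurring; a sketch, where the output file name is arbitrary:

      # Record latency, loss, and the network path while the problem
      # is visible, so you can hand the output to your administrator:
      (date
       ping -c 50 rainbow.eas.purdue.edu
       traceroute rainbow.eas.purdue.edu) > netproblem.log 2>&1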

You can increase the '-m max_latency' parameter to ldmd(1) to give the upstream site more time to deliver products; you would have to increase your queue size proportionately. Of course, this won't do much good unless the upstream site also has a correspondingly large product queue; otherwise, the products will be deleted there before they can be sent. This approach lets you get the data, albeit in a less timely manner, so it is not a great solution.
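
A minimal sketch of the idea; the exact invocation and the way the queue is resized depend on your LDM version and start-up scripts, so treat the following as hypothetical:

      # Start the LDM accepting products up to two hours old instead of
      # the one-hour default of 3600 seconds (other arguments omitted):
      ldmd -m 7200 ...

      # The queue must then hold roughly twice as much data: if it was
      # sized for one hour of your feeds, double it when recreating it.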

Another thing you can try is to split the feed requests of your LDM so that multiple connections are used. For example, in etc/ldmd.conf change

      request DDPLUS|IDS|HDS  ".*"    rainbow.eas.purdue.edu
    

to

      request DDPLUS  ".*"    rainbow.eas.purdue.edu
      request IDS     ".*"    rainbow.eas.purdue.edu
      request HDS     ".*"    rainbow.eas.purdue.edu
    

Because of the multiple entries, the LDM will fork a separate process, and thus open a separate connection, for each of the three feeds. The parallel connections sometimes result in greater throughput.
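
You can confirm that separate connections were actually created with a standard tool such as netstat(8); a sketch, assuming the well-known LDM port, 388:

      # Each request line should produce its own TCP connection to the
      # upstream host on the LDM port (388):
      netstat -n | grep 388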

In the long run, however, you may need to request less data or get a fatter pipe. It is worth considering whether you can get by with less data; sites often request more data than they actually use. Alternatively, the NSF provides equipment grants that can include hardware for upgrading network connections; such grants are announced on the UPC homepage.

A last resort would be to purchase or build your own direct satellite receiving system for acquiring NOAAPORT data and feeding those data to your LDM. Required components include a satellite antenna, a demodulator for each desired channel, an ingest card, a PC, and ingest software. Contact support@unidata.ucar.edu for more information. Note that this option will not help with non-NOAAPORT data feeds such as NIDS or NLDN.