Sometimes a site will be running a correctly configured LDM and yet not be receiving all requested data. Or a site may receive data correctly most of the time but experience recurring episodes of missing data. Such symptoms often indicate network problems. Network problems are difficult to resolve because you frequently have very little control over them. This page is intended to help you diagnose and, where possible, treat these problems.
First, it is important that LDM site managers develop a good relationship with their system administrators. System administrators are among the people best suited to assist with network problems. Furthermore, if your site is a data relay site, it is important that they be aware of the role you are playing so that they can take it into account when making changes that affect the network.
The LDM tries to deal with the problem of network congestion. In particular, if a product doesn't make it to its destination site, the upstream LDM will try to re-send it. This will continue until the product is successfully sent, or it becomes too old (over an hour, by default), or some other disabling event occurs such as a machine or network connection going down. Because it involves fewer packets, a low-volume stream is more likely to be transmitted successfully than a high-volume stream.
When an LDM goes down, the outcome depends on how long the outage lasts, and we can break this down into three cases. First, if an LDM is down for just a brief time, say, a few minutes, it is possible that no data will be lost. Second, if an LDM is down for between a few minutes and one hour, data may or may not be lost, depending on whether the upstream and downstream sites can reestablish which products should be sent and transfer the backlog of products in a timely manner. Third, if an LDM is down for over one hour, products over one hour old will be lost because data at the upstream site is, by default, expired after one hour. The success of transfer of products less than one hour old is the same as in the second case.
In any of these cases, the system should eventually be able to catch up and the problem of missing data should disappear. If it doesn't, the next two steps are to (1) evaluate your upstream feed and then (2) evaluate the connection to your upstream feed.
Note that man pages are available for all the commands listed in this section and the following section. Be sure to check the man pages for detailed information about command options and output.
ping - This is a UNIX command. It sends a continuous stream of small packets to the named host, so that you can watch the quality of the connection over time. Upon completion it gives a few summarizing statistics. With ping you can see if a site is connected and responding. It also gives some indication of network delays and packet delivery reliability.
Example: ping host_name
Note that by default ping only sends very small packets. The reliability of transmission of larger packets may be different. See the man page for info regarding how to change the packet size.
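As a rough illustration of testing with more than one packet size, the following Python sketch builds and runs a bounded ping. It assumes a Linux/BSD-style ping in which -c sets the number of packets and -s sets the payload size in bytes; other systems may use different options, so check the local man page first. The function names are hypothetical and are not part of the LDM.

```python
import subprocess


def build_ping_command(host, count=10, packet_size=None):
    """Return the argument list for a bounded ping run.

    Assumption: a Linux/BSD-style ping where -c is the packet count
    and -s is the payload size in bytes.
    """
    command = ["ping", "-c", str(count)]
    if packet_size is not None:
        command += ["-s", str(packet_size)]
    command.append(host)
    return command


def run_ping(host, count=10, packet_size=None):
    """Run ping and return its text output, including the summary lines."""
    result = subprocess.run(
        build_ping_command(host, count, packet_size),
        capture_output=True,
        text=True,
        check=False,  # a non-zero exit status just means packets were lost
    )
    return result.stdout
```

Calling run_ping("host_name", count=5, packet_size=1400) and comparing its summary with that of a default-size run gives a first impression of whether larger packets are delivered less reliably.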
ldmping - This is an LDM command. It checks to see if an LDM server is running on a remote host. It does not require the site running ldmping to be on the access control list of the upstream LDM (unlike notifyme, below), and thus can be used on any host. It should return the string "RESPONDING" in the state field. If it doesn't, check the man page to see what the state means.
Example: ldmping -i 5 -h host_name (pings every five seconds)
Since network traffic varies over time, it can be useful either to run ldmping for a while, say half an hour, or to use it to sample at a variety of times.
notifyme - This is an LDM command that shows whether the upstream host is receiving data. The remote host must have an 'allow' line for your site in its ldmd.conf file, so it also confirms that the upstream site is configured to accept requests from you.
Example: notifyme -vl - -h host_name
notifyme returns (1) the current system time, (2) the string 'notifyme', (3) the product size, (4) the time the product was injected into the IDD, (5) the feed type, (6) a sequence number, and (7) the product ID with the actual product time. If you take the difference between the first and fourth fields, you can see how long it took the product to arrive at the host.
Although these are ways for you to get some information regarding your upstream feed, there is no way for you to determine what 'allow' entries your upstream feed has in its ldmd.conf file for your site. Thus, your upstream feed may be getting the data you want, but it may not be configured properly to send it to you. If you have ruled out network congestion (see below) and suspect this is the case, contact your upstream site.
If your upstream site is up but you have missing data, then you may have network
congestion problems. If you suspect network congestion at this point, take
a look at the log. Execute the command: ldmadmin log
If high latency is occurring, the downstream LDM log will contain messages similar
to the following exchange between the upstream site 'rainbow' and the downstream
site 'shadow':
> Nov 14 20:53:13 shadow rainbow[15060]:
> RECLASS: 19971114195313.567 TS_ENDT
> {{DDPLUS, ".*"},{IDS, ".*"},{HDS, ".*"}}
> Nov 14 20:53:13 shadow rainbow[15060]:
> skipped: 19971114195019.062 (174.505 seconds)
> Nov 14 20:58:29 shadow rainbow[15060]:
> RECLASS: 19971114195829.792 TS_ENDT
> {{DDPLUS, ".*"},{IDS, ".*"},{HDS, ".*"}}
> Nov 14 20:58:29 shadow rainbow[15060]:
> skipped: 19971114195339.392 (290.400 seconds)
> Nov 14 21:04:06 shadow rainbow[15060]:
> RECLASS: 19971114200406.219 TS_ENDT
> {{DDPLUS, ".*"},{IDS, ".*"},{HDS, ".*"}}
> Nov 14 21:04:06 shadow rainbow[15060]:
> skipped: 19971114195920.680 (285.539 seconds)
In this case the RECLASS message means that rainbow has sent shadow data that is older than shadow wishes to accept. Shadow responded by sending the upstream site the RECLASS message. The purpose of the RECLASS message is to reestablish which products should actually be sent, including information about the desired time ranges. Also, the 'skipped' entries indicate that shadow is throwing away the old products it received.
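To make the relationship between the two timestamps concrete, here is a minimal Python sketch that recomputes the number of seconds reported on each 'skipped' line. It assumes the timestamps have the form YYYYMMDDhhmmss.sss, which is what the sample log suggests; the function names are hypothetical and are not part of the LDM.

```python
from datetime import datetime


def parse_ldm_timestamp(stamp):
    """Parse a timestamp assumed to have the form YYYYMMDDhhmmss.sss."""
    return datetime.strptime(stamp, "%Y%m%d%H%M%S.%f")


def seconds_too_old(reclass_stamp, skipped_stamp):
    """Return how many seconds older the skipped product is than the
    oldest time the downstream site is willing to accept."""
    delta = parse_ldm_timestamp(reclass_stamp) - parse_ldm_timestamp(skipped_stamp)
    return delta.total_seconds()


# (RECLASS time, skipped product time) pairs copied from the log above.
sample_pairs = [
    ("19971114195313.567", "19971114195019.062"),
    ("19971114195829.792", "19971114195339.392"),
    ("19971114200406.219", "19971114195920.680"),
]

for reclass_stamp, skipped_stamp in sample_pairs:
    print(f"{seconds_too_old(reclass_stamp, skipped_stamp):.3f} seconds too old")
```

For the three pairs above, the computed differences match the values in parentheses on the corresponding 'skipped' lines, which is consistent with reading that number as how far the product falls outside the accepted time range.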
For more information on the RECLASS and 'skipped' messages, click here.
Here are a few tools to help in evaluating the quality of the connectivity to your upstream site:
traceroute - Use this UNIX command to identify the route through which information flows from one host to another. It will also show where slow links are located. After using traceroute to determine the intermediate hops between your machine and the destination machine, you can use ping to test where a source of packet loss may be located.
Example: traceroute host_name
traceroute picks a path to the named host, then, by default, sends three probes to each gateway along the path. Each line of output displays the timing results of these three probes. By default, a probe that gets no response within five seconds is reported by displaying a '*' for that probe. See the man page for more information and for how to change the default values.
ftp - This is the UNIX file transfer utility. Grabbing a junk file can give you another perspective on the degree of network congestion. This requires that you know of, and have access to, a file to transfer.
Example: ftp host_name, then follow the protocol to transfer a file.
Example: netcheck
Once network congestion has been confirmed, only a limited set of options is available. Here at the UPC we can make some changes in the IDD topology based on empirical knowledge of the network. However, this solution can be labor intensive and is limited by site resources.
If your network connection is bad, you will need to show the problem to your system administrator or ISP. The submission of the network problem must come from the customer, so the UPC cannot submit the request on your behalf. Since network problems can be intermittent, it's important to get your system administrator or ISP involved when you can actually show them the problem.
You can increase the '-m max_latency' parameter to rpc.ldmd to allow the upstream site more time to deliver. This would allow you to get the data, albeit in a less timely manner. You would have to increase your queue size proportionately. Of course, this won't do much good unless the upstream site has a correspondingly large product-queue; otherwise, the products will be garbage-collected before they can be sent. This is not a great solution.
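As a worked example of increasing the queue size proportionately, the Python sketch below scales a queue size by the ratio of the new max_latency to the old one. The 3600-second starting latency corresponds to the one-hour default mentioned earlier; the queue size and the function name are hypothetical values chosen only for illustration.

```python
def scaled_queue_size(current_queue_bytes, current_max_latency_s, new_max_latency_s):
    """Scale the queue size in proportion to the change in max_latency.

    The result is rounded up so that the queue is never undersized.
    """
    if current_max_latency_s <= 0 or new_max_latency_s <= 0:
        raise ValueError("latencies must be positive")
    numerator = current_queue_bytes * new_max_latency_s
    # Integer ceiling division avoids floating-point rounding.
    return (numerator + current_max_latency_s - 1) // current_max_latency_s


# Hypothetical site: a 500,000,000-byte queue with the default one-hour
# max_latency, raised to two hours.
print(scaled_queue_size(500_000_000, 3600, 7200))
```

Doubling max_latency doubles the required queue size under this rule. The same proportion applies to the upstream site's product-queue, which you do not control.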
Another thing you can try is to split the feed requests of your LDM so that multiple connections are used. For example, in etc/ldmd.conf change
request DDPLUS|IDS|HDS ".*" rainbow.eas.purdue.edu
to
request DDPLUS ".*" rainbow.eas.purdue.edu
request IDS ".*" rainbow.eas.purdue.edu
request HDS ".*" rainbow.eas.purdue.edu
Because of the multiple entries, the LDM will fork off a separate process and, thus, a separate connection for each of the three feeds. The parallel connections sometimes result in greater throughput.
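If you have several combined entries to split, the edit to etc/ldmd.conf described above can be generated rather than typed by hand. The following Python sketch is one way to do that, assuming feed types are joined with '|' as in the example; the helper name is hypothetical and is not an LDM utility.

```python
def split_request(feedtypes, pattern, upstream_host):
    """Turn one combined request entry into one entry per feed type.

    feedtypes is a '|'-separated list such as "DDPLUS|IDS|HDS".
    """
    return [
        f'request {feed.strip()} "{pattern}" {upstream_host}'
        for feed in feedtypes.split("|")
        if feed.strip()
    ]


for line in split_request("DDPLUS|IDS|HDS", ".*", "rainbow.eas.purdue.edu"):
    print(line)
```

The printed lines can be pasted into etc/ldmd.conf in place of the original combined entry.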
In the long run, however, you may need to ask for less data or get a fatter pipe. It is worth considering whether you can get by with less data, since sites often request more data than they actually use. Alternatively, NSF provides grants for equipment, which might include hardware to upgrade network connections. Such grants are announced on the Unidata homepage. You can also see an old announcement from the NSF for more information.
A last resort would be to purchase or build your own direct satellite receiver system for acquiring NOAAPORT data and feeding these data to your LDM. Some of the required components include: a satellite antenna, an ingest card, a PC, some software, and a demodulator for each channel desired. Contact support@unidata.ucar.edu for more information. This option will not help with non-NOAAPORT data feeds, such as NIDS or NLDN.