Monitoring the LDM System

System CPU load

The top and uptime utilities can be used to monitor the system CPU load:

      top
      ...
      uptime
      ...
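
When the load is sampled from a script rather than read by eye, the three load averages can be pulled out of the uptime output. The following is a minimal sketch; the function name load_averages is hypothetical, and it assumes the line ends with "load average: a, b, c" or "load averages: a, b, c", which is not the case on every platform.

```shell
#!/bin/sh
# Print the 1-, 5-, and 15-minute load averages found in uptime-style
# text read from standard input, separated by spaces.
# ASSUMPTION: the line ends with "load average: a, b, c" or
# "load averages: a, b, c"; lines without that text are ignored.
load_averages() {
    sed -n 's/.*load averages*: *//p' | tr -d ','
}
```

Under those assumptions, "uptime | load_averages" prints the three numbers on one line.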
    

System I/O load

The top and iostat utilities can be used to monitor the system I/O load:

      top
      ...
      iostat
      ...
    

Data-product latency

Using the IDD rtstats webpages

If your LDM is a node on the IDD and is a gateway LDM, then the most convenient way to monitor data-product latency is to go to the IDD rtstats webpages, select your computer, find the feedtype in which you're interested, and then select the latency plot.

Using ldmadmin watch

The ldmadmin utility can be used to monitor the data-product latency of incoming data:

      ldmadmin watch
      (Type ^D or ^C when finished)
      ...
    

The output is in the form

      MMM DD hh:mm:ss pqutil: nbytes YYYYMMDDhhmmss.sss ft seqno pid
    

where:

MMM DD hh:mm:ss
is the month, day, hour, minute, and second when the line was printed.
nbytes
is the size of the data-product in bytes.
YYYYMMDDhhmmss.sss
is the creation-time of the data-product.
ft
is the feedtype of the data-product.
seqno
is the sequence number of the data-product (and usually unimportant).
pid
is the identifier of the data-product.

The difference between the time at which a line was printed and the creation-time of the data-product gives an idea of the data-product latency.
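
The comparison of the two timestamps can be automated with a short filter. The following is a minimal sketch, not part of the LDM package: the function name watch_latency is hypothetical, and it assumes that lines are laid out exactly as in the form above, that both timestamps are in the same time zone, and that the latency is less than one day.

```shell
#!/bin/sh
# Read lines of the form
#   MMM DD hh:mm:ss pqutil: nbytes YYYYMMDDhhmmss.sss ft seqno pid
# from standard input and print, for each, the number of seconds between
# the creation-time (field 6) and the time the line was printed (field 3).
# ASSUMPTIONS: both timestamps use the same time zone and the latency is
# under 24 hours, so only the time-of-day parts are compared.
watch_latency() {
    awk 'length($6) >= 14 && $6 ~ /^[0-9]+(\.[0-9]+)?$/ {
        split($3, t, ":")
        printed = t[1] * 3600 + t[2] * 60 + t[3]
        created = substr($6, 9, 2) * 3600 + substr($6, 11, 2) * 60 + substr($6, 13)
        latency = printed - created
        if (latency < 0) latency += 86400   # printed just after midnight
        printf "%.3f\n", latency
    }'
}
```

It would be used as a filter on the output of ldmadmin watch. Lines that do not carry a creation-time in the sixth field are skipped.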

The product-queue

The pqmon utility can be used to monitor the product-queue. For example:

      pqmon
      Oct 27 16:48:28 pqmon: Starting Up (19969)
      Oct 27 16:48:28 pqmon: nprods nfree  nempty      nbytes  maxprods  maxfree  minempty    maxext  age
      Oct 27 16:48:28 pqmon:  70301    12  417968  1999781144    234870     1293    253410     92040 2867
      Oct 27 16:48:28 pqmon: Exiting
    

The above shows 70301 slots that each refer to a data-product (nprods), 12 slots that refer to gaps (i.e., contiguous empty space) in the product-queue (nfree), and 417968 slots that refer to nothing at all (nempty). The total number of slots is 488281 (nprods + nfree + nempty).

Since the product-queue was created, the maximum number of slots that referred to data-products is 234870 (maxprods), the maximum number of slots that referenced a gap is 1293 (maxfree), and the minimum number of empty slots is 253410 (minempty). The size of the largest gap currently in the product-queue is 92040 bytes (maxext) and the age of the oldest data-product in the queue is 2867 seconds (age).

Because this product-queue is known to have been active for quite some time (several months), the large minimum number of empty slots means that it was created with an unnecessarily large parameter specifying the maximum number of data-products: no more than 234871 slots (488281 - 253410) have ever been in use at once. The overhead of managing the queue could be slightly reduced by recreating the queue with a smaller number of slots (e.g., 250000).
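
The slot arithmetic above can be reproduced directly from the pqmon data line. The following is a minimal sketch; the function name pqmon_slot_summary and its output labels are hypothetical, and it assumes the columns appear in the order shown in the example output.

```shell
#!/bin/sh
# Read pqmon output from standard input and, for each data line, print
#   total_slots     = nprods + nfree + nempty
#   peak_slots_used = total_slots - minempty
#   never_used_pct  = minempty as a percentage of total_slots
# ASSUMPTION: columns are ordered as in the example above, so that
# field 5 is nprods, 6 is nfree, 7 is nempty, and 11 is minempty.
pqmon_slot_summary() {
    awk '$5 ~ /^[0-9]+$/ {
        total = $5 + $6 + $7
        printf "total_slots=%d peak_slots_used=%d never_used_pct=%.1f\n",
               total, total - $11, 100 * $11 / total
    }'
}
```

A possible invocation is "pqmon 2>&1 | pqmon_slot_summary"; the redirection is a precaution in case pqmon writes its messages to standard error. For the example output above this reports 488281 total slots, of which about 52% have never been needed.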

LDM availability on an upstream host

The ldmping utility can be used to determine the availability of an upstream LDM. For example

      ldmping -i 0 hostname
      MMM DD hh:mm:ss      State    Elapsed Port   Remote_Host           rpc_stat
      MMM DD hh:mm:ss      state       time port      hostname           rpcMsg
    

where:

MMM DD hh:mm:ss
is the current month, day, hour, minute, and second.
state
is the state of the upstream LDM:
NAMED
The hostname couldn't be converted into an IP address.
SVC_UNAVAIL
An LDM is not running on the upstream host on the expected port (both port 388 and the upstream host's portmapper will have been tried).
ADDRESSED
An LDM is running on the upstream host on the expected port but we're not allowed to connect to it (i.e., there's no ALLOW entry for our LDM in the configuration-file of the upstream LDM).
RESPONDING
An LDM is running on the upstream host on the expected port and we're allowed to connect to it.
time
is the amount of elapsed time.
port
is the port number. This is only valid if the state is RESPONDING.
hostname
is the name of the upstream host.
rpcMsg
is the associated message (if any) from the RPC layer.

If the state of the upstream LDM is anything other than RESPONDING, then an LDM on the computer on which the ldmping utility was executed will not be able to receive any data-products.
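
For unattended checks, the state can be extracted from the ldmping output and tested in a script. The following is a minimal sketch; the function name upstream_state and the fallback value UNKNOWN are hypothetical, and it assumes the state is the fourth whitespace-separated field, as in the layout shown above.

```shell
#!/bin/sh
# Read ldmping output from standard input and print the state reported
# on the last data line, or UNKNOWN if no recognized state was seen.
# ASSUMPTION: the state is the fourth field, as in
#   MMM DD hh:mm:ss state time port hostname rpcMsg
upstream_state() {
    awk '$4 ~ /^(NAMED|SVC_UNAVAIL|ADDRESSED|RESPONDING)$/ { state = $4 }
         END { if (state == "") state = "UNKNOWN"; print state }'
}
```

A script could then compare the result of "ldmping -i 0 hostname 2>&1 | upstream_state" with RESPONDING and raise an alert on anything else.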

If an ldmping to the upstream LDM shows no problems, then the notifyme utility can be used to determine what a downstream LDM connecting to the upstream LDM should receive:

      notifyme -vl- -h hostname
    

Monitoring a downstream LDM on the local system

You can monitor a downstream LDM process that is executing on the local system by setting its logging-level to verbose, at which time it will write to the LDM logfile the metadata of every data-product that it receives. The logging-level of a downstream LDM process can be changed by sending it a SIGUSR2 signal via the kill utility, e.g.,

      kill -s USR2 pid
    

where pid is the process-ID of the downstream LDM process, which can be discovered by searching the LDM logfiles for the relevant "Starting Up" message, e.g.,

      cd $HOME/logs
      grep -Fi 'Starting Up' `ls -rt ldmd.log*`
      ...
    

Monitoring an upstream LDM on the local system

You can monitor an upstream LDM process that is executing on the local system by setting its logging-level to debug, at which time it will write to the LDM logfile the metadata of every data-product that it sends (along with other debugging information). The logging-level of an upstream LDM process can be changed by sending it a SIGUSR2 signal via the kill utility, e.g.,

      kill -s USR2 pid
    

where pid is the process-ID of the upstream LDM process, which can be most easily discovered via the uldbutil utility:

      uldbutil
      ...
    

or, less conveniently, by searching the LDM logfiles for the relevant "Starting Up" message:

      cd $HOME/logs
      grep -Fi 'Starting Up' `ls -rt ldmd.log*`
      ...
    

The state of LDM network connections

Although netstat(1) is not a standard utility, it is available on many UNIX platforms and can be used to show the state of network connections. For example, here is the output of a netstat(1) command on a computer at the Unidata Program Center that's running FreeBSD:

      netstat -a -f inet -p tcp | awk 'NR<=2 || /ldm/'
      Active Internet connections (including servers)
      Proto Recv-Q Send-Q  Local Address          Foreign Address        (state)
      tcp4       0      0  emo.4392               flip.ldm               TIME_WAIT
      tcp4       0      0  emo.ldm                storm.sjsu.edu.39415   TIME_WAIT
      tcp4       0      0  emo.ldm                storm.sjsu.edu.39413   TIME_WAIT
      tcp4       0  33304  emo.ldm                storm.sjsu.edu.36000   ESTABLISHED
      tcp4       0  33304  emo.ldm                storm.sjsu.edu.35999   ESTABLISHED
      tcp4       0     44  emo.ldm                storm.sjsu.edu.35998   ESTABLISHED
      tcp4       0    928  emo.ldm                storm.sjsu.edu.35997   ESTABLISHED
      tcp4       0  33304  emo.ldm                storm.sjsu.edu.35996   ESTABLISHED
      tcp4       0  33304  emo.ldm                storm.sjsu.edu.35995   ESTABLISHED
      tcp4       0  33304  emo.ldm                storm.sjsu.edu.34569   ESTABLISHED
      tcp4       0   5828  emo.ldm                storm.sjsu.edu.34562   ESTABLISHED
      tcp4       0     44  emo.ldm                storm.sjsu.edu.34560   ESTABLISHED
      tcp4       0  23900  emo.ldm                storm.sjsu.edu.34561   ESTABLISHED
      tcp4       0  22240  emo.ldm                solon.meteoro.uf.2861  ESTABLISHED
      tcp4       0     44  emo.ldm                bigbird.tamhsc.e.50860 ESTABLISHED
      tcp4       0      0  emo.ldm                bigbird.tamhsc.e.50858 ESTABLISHED
      tcp4       0   8640  emo.ldm                bigbird.tamhsc.e.50857 ESTABLISHED
      tcp4       0      0  emo.ldm                bigbird.tamhsc.e.50856 ESTABLISHED
      tcp4       0     44  emo.ldm                bigbird.tamhsc.e.50855 ESTABLISHED
      tcp4       0     44  emo.ldm                bigbird.tamhsc.e.50854 ESTABLISHED
      tcp4       0      0  emo.1066               desi.ldm               ESTABLISHED
      tcp4       0     28  emo.1065               jackie.ldm             ESTABLISHED
      tcp4       0      0  emo.1064               thelma.ucar.edu.ldm    ESTABLISHED
      tcp4       0      0  emo.1063               thelma.ucar.edu.ldm    ESTABLISHED
      tcp4       0     28  emo.1062               shemp.ldm              ESTABLISHED
      tcp4       0      0  emo.1061               desi.ldm               ESTABLISHED
      tcp4       0      0  emo.1060               jackie.ldm             ESTABLISHED
      tcp4       0      0  *.ldm                  *.*                    LISTEN
    

This output assumes that the string "ldm" was associated with port 388 during the LDM Preinstallation Steps. You might have to adjust the above command to suit your operating system.

The last line of the output shows the top-level ldmd listening for TCP connections on port "ldm" (alias 388). The output also shows nineteen regular upstream LDMs (identified by having "emo.ldm" as the local address). The connections of two of these upstream LDMs are in TIME_WAIT and the associated processes should terminate soon. Seven active downstream LDMs are also shown near the end of the output. Lastly, the connection whose local address is "emo.4392" is special. This connection is due to the rtstats(1) process sending statistics to the Unidata Program Center computer "rtstats.unidata.ucar.edu" (alias "flip").
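
The counting done by eye above can be handed to awk. The following is a minimal sketch; the function name summarize_ldm_connections and its output labels are hypothetical, and it assumes the FreeBSD column layout shown above, with port 388 displayed as "ldm". It does not distinguish the rtstats(1) connection from ordinary downstream connections.

```shell
#!/bin/sh
# Read "netstat -a -f inet -p tcp" output from standard input and count
# LDM connections by role and state.
# ASSUMPTION: FreeBSD-style columns (field 4 is the local address,
# field 5 the foreign address, field 6 the state) and the service name
# "ldm" in place of port 388.
summarize_ldm_connections() {
    awk '
        $6 == "LISTEN"                          { next }    # top-level ldmd
        $4 ~ /\.ldm$/ && $6 == "ESTABLISHED"    { up++ }    # upstream LDMs
        $4 ~ /\.ldm$/ && $6 == "TIME_WAIT"      { tw++ }    # closing upstream
        $5 ~ /\.ldm$/ && $6 == "ESTABLISHED"    { down++ }  # downstream LDMs
        END {
            printf "upstream_established=%d upstream_time_wait=%d downstream_established=%d\n",
                   up, tw, down
        }'
}
```

Applied to the output above, it should report 17 established and 2 TIME_WAIT upstream connections and 7 established downstream connections, in agreement with the description.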

Metrics

If your computer system has the top, netstat, uptime, and vmstat utilities installed, and you have configured the /metrics parameters in the LDM registry correctly, then you can periodically accumulate LDM performance metrics in a file for subsequent display and analysis by executing the addmetrics command of the ldmadmin utility from a crontab entry. See Edit the LDM user's crontab(1) file.

Additionally, if your computer system has the gnuplot utility installed, then you can plot the LDM performance metrics by executing the plotmetrics command of the ldmadmin utility.
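
Before setting up metrics, it can be worth confirming that the utilities named above are actually on the LDM user's PATH. The following is a minimal sketch; the function name missing_utilities is hypothetical.

```shell
#!/bin/sh
# Print the name of each given utility that cannot be found on PATH,
# one per line. Prints nothing if all of them are available.
missing_utilities() {
    for util in "$@"; do
        command -v "$util" >/dev/null 2>&1 || printf '%s\n' "$util"
    done
}
```

For example, "missing_utilities top netstat uptime vmstat gnuplot" lists whichever of the five are absent. An empty result does not show that the /metrics parameters in the LDM registry are correct; that still has to be checked separately.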

Use the LDM logfile

The LDM logfile is your friend. If you encounter a problem, then one of the first things you should do is to look at it. Problems can often be diagnosed by comparing corresponding logfile entries from the upstream LDM and the downstream LDM.