
[IDD #PPJ-526440]: Request for info



Hi,

This is a follow-up to a phone conversation that Alain and I had on
the morning of Monday, July 8:

re:
> As per our telephone conversation, I would like to request the links for
> real time statistics and any pertinent information to help us
> troubleshoot this problem.

When encountering problems receiving data via the Unidata LDM/IDD, the
best course of action is:

1) check to see if real-time statistics have been reported by your
   machine(s) and are available for display on the Unidata website:

  Unidata HomePage
  http://www.unidata.ucar.edu

    Projects -> Internet Data Distribution
    http://www.unidata.ucar.edu/projects/index.html#idd

      IDD Current Operational Status
      http://www.unidata.ucar.edu/software/idd/rtstats/

        Statistics by Host
        http://www.unidata.ucar.edu/cgi-bin/rtstats/siteindex
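
   As an aside, a machine only shows up on the pages above if its LDM is
   running the rtstats utility.  This is normally enabled with an EXEC entry
   in ~ldm/etc/ldmd.conf along the lines of the following sketch (the exact
   reporting host name can vary with LDM version):

     EXEC "rtstats -h rtstats.unidata.ucar.edu"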

2) in the left hand column of the siteindex page you will find the
   classification of machines reporting real-time statistics by
   their domain name

   The machine(s) reporting real-time statistics for your domain
   will be listed in the right hand column of the siteindex page.
   Each machine name entry is a link to a set of information for
   that machine

   Example:

   Domain                 Hosts
   ca.gc.ec.cmc           ldm-data.cmc.ec.gc.ca [6.10.1]
                          ldm-wxo.cmc.ec.gc.ca [6.8.1]
                          noaaport3.cmc.ec.gc.ca [6.6.4]
                          noaaport4.cmc.ec.gc.ca [6.6.4]
                          tigge-ldm.cmc.ec.gc.ca [6.6.4]

3) the page that will be shown when one clicks on the name of the
   machine of interest will contain a set of links for each datastream
   that is being REQUESTed by that machine

   For instance:

   https://www.unidata.ucar.edu/cgi-bin/rtstats/siteindex?ldm-data.cmc.ec.gc.ca

   Real-time Statistics for ldm-data.cmc.ec.gc.ca [ LDM 6.10.1 ]
   FEED NAME

   HDS         latency     log(latency)     histogram     volume     products     topology
   IDS|DDPLUS  latency     log(latency)     histogram     volume     products     topology
   NEXRAD2     latency     log(latency)     histogram     volume     products     topology

   Cumulative volume summary     Cumulative volume summary Graph

4) the things to look at when assessing whether the problem being investigated
   is a local problem or one upstream are:

   latency     - the amount of time between the creation of a product (i.e., when
                 a product is first added to the original LDM queue from which it
                 is distributed) and its receipt (i.e., the time that the product
                 was received at the local machine)

   volume      - time series of the data volume received for the particular feed

   products    - time series of the number of products received for the particular
                 feed

   topology    - the route that a product takes from its creation to its receipt
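
   As a quick worked example of the latency calculation: a (hypothetical)
   product inserted into the upstream LDM queue at 12:00:00 UTC and received
   on the local machine at 12:00:45 UTC has a latency of 45 seconds.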

5) things that can be gleaned from the items listed in 4):

   latency     - the time history of the latency shows:

                 - whether products are being received in a timely manner

                   If the latencies are small, the products are being received
                   with little delay.

                 - whether there is anything wrong with the system clock on
                   the receiving machine

                   A trend in the lowest latencies typically shows that the
                   clock on the receiving machine is drifting.

                   A latency plot where the lowest latency is consistently
                   non-zero shows that the clock on the receiving machine is
                   either slow or fast.

   NB:

   - problems with the local clock should be fixed as soon as possible (see
     the clock-check sketch after these notes).  If they are not, then
     products may be missed, or no data may be received for some period of
     time, when the LDM is restarted for any reason.

   - latencies that approach 3600 seconds (one hour) are a warning that there
     is some problem receiving the datastream being REQUESTed.  When the
     latencies exceed 3600 seconds for an LDM installation that is configured
     in the "standard" manner, data _will_ be lost/not received/thrown away
     upon receipt.  The reason for this is that the LDM was designed for
     real-time delivery of data, and one of its working assumptions is that
     data that is an hour old is too old to be considered real-time.
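
   As a rough sketch of how to check the local clock (assuming the machine
   runs ntpd; these are ordinary system commands, not part of the LDM):

     # show the system clock in UTC for a quick sanity check
     date -u

     # list the configured NTP peers and the current offset from each
     ntpq -p

   If the machine is not running an NTP client at all, installing and
   enabling one is the usual fix for a drifting clock.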

   volume  - this time series shows how much data was received per hour for
             the feed in question

   NB:

   - LDM REQUESTs for feeds that have high data volumes (e.g., CONDUIT,
     NEXRAD2, FNMOC, HRRR) may need to be split into mutually-exclusive
     subsets.  The feed that has typically been split into five subsets is
     CONDUIT (sketched just below).  With the move to dual polarization full
     volume scan radar data, the NEXRAD2 feed has become a candidate for feed
     REQUEST splitting.
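
   As an illustration, a five-way CONDUIT split usually looks roughly like
   the following ldmd.conf REQUEST lines (a sketch only: "your.upstream.host"
   is a placeholder, and the patterns divide products by the last digit of
   the product sequence number so that each product matches exactly one
   REQUEST):

     REQUEST CONDUIT "[09]$" your.upstream.host
     REQUEST CONDUIT "[18]$" your.upstream.host
     REQUEST CONDUIT "[27]$" your.upstream.host
     REQUEST CONDUIT "[36]$" your.upstream.host
     REQUEST CONDUIT "[45]$" your.upstream.host

   Each REQUEST line opens its own connection to the upstream, which is why
   splitting can help when an individual connection is being bandwidth
   limited.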
   
   latency of low-volume feed(s) is acceptably low while the latency for
   high-volume feed(s) is unacceptably high:

   - very low latencies for a feed like IDS|DDPLUS coupled with very high
     latencies for a high volume feed like CONDUIT or NEXRAD2 is a classic
     indication of artificial bandwidth limiting in one or more legs in the
     network path being taken during data delivery.  We refer generically to
     this situation as "packet shaping".

     It is our experience that packet shaping is typically done "close" to the
     downstream node (i.e., the machine receiving data).  The network connection
     at/near UCAR/NCAR is never intentionally bandwidth limited, so if there is
     a bottleneck somewhere it is most likely not here.

   - when an instance of what looks to be packet shaping is discovered, it is
     the responsibility of the downstream site to initiate investigations into
     where the bottleneck may be.  We (Unidata/UCAR) are willing to help in the
     investigations and help with resolution of problems, but we typically have
     no influence when the problem resides in the downstream's institution.
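
   One simple starting point for such an investigation (a sketch only;
   replace the host name with the actual upstream host feeding your machine)
   is a traceroute from the downstream machine toward its upstream, looking
   for hops where the round-trip times jump sharply:

     traceroute -n idd.unidata.ucar.edu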

6) things to try when latencies for one or more feeds are unacceptably high:

   - determine if there are any network problems at one's institution

   - make sure that the LDM installed on one's machine(s) is functioning
     correctly and is reasonably up-to-date

   - check real-time statistics being reported to us (links above) to make
     sure that you really are not receiving the data

     This may sound funny, but it is our experience that a number of sites
     assume that they are not receiving data when they actually are and
     their problem is in processing the data received.
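
     A quick way to see locally whether products for a feed really are
     arriving is the notifyme utility that ships with the LDM (a sketch;
     NEXRAD2 here is just an example feed type):

       # ask the local LDM for notifications of NEXRAD2 products,
       # starting from one hour (3600 seconds) in the past
       notifyme -vl- -f NEXRAD2 -o 3600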

   - if a packet shaping signature is seen, try splitting the high volume
     feed(s) that are experiencing unacceptably high latencies

   - if still having problems after undertaking local investigations, send
     an email to:

     Unidata IDD Support <address@hidden>

     Please do _NOT_ phone individuals in Unidata for help or send email
     to Unidata staff members' private email addresses.  The reason for
     this is that the Unidata staff member may be out of the office
     and not able to respond to personal email or voicemail.  Email sent
     to the address above is reviewed by several Unidata staff throughout
     the work week and routinely on weekends and even holidays, so help
     will most likely be provided faster.

As we talked about during our phone conversation this morning, it is
my opinion that:

- the clock on one of your machines, ldm-data.cmc.ec.gc.ca, is not being
  properly maintained

  I can say this easily after looking at the latency plot for the
  IDS|DDPLUS feed - the linearly increasing trend in latency indicates
  that ldm-data's clock is drifting.

- the disparity in the latency for IDS|DDPLUS and NEXRAD2 on ldm-data
  indicates that there is some limit to how much data (volume) a
  single network connection can carry.  This situation might be mitigated
  by either:

  - finding the source of the bottleneck and getting it fixed

  - or, splitting the high volume NEXRAD2 feed into several (e.g., 5)
    mutually exclusive subsets

  In order to make recommendations on how to split the NEXRAD2 feed,
  we would need to see the LDM configuration file (~ldm/etc/ldmd.conf)
  in use on ldm-data.
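
  Purely as an illustration of the kind of splitting we might end up
  recommending (a sketch only: the upstream host name is a placeholder, and
  the actual patterns would be chosen, after seeing ldmd.conf, so that the
  five REQUESTs are mutually exclusive and together match every NEXRAD2
  product), a split on the radar station ID could look something like:

    REQUEST NEXRAD2 "/[KPT][A-C]" your.upstream.host
    REQUEST NEXRAD2 "/[KPT][D-H]" your.upstream.host
    REQUEST NEXRAD2 "/[KPT][I-M]" your.upstream.host
    REQUEST NEXRAD2 "/[KPT][N-S]" your.upstream.host
    REQUEST NEXRAD2 "/[KPT][T-Z]" your.upstream.host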

As a final comment I would like to add that we hold training workshops
each year for the software packages we support; the next training
workshop for the LDM will be held on August 1-2 at our facility here
in Boulder, CO.  There is still at least one slot open for the LDM training
session, but it may fill in the next day or so.  Information on our
training workshops can be found at:

Unidata HomePage
http://www.unidata.ucar.edu

  Events -> 2013 Training Workshop

Please let me know if there is anything in the above that is unclear
or needs further explanation.

Cheers,

Tom
--
****************************************************************************
Unidata User Support                                    UCAR Unidata Program
(303) 497-8642                                                 P.O. Box 3000
address@hidden                                   Boulder, CO 80307
----------------------------------------------------------------------------
Unidata HomePage                       http://www.unidata.ucar.edu
****************************************************************************


Ticket Details
===================
Ticket ID: PPJ-526440
Department: Support IDD
Priority: Normal
Status: Closed