
Re: 20020208: LDM resources under Linux



Unidata Support wrote:
> 
> ------- Forwarded Message
> 
> >To: Unidata Support <address@hidden>
> >From: David Wojtowicz <address@hidden>
> >Subject: Re: 20020206: large CONDUIT latencies at UIUC
> >Organization: UCAR/Unidata
> >Keywords: 200202081436.g18Ea8x27570
> 
> We're still experiencing very large peaks in latencies on NMC2
> to flood.atmos.uiuc.edu
> 
> I am wondering if the machine itself could be the bottleneck.
> The load avg  gets up fairly high during portions of the day (9+!)
> and doesn't drop below about 3 ever.   The load appears entirely
> due to LDM relay activity. (It does not run pqact or any other time
> consuming process and the load drops to near zero when the LDM is stopped)
> It is servicing several NMC2 and a good number of NNEXRAD|FNEXRAD downstream
> requests. It is a 400MHz PC running Linux with 512MB of RAM, dedicated
> only to LDM relay.
> 
> I've seen somewhat high loads on other machines running LDM and servicing
> lots of requests.  How can one tell at which point the machine has become
> the bottleneck?   If so, what is the recommended capacity to be able to
> handle this better?
> 
> --
> | David Wojtowicz, Sr. Research Programmer
> | Department of Atmospheric Sciences Computer Services
> | University of Illinois at Urbana-Champaign
> | email: address@hidden  phone: (217)333-8390
> 
> ------- End of Forwarded Message

Hi David,

That does sound like a high load, especially for a relay-only machine.
We've always considered rpc.ldmd to be relatively easy on the CPU,
compared to pqact and any decoders that might also be running.  But
CONDUIT is a bear.  How many sites are actually connecting to you?  Of
those, how many are requesting all or part of CONDUIT?  And how many
rpc.ldmds are running?
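
(If you want a quick count of the rpc.ldmds and a look at the load
while the LDM is busy, something along these lines on flood should do
it; these are generic Linux commands, nothing LDM-specific:

   ps -ef | grep rpc.ldmd | grep -v grep | wc -l
   uptime

If your version's ldmadmin has a 'watch' option, that is also a handy
way to confirm products are still flowing while the load is high.)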

Here at Unidata I can see how promptly products are arriving at your
downstream sites by running 'notifyme' against them.  For example,
according to our site contact list (which isn't always accurate),
climate.geog.udel.edu should be feeding from you.  Here are a few
products as they arrive at squall:

Feb 08 23:48:21 notifyme[15092]:      177 20020208234820.046 IDS|DDPLUS 448  SRUS56 KWOH 082341 /pRRSMRX
Feb 08 23:48:21 notifyme[15092]:     1052 20020208234820.048 IDS|DDPLUS 449  SRUS74 KWOH 082341 /pRRSOUN
Feb 08 23:48:21 notifyme[15092]:      252 20020208234820.050 IDS|DDPLUS 450  SRUS70 KWOH 082341 /pRRSLIX
Feb 08 23:48:21 notifyme[15092]:      177 20020208234820.052 IDS|DDPLUS 451  SXUS44 KWOH 082341 /pRRSOAX

and as they arrive at climate.geog.udel.edu:

Feb 08 23:48:21 notifyme[15091]:      177 20020208234820.046 IDS|DDPLUS 448  SRUS56 KWOH 082341 /pRRSMRX
Feb 08 23:48:21 notifyme[15091]:     1052 20020208234820.048 IDS|DDPLUS 449  SRUS74 KWOH 082341 /pRRSOUN
Feb 08 23:48:22 notifyme[15091]:      252 20020208234820.050 IDS|DDPLUS 450  SRUS70 KWOH 082341 /pRRSLIX
Feb 08 23:48:22 notifyme[15091]:      177 20020208234820.052 IDS|DDPLUS 451  SXUS44 KWOH 082341 /pRRSOAX

(Interesting that the PIDs are within 1 of each other.)  Assuming both
squall's and udel's clocks are accurate, there is at most a one second
delay between the two.  Assuming that all squall rpc.ldmds are getting
equal resources, it doesn't look like squall is a bottleneck, at least
not now while no CONDUIT data is arriving.  I could find out when the
various CONDUIT surges occur, and do some more testing if you like.

Of course, this approach includes time for a product to travel across
the network, which should not be counted in squall's contribution to a
product's latency per se.  
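
If you'd like to run the same check from your end, an invocation of
roughly this form should work (the flags here are typical notifyme
usage rather than a transcript of my session, and the target site's
ldmd.conf has to have an allow line that permits your host):

   notifyme -vl- -h climate.geog.udel.edu -f 'IDS|DDPLUS' -o 3600

That asks the remote rpc.ldmd for notifications of IDS|DDPLUS products
received in the last hour and prints them to your terminal.  The gap
between a product's queue timestamp and the time the notification shows
up gives a rough end-to-end latency; e.g., product 448 above was queued
at 23:48:20.046 and the notifications arrived here at 23:48:21 and
23:48:22.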

Or, our stats pages give some indication of how things are going.
Here's the page for FOS routing:
http://www.unidata.ucar.edu/projects/idd/status/idd/fosTopo.html.  There
are similar pages for MCIDAS and a few other feeds, but not CONDUIT;
still, heavy CONDUIT handling would likely impact the other feeds to
some degree.  From the FOS page, the delta between squall's latencies
and the downstream latencies is pretty small, except in a few cases.
Since only a few sites show bad latencies, the network connections are
the more likely culprit.  Keep in mind that these statistics can be up
to an hour old.

What I think you're asking is how to measure the delay that is being
accrued at your machine and only at your machine.  I think that would be
possible except for a bug in rpc.ldmd.  On my own machine, I put my
inbound rpc.ldmd in debug mode.  This lists product arrival time and
signature for every product.  Then I also tried to put an outbound feed
that relays the same products in debug mode, but this regularly
crashes.  (This is a known bug: I think the buffer into which the debug
output is placed is too small, resulting in a segmentation violation.
Hopefully it will be fixed in 5.1.5.)  If I could successfully do that, I
could track how long it takes a particular product to be relayed to all
downstream sites.  To my knowledge, we've never done such a
calculation.  But, in theory anyway, it could be done.
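
Just to illustrate the idea: with both debug logs in hand, you would
look up the same product signature in each and take the difference of
the timestamps, something like (the log file names and signature here
are only placeholders for the example):

   grep '<product signature>' ldmd_inbound.log ldmd_outbound.log

The gap between the two timestamps for a given signature would be the
time that product spent on the relay before being sent downstream.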

Regarding recommended capacity, we make some general recommendations
about memory, disk speed, and CPU speed based on how much data is
expected to pass through a site and on the number of downstream sites.
We identify sites that are overloaded by "unsatisfactory" latencies at
their downstream sites, or when their administrators are having trouble
with responsiveness.

Are you concerned that you're not serving your downstream sites
properly?  Or, are there things you would like squall to be able to do
that it can't?  Is it not responsive enough?  I can talk with our sys
admin about possible improvements in the OS configuration.  Or, if the
burden is too high, we can try reshuffling the topology.

Also, we have had some success with CONDUIT latencies in particular by
splitting the feed up into multiple request connections; I've sketched
below what the ldmd.conf requests might look like.  The cost is more
rpc.ldmds on both the sending and receiving ends, so it affects other
sites besides your own.  But the net effect was a big improvement, and
this could be a possibility for you.  Let me know if you'd like to
pursue this further, or if you have any other questions.
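
One way to do the split (this is a sketch, not the exact lines we used)
is to key on the sequence number at the end of the CONDUIT product IDs,
so the requests divide the products by their last digit.  A five-way
split would look something like this, with the upstream host name as a
placeholder and these lines replacing the single all-of-CONDUIT request:

   request CONDUIT "[09]$" your.upstream.host
   request CONDUIT "[18]$" your.upstream.host
   request CONDUIT "[27]$" your.upstream.host
   request CONDUIT "[36]$" your.upstream.host
   request CONDUIT "[45]$" your.upstream.host

Each request gets its own rpc.ldmd connection, so the pieces of the
feed move in parallel instead of queuing up behind one another.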

It was a pleasure to meet you at AMS and to learn how to say your name!

Anne
-- 
***************************************************
Anne Wilson                     UCAR Unidata Program            
address@hidden                  P.O. Box 3000
                                  Boulder, CO  80307
----------------------------------------------------
Unidata WWW server       http://www.unidata.ucar.edu/
****************************************************