
[LDM #LWK-665773]: LDM load is growing



Paul,

> Hi.  I am one of the administrators of the LDM network for RAL at NCAR.
> 
> We have reached a point where one of the computers that forms the
> backbone of our internal LDM network is overloaded.
> 
> This computer, awl, is the single exposed host through which all data
> entering/leaving RAL passes.
> 
> The load on this machine has been slowly growing.  Please see the
> attached graphs showing load, and # of processes.

That's a lot of jobs.  What are they?  How many of the processes are downstream 
LDMs?  How many are upstream LDMs?

> We are also seeing rather large latencies in some cases:
> http://www.unidata.ucar.edu/cgi-bin/rtstats/iddstats_nc?NEXRAD2+awl.rap.ucar.edu+LOG
> http://www.unidata.ucar.edu/cgi-bin/rtstats/iddstats_nc?CONDUIT+awl.rap.ucar.edu+LOG

The CONDUIT latencies probably mean that you're losing data: products whose 
latency exceeds the LDM's maximum acceptable latency (one hour by default) are 
never delivered.

> I have a few questions:
> 
> #1)  What exactly does the latency in the rtstats graphs show?  Is
> this the time between when the upstream server makes a product
> available, and when awl receives it?

Product latency is the time that a product is received minus the time that the 
product was created (at the source site).
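For example, a product stamped 12:00:00 UTC at its source that arrives at awl 
at 12:03:30 UTC appears in the graphs as a latency of 210 seconds.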

> #2)  Can poorly tuned downstream LDMs negatively impact the performance
> of LDMs upstream?  For example, if the downstream LDMs are
> requesting more data than they actually use, this would create
> additional load on the upstream LDMs, right?

Correct, although less than you might think because upstream LDM processes use 
a read-lock on the product-queue and such locks can be shared.  There will be 
one upstream LDM process per incoming REQUEST, however, so the number of such 
requests affects the total number of processes on the system.
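
For example, each REQUEST entry like the following in a downstream host's 
ldmd.conf gives rise to one upstream LDM process on awl (the feedtypes and 
patterns here are only illustrative):

    # On shim, hack, or bit: one upstream LDM process on awl per line
    REQUEST NEXRAD2  ".*"  awl.rap.ucar.edu
    REQUEST CONDUIT  ".*"  awl.rap.ucar.edu

Two such lines on one downstream host mean two upstream processes on awl.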

> #3)  Our current LDM topology looks something like this:
> 
> 
>                                   +----------+
>                               +---|  shim    |
> +----------+    +----------+  |   +----------+
> |          |    |          |  |   +----------+
> | Internet |<-->|   awl    |--+---|  hack    |
> |          |    |          |  |   +----------+
> +----------+    +----------+  |   +----------+
>                               +---|  bit     |
>                                   +----------+
> 
> 
> 
> awl is the exposed host/gateway to the rest of the world.  It has
> nothing in its pqact.conf.

That's good: with an empty pqact.conf, awl does no local processing of 
products, which helps keep its load down.

>  shim, hack & bit all write data to
> cross-mounted disks for use by the rest of the organization.  There
> are also dozens of other internal machines (not pictured) which send
> or receive data from awl.
> 
> We can probably combine hack & bit into one machine, which would
> give us one extra machine to try to mitigate the load problems on
> awl.
> 
> We are considering a few ways to use this extra machine (hack):
> 
> #1)  Add hack as an additional exposed host.  Move some of awl's
> requests to the outside world over to hack, and then have awl
> request that data from hack.
> 
> #2)  Add hack as an additional exposed host.  Make it handle all
> outgoing requests, and have awl handle all incoming requests.
> 
> #3)  Add hack as an additional exposed host.  Split up incoming &
> outgoing requests between it and awl as evenly as possible.
> 
> #4)  Add the new machine like this:
> 
> {OUTSIDE1} <---->  {NEWMACHINE}  <--> {AWL} <----> {INTERNAL1}
> {OUTSIDE2} <---/                             \---> {INTERNAL2}
> {OUTSIDE3} <--/                               \--> {INTERNAL3}
> {OUTSIDEN} <-/                                 \-> {INTERNALN}
> 
> 
> What do you think would be the most effective use of this extra LDM
> host?

I'll need more information.  How many incoming data connections do you 
have/need?  How many outgoing?
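
For what it's worth, option #1 would amount to ldmd.conf entries along these 
lines (the hostnames are from your description; the feedtype split and the 
placeholder upstream host are only illustrative):

    # On hack (the new exposed host): take over some of the external feeds
    REQUEST CONDUIT  ".*"  <outside upstream host>
    ALLOW   CONDUIT  ^awl\.rap\.ucar\.edu$

    # On awl: request that data from hack instead of from the outside world
    REQUEST CONDUIT  ".*"  hack.rap.ucar.edu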

Do you truly use all the data you receive?
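
If not, narrowing the extended regular expressions in your REQUEST entries is 
the cheapest way to shed load.  A sketch (NEXRAD2 is from your graphs; the 
station pattern is only illustrative, so check it against your actual product 
identifiers):

    # Instead of requesting every level-II radar:
    #   REQUEST NEXRAD2  ".*"                 <upstream host>
    # request only the radars you actually use:
    REQUEST NEXRAD2  "(KFTG|KCYS|KPUX)"  <upstream host>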

I think a face-to-face meeting is in order.  How about next Monday?

> Thanks,
> Paul Prestopnik

Regards,
Steve Emmerson

Ticket Details
===================
Ticket ID: LWK-665773
Department: Support LDM
Priority: Normal
Status: Closed