
Re: 20011103: Characterization of data load on LDM and predicting impacts on LDM queue load



>To: address@hidden
>From: Gregory Grosshans <address@hidden>
>Subject: Characterization of data load on LDM and predicting impacts on LDM 
>queue load
>Organization: NOAA/SPC
>Keywords: 200111022252.fA2MqI102912 LDM performance

Hi Gregg,

> Can you tell me if there has been any work done on trying to predict the
> load characterization on LDM, in particular the queue, in relation to
> the type of workstation, disk system or systems being written to, and
> the volume of products being received into the queue?  If so, I'd
> appreciate gleaning any of this information.

Well, we've done some testing as part of developing the product queue
algorithms, to make sure all the product insertion, deletion, and
region management algorithms run in O(log(n)) time, where n is the
number of products currently in the queue.  And we've tested queue
insertion rates in the steady state (with a full product queue) with
small products and a realistic mix of product sizes.  The results of
this testing are in the IIPS paper from last year's AMS meeting, which
you can also see at

  http://www.unidata.ucar.edu/staff/russ/papers/ldm51.fm5.pdf

or

  http://www.unidata.ucar.edu/staff/russ/papers/ldm51.html

in PDF or HTML form.
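
For a rough sense of what the O(log(n)) bound buys, here is a purely
illustrative comparison (not a measurement) of the per-operation work
under linear versus logarithmic scaling, with log base 2 rounded:

  n (products in queue)    O(n) scan     O(log n) steps
                  1,000        1,000        ~10
                100,000      100,000        ~17
             10,000,000   10,000,000        ~23

So even a queue holding millions of products needs only a couple of
dozen steps per insertion or deletion, which is why the insertion
cost stays nearly flat as the queue fills.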

We ran these tests over a fast local network on relatively slow Sun
workstations (300 MHz SPARCs), with the product queue memory-mapped
to a file on local disk.  We haven't done extensive load testing on
other platforms, but the IDD community is running the LDM 5.1.2 and
later versions on a wide variety of platforms, and we haven't heard
complaints about the product queue performance.
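
If your LDM release includes the pqmon(1) utility, you can get a
rough picture of queue behavior on your own hardware by sampling it
while data is flowing; the queue path below is just an example, so
substitute your own:

  pqmon -q /usr/local/ldm/data/ldm.pq

It reports things like the number of products in the queue, the free
regions, and the age of the oldest product, which are most useful
for watching trends under load rather than as an absolute benchmark.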

> Briefly, the SPC has two HP J5000 workstations.  Workstation 'A' has an
> HP fiber channel disk system with all LDM data including the LDM queue
> written to the fiber channel disks.  Workstation 'B' has two LVD SCSI
> drives mirrored.  The LDM queue is written to the SCSI drives while all
> data handled via 'FILE' and 'EXEC' in the pqact ends up being written to
> the NetAPP NFS filer.  Both systems have 3 GB of RAM and the local file
> systems use lots of cache and are configured for high performance at the
> sacrifice of data integrity.  On both systems the LDM queue is 750 MB.
> 
> Both systems receive a NOAAPORT feed consisting of GOES-EAST, GOES-WEST
> and the NWSTG channel.  The feedtype for this stream is WMO and NNEXRAD
> for the radar products.  In addition a local Unisys WeatherMAX radar
> server is injecting several national mosaic radar products and about 5
> products from every radar site into the LDM queue as a feedtype of WSI.
> Also, as a backup to the NOAAPORT feed there is an X.25 feed into each
> workstation from an upstream host (i.e. checkov), as a feedtype of
> WMO|SPARE.  The X.25 feed is approximately 56 Kbps.  The PCWS and FSL2
> streams are ACARS and 6-minute profiler data.  The EXP is the MESOWEST.
> The second WSI feed is a backup of the National Mosaic radar products
> from the AWC.
> 
> When we transitioned to these systems about 18 months ago everything
> worked fine.  However, over time I've noticed that workstation 'B', with
> the NetAPP, has more and more pbuf_flush log messages (see below).  So
> far I haven't encountered any 'pipe broken' messages with the pbuf_flush
> log entries.
> 
> Workstation 'B' also receives a significant amount of model data from
> NCEP Headquarters via two T-1s.  Most of the data is in GEMPAK format,
> but META files are created, and BUFR is converted to GEMPAK GRIDS for each
> ETA and NGM cycle.  Thus, the machine is also writing a lot to the
> NetAPP NFS server.
> 
> We will be transitioning to some different hardware over the next 1-2
> months (i.e. a J6000 and faster NetAPP NFS box).  Also, 11 more radar
> products from each NEXRAD site will begin flowing into NOAAPORT, and
> into the LDM queue, later this month.  At 11 products per site this
> equates to over 1500 products per 6 minutes or 15000+ products per
> hour.
> 
> Can you tell me if there is any way to determine ahead of time what type
> of impact on the system and/or LDM one can expect with the addition of
> more data (e.g. 11 products from each site)?  Is there any way to
> characterize what type of load a given set of hardware (e.g.
> workstation, disks, etc.) and data flow into the LDM queue will have on
> a system?

It's difficult to develop any kind of analytical model that's very
accurate, because the memory management of the list of free regions in
the queue is based on algorithms that perform well in practice but
that are not amenable to analytical treatment.  I can tell you that
decoding and filing products to a remote NFS server may be more of a
limitation than the product queue.  The addition of 11 products per
site is about a 150% to 200% increase in the number of radar products,
but that may not be significant considering all the other kinds of
products you are handling.  
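
As a rough check on that figure, and assuming (this is an assumption
on my part, since I don't have the exact current product list in
front of me) something like 6 or 7 radar products per site today,
adding 11 more per site gives

  11 / 7  =  ~157% increase
  11 / 6  =  ~183% increase

which is where the 150% to 200% ballpark comes from.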

Our tests showed that the LDM on a relatively slow Sun could handle
bursts of 300 products per second even with concurrent garbage
collection without adding significant latency, but occasionally there
may be a several-second pause in deleting enough old products out of
the queue and coalescing their memory regions to make space for
storing a new large product.
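
To put the new NEXRAD load against that number: 15,000+ products per
hour averages out to

  15,000 products / 3,600 s  =  ~4.2 products/s

and even if all ~1,540 products in a 6-minute cycle arrived within a
single minute, that burst would be about 26 products/s, still well
under the 300 products/s we measured.  The product counts are from
your message; the bursty-arrival scenario is just an assumed worst
case for illustration.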

> Do any of the top IDD nodes (e.g. motherlode, I believe UofW, Unisys?)
> inject all three channels into the LDM queue?  Do they see similar
> pbuf_flush statements in ldmd.log?

Yes, we have several LDM systems handling all channels of NOAAPORT,
including motherlode.  motherlode is currently getting pbuf_flush
messages occasionally:

 Nov 05 03:46:42 motherlode.ucar.edu pqact[2898]: pbuf_flush 20: time elapsed   2.339703
 Nov 05 04:00:08 motherlode.ucar.edu pqact[2898]: pbuf_flush 20: time elapsed   2.547173
 Nov 05 04:00:14 motherlode.ucar.edu pqact[2898]: pbuf_flush 20: time elapsed   2.593650
 Nov 05 04:00:22 motherlode.ucar.edu pqact[2898]: pbuf_flush 20: time elapsed   8.302111
 Nov 05 04:56:57 motherlode.ucar.edu pqact[2898]: pbuf_flush 20: time elapsed   2.012492
 Nov 05 09:42:22 motherlode.ucar.edu pqact[2898]: pbuf_flush 20: time elapsed   2.024029
 Nov 05 10:35:35 motherlode.ucar.edu pqact[2898]: pbuf_flush 20: time elapsed   2.325135
 Nov 05 10:59:58 motherlode.ucar.edu pqact[2898]: pbuf_flush 20: time elapsed   2.279881
 Nov 05 11:00:01 motherlode.ucar.edu pqact[2898]: pbuf_flush 20: time elapsed   2.435090
 Nov 05 11:00:09 motherlode.ucar.edu pqact[2898]: pbuf_flush 20: time elapsed   2.105940
 Nov 05 13:35:45 motherlode.ucar.edu pqact[2898]: pbuf_flush 20: time elapsed   2.265940
 Nov 05 15:58:37 motherlode.ucar.edu pqact[2898]: pbuf_flush 20: time elapsed   2.013965
 Nov 05 16:35:07 motherlode.ucar.edu pqact[2898]: pbuf_flush 20: time elapsed   2.171625
 Nov 05 16:35:37 motherlode.ucar.edu pqact[2898]: pbuf_flush 20: time elapsed   2.060463
 Nov 05 18:35:07 motherlode.ucar.edu pqact[2898]: pbuf_flush 20: time elapsed   2.135467
 Nov 05 22:38:17 motherlode.ucar.edu pqact[2898]: pbuf_flush 20: time elapsed   2.158068
 Nov 06 04:00:13 motherlode.ucar.edu pqact[2898]: pbuf_flush 20: time elapsed   2.359129
 Nov 06 04:00:28 motherlode.ucar.edu pqact[2898]: pbuf_flush 20: time elapsed   7.052843
 Nov 06 05:45:12 motherlode.ucar.edu pqact[4596]: pbuf_flush 8: time elapsed   2.150907
 Nov 06 07:35:43 motherlode.ucar.edu pqact[5765]: pbuf_flush 4: time elapsed   2.081593
 Nov 06 10:35:22 motherlode.ucar.edu pqact[5765]: pbuf_flush 4: time elapsed   2.801245
 Nov 06 13:35:31 motherlode.ucar.edu pqact[5765]: pbuf_flush 4: time elapsed   3.379242

> Any insight or comments on these areas, and on how some of the
> top tier sites are handling all of the data, is appreciated.
> 
> Thanks,
> Gregg Grosshans
> Storm Prediction Center

We're also wondering about the effect of the extra NEXRAD products.
The only thing I can suggest is to simulate the extra load by using
something like "pqsend" to send a bunch of extra products to a test
LDM and see whether that causes load problems (a rough sketch of
what I mean is below).  Currently, we're just crossing our fingers.
If motherlode has problems handling the increased load, we may have
to change its configuration so that it doesn't feed so many sites,
or set up a separate LDM on a machine that doesn't do all the
decoding and filing ...
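
Something along these lines, where the hostname and queue paths are
made up for the example and the queue size is just your current
750 MB (check the pqsend(1) and pqcreate(1) man pages for the exact
options in your release):

  # On the test machine: create a product queue and start the LDM.
  pqcreate -s 750M -q /usr/local/ldm/data/ldm.pq
  ldmadmin start

  # On a machine whose queue already holds real products, send them
  # to the test LDM.  Running two or three copies of pqsend at once
  # is a crude way to multiply the load.
  pqsend -v -h test-host.spc.noaa.gov -q /usr/local/ldm/data/ldm.pq

The test LDM's ldmd.conf needs an "allow" entry for the sending
host, and to make the test realistic the test machine should run
the same pqact actions (decoding, filing to the NetAPP) that the
production machine runs.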

I'm CC:ing Anne Wilson, in case she has any more insights.

--Russ

_____________________________________________________________________

Russ Rew                                         UCAR Unidata Program
address@hidden                     http://www.unidata.ucar.edu