
[LDM #WEM-615262]: LDM write warnings



Hi Stacey,

re:
> We are ingesting Level-II
> radar data and seem to be running into some type of LDM issue.  We think it
> might be a resource issue, but we cannot seem to find what it could be.  We
> are ingesting all the Level-II radar data on one machine.  That machine
> starts a separate process for each radar site.  That process reads data
> from STDIN and, once enough data has been received, it processes the data into
> images and places those images on the LDM queue.  It then returns to
> reading from STDIN.  With this setup, we are getting a lot of these types
> of warnings:
> 
> pqact[5517] WARN: write(13,,4096) to decoder took 9 s: 
> /wxhub/decoders/l2decode/l2decodev2KDLH
> pqact[5517] WARN: write(7,,4096) to decoder took 8 s: 
> /wxhub/decoders/l2decode/l2decodev2KEOX
> pqact[5517] WARN: write(18,,4096) to decoder took 9 s: 
> /wxhub/decoders/l2decode/l2decodev2KLCH

These are informational warnings that a write to one of your decoders
took longer than expected to complete, i.e., the decoder was slow to
read from its end of the pipe.  The messages are not necessarily an
indication of a problem that needs addressing, especially since the
times shown are modest.  If the times were large, that would be an
indication that the situation should be investigated further.

re:
> We have our LDM queue set to 2GB and that queue is located in RAM.  We
> restart the LDM and a few minutes later we start receiving the messages
> above.  After about 30 minutes of processing data, we start getting these
> messages:
> 
> ulog DEBUG: Deleting oldest to make space 97024 bytes
> ulog DEBUG: Deleting oldest to make space 27616 bytes
> ulog DEBUG: Deleting oldest to make space 167104 bytes

These debug messages are informing you that the LDM queue routines are
doing what they are designed to do, which is to delete the oldest
products in the queue to make space for new ones that are being
received.  If the products being deleted have already been processed,
there is no problem; if they have not yet been processed, there is.
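One way to keep an eye on this is with the LDM 'pqmon' utility, which
reports statistics for the product queue, including the age of the
oldest product it holds (the exact output varies a bit between LDM
versions, so treat this as a sketch):

  pqmon                      # uses the default queue
  pqmon -q /path/to/ldm.pq   # if your queue is not in the default location

If the age of the oldest product is routinely shorter than the time
your decoders need to work through a product, products may be getting
scrubbed from the queue before pqact has finished handling them.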

re:
> So it seems we are losing data, but we cannot find out why.

Neither of the things you listed above indicates that products are
being lost.  Have you checked to see whether you are, in fact, losing
products?  This could be done, for instance, by making an inventory of
the products that were received and processed and comparing it against
an inventory of the products that your upstream feed site received and
sent you.  The LDM utility 'notifyme' can be of great help in this kind
of investigation.
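For example, something along these lines (substitute your upstream host
for 'upstream.host'; the '^L2-' pattern matches the product IDs shown in
your pattern-action file below) lists the NEXRAD2 products the upstream
host has had in its queue over the past hour:

  notifyme -vl- -f NEXRAD2 -o 3600 -p "^L2-" -h upstream.host

Running the same command with '-h localhost' lists what actually made it
into your local queue; comparing the two listings shows whether anything
went missing on the way.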

re:
> The machine we
> are running this on is extremely fast and all it does is process the
> Level-II data.  It has two 16 core CPUs clocked at 2.6Ghz.  So we have 32
> cores in this thing and 32 Gigs of memory. We are not writing anything out
> to disk, we are building images from the data and placing those images back
> on the LDM queue.

OK.  Presumably those images are then being sent to other machines?

Question:

- how do you have the actions structured in your LDM pattern-action file(s)?

  I ask because if all of the processing actions are in a single
  pattern-action file, then you may have a processing bottleneck.  Each
  action in a pattern-action file is checked against every product,
  regardless of whether an earlier action in the file already matched
  and was executed.  If your pattern-action file has a LOT of actions,
  it may take a "long" time to work through all of the actions for a
  product before the next product can be acted on.
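
  For illustration only (the file names and the site split are made up),
  that per-product matching cost can be spread across several pqact
  processes by running more than one pqact from ldmd.conf, each with its
  own pattern-action file and a '-p' pattern selecting a disjoint subset
  of the radars:

    exec    "pqact -f NEXRAD2 -p ^L2-[^/]*/K[A-F] etc/pqact.craft1"
    exec    "pqact -f NEXRAD2 -p ^L2-[^/]*/K[G-M] etc/pqact.craft2"
    exec    "pqact -f NEXRAD2 -p ^L2-[^/]*/K[N-S] etc/pqact.craft3"
    exec    "pqact -f NEXRAD2 -p ^L2-[^/]*/K[T-Z] etc/pqact.craft4"
    exec    "pqact -f NEXRAD2 -p ^L2-[^/]*/[^K]   etc/pqact.craft5"

  Each pqact.craftN file would contain the same CRAFT action you quote
  further down in this message.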

re:
> Below is a top command showing the computer's state at
> the time we are receiving the WARN and DEBUG messages.
> 
> top - 13:31:37 up 1 day, 17:30,  4 users,  load average: 2.74, 2.44, 2.40
> Tasks: 426 total,   3 running, 423 sleeping,   0 stopped,   0 zombie
> Cpu(s):  6.7% us,  0.1% sy,  0.0% ni, 93.2% id,  0.0% wa,  0.0% hi,  0.0% si
> Mem:  33250348k total,  2718984k used, 30531364k free,   159560k buffers
> Swap: 116177060k total,        0k used, 116177060k free,  2043820k cached

Nothing looks out of line here.

re:
> Here are some specs on what we are running:
> 
> LDM version: 6.8.1
> CPU: 32 cores @ 2.6Ghz
> RAM: 32 Gigs
> OS: Custom version of Debian Linux

This looks like a very capable machine.  I cannot comment on whether a
custom version of Debian Linux would cause problems, but I doubt that
it would.

re:
> We were wondering if maybe there is some buffer limit for STDIN in Linux
> that is getting reached. Each process is reading data from STDIN.

Yes, there are buffer limits on *nix pipes; on Linux the pipe buffer is
typically on the order of 64 KB.  If your decoder process reads from its
standard input quickly enough, however, the buffer limit should not be a
problem.  If you are convinced that you are, in fact, losing data, you
may want to change your decoding strategy: write the products to disk
and run your decoders on those disk files directly.  This would
eliminate any bottleneck that may be encountered in the way you are
currently handling the data.

re:
> Then it
> goes off and builds some images, which could take 5-10 seconds to complete.
> It then comes back and begins reading from STDIN again.  Would this cause
> STDIN to get backed up while the decoder is building the images?

Yes, most certainly.  While your decoder is busy building images and not
reading, pqact's writes to the decoder's standard input will block as
soon as the pipe buffer fills; that is exactly what the
"write ... took N s" warnings above are reporting.
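
To make that concrete, here is a minimal sketch (in Python, purely
illustrative and not your decoder; the 'build_images' step and the size
threshold are invented) of one way to keep draining STDIN while the slow
image-building work runs, so the pipe never fills:

  import sys
  import threading
  import queue

  chunks = queue.Queue()            # in-memory buffer of raw input

  def drain_stdin():
      # Continuously copy STDIN into the queue so the pipe never backs up.
      while True:
          data = sys.stdin.buffer.read1(65536)   # returns as soon as data arrives
          if not data:                            # b'' means EOF: pqact closed the pipe
              chunks.put(None)
              return
          chunks.put(data)

  def build_images(raw):
      # Placeholder for the 5-10 second image-generation step.
      pass

  threading.Thread(target=drain_stdin, daemon=True).start()

  buffered = bytearray()
  while True:
      piece = chunks.get()
      if piece is None:
          break
      buffered.extend(piece)
      if len(buffered) > 16 * 1024 * 1024:     # "enough data" (invented threshold)
          build_images(bytes(buffered))        # slow step; reader thread keeps going
          buffered.clear()
  if buffered:
      build_images(bytes(buffered))            # handle whatever is left at EOF

The same idea can, of course, be implemented in whatever language
l2decodev2 is written in; the point is only that something must keep
reading STDIN while the images are being built.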

re:
> Could this be our bottle neck?

It could be, yes.

re:
> Below are our ldmd.conf and pqact.conf lines for this data:
> 
> ldmd.conf:
> request NEXRAD2 ".*"    64.147.208.204
> 
> pqact.conf:
> CRAFT
> ^L2-([^/]*)/(.*)/([0-9][0-9][0-9][0-9][0-1][0-9][0-3][0-9][0-2][0-9][0-5][0-9][0-9][0-9])/[0-9]+/[0-9]+/[IES]/V0[123467]/0$
> PIPE    /wxhub/decoders/l2decode/l2decodev2 \2

Thanks for including these; they help us understand how you are
processing the data.

re:
> Any suggestions you could give would be greatly appreciated.

Two easy things to try:

- create multiple pattern-action files, each of which processes a
  mutually-exclusive subset of the data being received.  This would be
  as simple as copying the pattern-action file to, say, 4 other
  pattern-action files (named differently, of course) and then changing
  your single ldmd.conf EXEC line into 5 EXEC lines, each of which
  processes 20% of the products (along the lines of the ldmd.conf
  sketch earlier in this message).

  This 5-way splitting of the processing would lessen the time spent
  waiting before the next NEXRAD product could be acted on.

- FILE the NEXRAD products to disk and change your decoding actions to
  read from the disk files directly (see the pattern-action sketch after
  this list).

  This would ensure that the products received are available for
  processing.  The tricky part of this, and of all Level II decoding, is
  knowing when the last piece of a volume scan has been received.  There
  is a product that indicates that it is the last part of a volume scan,
  but there is no guarantee that it is received last.  Our approach to
  processing Level II data is to write the pieces to disk in their own
  subdirectory and then kick off a process that reassembles the pieces
  into a full volume scan.  That process is responsible for determining
  whether all pieces have been received; it will sleep for a bit and
  look again if it "thinks" that there are pieces from the original
  volume scan that have not arrived yet.  This approach works nicely,
  but it does delay the availability of the data a bit (though not by
  much).
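
As a sketch only (the file layout and the 'l2assemble' helper are
hypothetical, and a real pqact.conf entry must separate its fields with
tabs), that approach might look something like this:

  # File every piece of a volume scan under a per-site, per-volume
  # directory; \2 is the site, \3 the volume time, \4/\5/\6 the piece
  # identifiers captured from the product ID.
  CRAFT   ^L2-([^/]*)/(.*)/([0-9][0-9][0-9][0-9][0-1][0-9][0-3][0-9][0-2][0-9][0-5][0-9][0-9][0-9])/([0-9]+)/([0-9]+)/([IES])/V0[123467]/0$
          FILE    -close  /wxhub/data/level2/\2/\3/\4_\5_\6

  # When the piece flagged as the end of the volume ('E') arrives, kick
  # off the reassembly process, which sleeps and re-checks if pieces are
  # still missing.
  CRAFT   ^L2-([^/]*)/(.*)/([0-9][0-9][0-9][0-9][0-1][0-9][0-3][0-9][0-2][0-9][0-5][0-9][0-9][0-9])/([0-9]+)/([0-9]+)/E/V0[123467]/0$
          EXEC    /wxhub/decoders/l2assemble \2 \3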

re:
> Thank you,

No worries.

Cheers,

Tom
--
****************************************************************************
Unidata User Support                                    UCAR Unidata Program
(303) 497-8642                                                 P.O. Box 3000
address@hidden                                   Boulder, CO 80307
----------------------------------------------------------------------------
Unidata HomePage                       http://www.unidata.ucar.edu
****************************************************************************


Ticket Details
===================
Ticket ID: WEM-615262
Department: Support LDM
Priority: Normal
Status: Closed