
Re: decoder not working on one LDM machine but it does on another LDM and Re: pbuf_flush



Greg,

After much discussion, I believe the problem is that the K220 computer is
too slow to support the dcmetr decoder with the extra FSL5 feed.  That's
why you are seeing all the pbuf_flush entries in the log files.  What we
believe is happening is that the dcmetr decoder PIPE is getting filled, so
the LDM is spawning another dcmetr decoder.  Since both decoders are
writing to the same file, one gets a signal 11 and produces a corrupt
output file.  You should check the GEMPAK log file for multiple starts of
the dcmetr decoder without corresponding shutdown messages.  Also, as you
said, when you use just the NOAAport stream dcmetr works OK, i.e. it can
keep up.
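
A quick way to check for overlapping decoder invocations is to count the
start and termination messages the decoder writes (they show up in the LDM
log as the DCMETR "Starting Up" / "Normal Termination" lines in your earlier
excerpt).  Just a sketch, adjust the log names to match your setup:

    grep -c 'DCMETR.*Starting Up'         ldmd.log ldmd.log.1
    grep -c 'DCMETR.*Normal Termination'  ldmd.log ldmd.log.1

If the number of starts runs ahead of the number of normal terminations,
there were times when two dcmetr processes had the same surface file open.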

Robb...

On Thu, 6 Apr 2000, Gregory Grosshans wrote:

> I recently added a NOAAPORT feed similar to the DDS/PPS feed (i.e. request
> FSL5 (^[A-FK-X]) ) to a machine running LDM (5.0.9) already receiving the
> following feedtypes (AFOS, DDS, PPS).  Once the NOAAPORT feed started coming
> in, the metar decoder (i.e. dcmetr) began terminating frequently from signal
> 11, resulting in a corrupted GEMPAK surface observation file.  This
> particular machine is a K220 (2 100 MHz CPUs with 1 GB RAM).
> 
> On a K370 (3 200 MHz CPUs and faster, with 1.2 GB RAM) the same LDM (5.0.8)
> setup exists, the resulting surface file is fine, and dcmetr doesn't
> terminate.
> 
> Do you have any ideas on why a problem exists on one machine but not on the
> other?  Have you seen anything like this before?  I've provided additional
> information about the two configurations and the troubleshooting steps I've
> already taken below.  The NOAAPORT ingest runs on a second K210 server with
> LDM (5.0.8) running and the FSL acqserver code (modified by SPC/AWC to make
> the data look like it's from the FOS).  This second K210 then feeds NOAAPORT
> data (via the FSL5 feedtype) to the K210 and K370.
> 
> Thanks,
> Gregg
> 
> Some specifics follow:
> 
> On the K370 (everything works fine) the configuration is as follows:
> 
> ---cut from ldmd.conf---
> exec    "pqexpire -a .5"
> request AFOS    .*      hostname-removed
> request DDS     .*      hostname-removed
> request PPS     .*      hostname-removed
> request FSL5    (^[A-FK-X])     hostname-removed   (this is the NOAAPORT stream received in-house)
> request PCWS    "^FSL\.NetCDF\.ACARS\.QC\..*"   hostname-removed
> ---
> $pq_size = 750000000;
> 
> The K370 uses dcuair and dcmetr decoders, and the FEEDTYPE for each decoder
> is FSL5|DDS.  This allows data to be received as soon as possible and
> provides redundancy (in case the FSL5 feed goes down).
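> 
> For reference, the pqact.conf entry for dcmetr is of roughly this shape (the
> pattern, paths, and dcmetr options below are simplified/illustrative, not a
> copy of the real file):
> 
> ---illustrative pqact.conf entry---
> FSL5|DDS	^S[AP]	PIPE	decoders/dcmetr
> 	-d data/gempak/logs/dcmetr.log
> 	data/gempak/surface/YYYYMMDD_sao.gem
> ---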
> 
> On the K210 (everything except the metar GEMPAK file works fine) the
> configuration is as follows:
> 
> ---cut from the ldmd.conf---
> exec    "pqexpire -a .23333"
> request AFOS    .*      hostname-removed
> request DDS     .*      hostname-removed
> request PPS     .*      hostname-removed
> #request FSL5   (^[A-FK-X])     hostname-removed
> # UA data and AVN MOS
> #request FSL5   (^U[EGHKLPQSZ]|^FE)             hostname-removed
> # SFC/UA data and AVN MOS
> request FSL5    (^U[EGHKLPQSZ]|^FE|^S[AP])              hostname-removed
> ---
> #$pq_size = 750000000;
> $pq_size = 200000000;
> 
> The K210 uses dcuair and dcmetr decoders.  The dcuair output is fine.  The
> dcmetr output becomes corrupted.
> 
> Before adding the entire NOAAPORT feed to the K210, it was receiving the UA
> and AVN MOS data from NOAAPORT just fine.  The LDM at this time was 5.0.8.
> After adding the entire NOAAPORT stream (request ^[A-FK-X]), the dcmetr
> output became corrupted.  I then copied the dcmetr decoder from the K370 to
> the K210, with the GEMPAK surface file still becoming corrupted.  Next, I
> upgraded to LDM-5.0.9 on the K210, but still no success.  Then I decreased
> the request statement to just upperair, metar and AVN MOS data.  However,
> the surface file was still becoming corrupted.  The most recent attempt was
> to lower the queue size from 750 MB to 200 MB (since I'm not ingesting all
> the NOAAPORT data), but the surface file still becomes corrupted.
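> 
> One more thing I can check is whether the box is simply CPU- or disk-bound
> when the hourly METAR bursts arrive.  Something along these lines (assuming
> sar is available on this HP-UX box; vmstat/iostat would show much the same):
> 
>     sar -u 10 60     # CPU utilization, 10-second samples for 10 minutes
>     sar -d 10 60     # disk activity over the same window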
> 
> PBUF messages are still received sporadically, usually soon after starting
> the LDM up:
> 
> ldmspock@spock> ll ldmd.*
> -rw-r--r--   1 ldmspock   data       8522013 Apr  6 13:56 ldmd.log
> -rw-r--r--   1 ldmspock   data       5645263 Apr  6 00:00 ldmd.log.1
> -rw-r--r--   1 ldmspock   data       1262077 Apr  5 15:19 ldmd.log.2
> -rw-r--r--   1 ldmspock   data       8324510 Apr  5 13:19 ldmd.log.3
> -rw-r--r--   1 ldmspock   data       2973459 Apr  5 00:00 ldmd.log.4
> -rw-r--r--   1 ldmspock   data       5841177 Apr  4 19:03 ldmd.log.5
> -rw-r--r--   1 ldmspock   data          3835 Apr  4 16:25 ldmd.log.6
> -rw-r--r--   1 ldmspock   data       26420210 Apr  4 16:23 ldmd.log.7
> ldmspock@spock> grep pbuf ldmd.log ldmd.log.1
> ldmd.log:Apr 06 06:10:12 spock pqact[17920]: pbuf_flush 35: time elapsed 3.339170
> ldmd.log:Apr 06 06:10:23 spock pqact[17920]: pbuf_flush 35: time elapsed 6.926433
> ldmd.log:Apr 06 06:10:25 spock pqact[17920]: pbuf_flush 35: time elapsed 2.288729
> ldmd.log.1:Apr 05 15:25:05 spock pqact[17920]: pbuf_flush 35: time elapsed 4.764068
> ldmd.log.1:Apr 05 15:25:14 spock pqact[17920]: pbuf_flush 35: time elapsed 5.619390
> ldmd.log.1:Apr 05 15:25:20 spock pqact[17920]: pbuf_flush 35: time elapsed 3.642364
> ldmd.log.1:Apr 05 15:25:35 spock pqact[17920]: pbuf_flush 35: time elapsed 11.463704
> ldmd.log.1:Apr 05 15:25:46 spock pqact[17920]: pbuf_flush 35: time elapsed 7.433896
> ldmd.log.1:Apr 05 15:25:53 spock pqact[17920]: pbuf_flush 35: time elapsed 5.602335
> ldmd.log.1:Apr 05 15:26:12 spock pqact[17920]: pbuf_flush 35: time elapsed 6.995401
> ldmd.log.1:Apr 05 15:26:17 spock pqact[17920]: pbuf_flush 35: time elapsed 2.613388
> ldmd.log.1:Apr 05 15:26:22 spock pqact[17920]: pbuf_flush 35: time elapsed 3.780494
> ldmd.log.1:Apr 05 15:26:34 spock pqact[17920]: pbuf_flush 35: time elapsed 2.924638
> ldmd.log.1:Apr 05 15:26:44 spock pqact[17920]: pbuf_flush 35: time elapsed 2.035412
> ldmd.log.1:Apr 05 15:30:16 spock pqact[17920]: pbuf_flush 35: time elapsed 3.280656
> ldmd.log.1:Apr 05 21:10:08 spock pqact[17920]: pbuf_flush 35: time elapsed 5.369202
> ldmd.log.1:Apr 05 23:10:09 spock pqact[17920]: pbuf_flush 35: time elapsed 2.265447
> 
> The DCMETR processes killed follow.  The processes killed during 15Z in
> ldmd.log.1 were another decoder receiving test products.  As you can see,
> there haven't been many child processes killed today, but the file is
> corrupted.  Performing an sflist on the surface obs file resulted in a
> segmentation fault while the 1Z output was displaying.
> grep child ldmd.log ldmd.log.1
> ldmd.log:Apr 06 04:07:27 spock pqact[17920]: child 5137 terminated by signal 11
> ldmd.log.1:Apr 05 15:28:49 spock pqact[17920]: child 20923 terminated by signal 11
> ldmd.log.1:Apr 05 15:28:50 spock pqact[17920]: child 20938 terminated by signal 11
> ldmd.log.1:Apr 05 15:28:57 spock pqact[17920]: child 21095 terminated by signal 11
> ldmd.log.1:Apr 05 15:31:52 spock pqact[17920]: child 24263 terminated by signal 11
> ldmd.log.1:Apr 05 15:31:54 spock pqact[17920]: child 24292 terminated by signal 11
> ldmd.log.1:Apr 05 15:33:17 spock pqact[17920]: child 25397 terminated by signal 11
> ldmd.log.1:Apr 05 16:07:30 spock pqact[17920]: child 5674 terminated by signal 11
> ldmd.log.1:Apr 05 16:13:49 spock pqact[17920]: child 15384 terminated by signal 11
> ldmd.log.1:Apr 05 16:31:30 spock pqact[17920]: child 24463 terminated by signal 11
> ldmd.log.1:Apr 05 16:37:15 spock pqact[17920]: child 27998 terminated by signal 11
> ldmd.log.1:Apr 05 16:57:56 spock pqact[17920]: child 9436 terminated by signal 11
> ldmd.log.1:Apr 05 17:02:03 spock pqact[17920]: child 10868 terminated by signal 11
> ldmd.log.1:Apr 05 17:17:29 spock pqact[17920]: child 20707 terminated by signal 11
> ldmd.log.1:Apr 05 17:20:54 spock pqact[17920]: child 20898 terminated by signal 11
> ldmd.log.1:Apr 05 17:21:53 spock pqact[17920]: child 22583 terminated by signal 11
> ldmd.log.1:Apr 05 17:23:35 spock pqact[17920]: child 22924 terminated by signal 11
> ldmd.log.1:Apr 05 17:27:07 spock pqact[17920]: child 23765 terminated by signal 11
> ldmd.log.1:Apr 05 17:30:04 spock pqact[17920]: child 25231 terminated by signal 11
> ldmd.log.1:Apr 05 17:41:46 spock pqact[17920]: child 741 terminated by signal 11
> ldmd.log.1:Apr 05 17:46:38 spock pqact[17920]: child 1216 terminated by signal 11
> ldmd.log.1:Apr 05 17:50:08 spock pqact[17920]: child 3376 terminated by signal 11
> ldmd.log.1:Apr 05 18:01:35 spock pqact[17920]: child 8924 terminated by signal 11
> ldmd.log.1:Apr 05 18:01:55 spock pqact[17920]: child 9569 terminated by signal 11
> ldmd.log.1:Apr 05 18:02:08 spock pqact[17920]: child 9736 terminated by signal 11
> ldmd.log.1:Apr 05 18:04:20 spock pqact[17920]: child 9786 terminated by signal 11
> ldmd.log.1:Apr 05 18:05:06 spock pqact[17920]: child 10241 terminated by signal 11
> ldmd.log.1:Apr 05 18:08:47 spock pqact[17920]: child 15683 terminated by signal 11
> ldmd.log.1:Apr 05 18:14:30 spock pqact[17920]: child 16273 terminated by signal 11
> ldmd.log.1:Apr 05 18:15:40 spock pqact[17920]: child 17962 terminated by signal 11
> ldmd.log.1:Apr 05 18:15:44 spock pqact[17920]: child 18700 terminated by signal 11
> ldmd.log.1:Apr 05 18:16:07 spock pqact[17920]: child 18714 terminated by signal 11
> ldmd.log.1:Apr 05 18:16:08 spock pqact[17920]: child 18899 terminated by signal 11
> ldmd.log.1:Apr 05 18:26:21 spock pqact[17920]: child 21652 terminated by signal 11
> ldmd.log.1:Apr 05 18:26:41 spock pqact[17920]: child 23263 terminated by signal 11
> ldmd.log.1:Apr 05 18:46:58 spock pqact[17920]: child 28329 terminated by signal 11
> ldmd.log.1:Apr 05 18:57:50 spock pqact[17920]: child 6435 terminated by signal 11
> ldmd.log.1:Apr 05 18:58:56 spock pqact[17920]: child 6554 terminated by signal 11
> ldmd.log.1:Apr 05 19:02:36 spock pqact[17920]: child 7837 terminated by signal 11
> ldmd.log.1:Apr 05 19:03:07 spock pqact[17920]: child 8165 terminated by signal 11
> ldmd.log.1:Apr 05 19:03:50 spock pqact[17920]: child 8377 terminated by signal 11
> ldmd.log.1:Apr 05 19:08:03 spock pqact[17920]: child 13935 terminated by signal 11
> ldmd.log.1:Apr 05 19:09:06 spock pqact[17920]: child 14026 terminated by signal 11
> 
> 
> 
> 
> 
> Robb Kambic wrote:
> 
> > On Thu, 30 Mar 2000, Gregory Grosshans wrote:
> >
> > > Can someone explain what a 'pbuf_flush 15' entry in the ldmd.log file
> > > means?
> > >
> > > This LDM ingests the Gateway channel from NOAAPORT (as FSL5) and PPS, DDS
> > > (note the request pattern below).  The LDM queue is 750 MB.
> >
> > Greg,
> >
> > The pbuf_flush 15 means that the ACTION, i.e. ( FILE, STDIO, PIPE, etc. ),
> > could not write to the destination because the resource was not available
> > at that time or it was blocked.  The log entry usually occurs when disk
> > activity is at its peak for some reason.  It could be the LDM doing many
> > disk writes or another process tying up the disk.
> >
> > time elapsed 5.690810
> >
> > means that it took 5.6 sec to complete the activity.
> >
> > If this occurs a bunch of times, one could start to lose data.  Then it's
> > time to determine the reason for the pbuf_flush entries.
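> >
> > A quick way to see how often this is happening, and how much time it is
> > costing, is to total the elapsed values (each pbuf_flush line ends with
> > the elapsed seconds), e.g.:
> >
> >     grep pbuf_flush ldmd.log | awk '{ n++; s += $NF } END { printf "%d flushes, %.1f sec total\n", n, s }'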
> >
> > Robb...
> > >
> > > Thanks,
> > > Gregg Grosshans
> > > SPC
> > >
> > > ---information from ldmd.conf---------
> > > pqexpire -a .23333
> > > request FSL5    (^[A-FK-X])     (hostname was here)
> > >
> > > ---information from ldmadmin---------
> > > $pq_size = 750000000;
> > >
> > >
> > > ------portion of ldmd.log-------------
> > >
> > > Mar 30 15:07:24 spock pqact[14333]: pbuf_flush 15: time elapsed 5.690810
> > > Mar 30 15:07:24 spock DCMETR[10991]: Normal Termination
> > > Mar 30 15:07:24 spock DCMETR[11773]: Starting Up
> > > Mar 30 15:07:31 spock pqact[14333]: pbuf_flush 15: time elapsed 6.166595
> > >
> > >
> > >
> > >
> > >
> >
> > ===============================================================================
> > Robb Kambic                                Unidata Program Center
> > Software Engineer III                      Univ. Corp for Atmospheric Research
> > address@hidden                   WWW: http://www.unidata.ucar.edu/
> > ===============================================================================
> 

===============================================================================
Robb Kambic                                Unidata Program Center
Software Engineer III                      Univ. Corp for Atmospheric Research
address@hidden             WWW: http://www.unidata.ucar.edu/
===============================================================================