[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: "pbuf_flush: time elapsed" problem



Justin,

>Date: Thu, 06 Oct 2005 09:23:50 -0400
>From: Justin Cooke <address@hidden>
>Organization: NOAA
>To: Steve Emmerson <address@hidden>
>Subject: Re: "pbuf_flush: time elapsed" problem

The above message contained the following:

> We completed the installation of LDM 6.4.2.4 yesterday (~15Z)  and 
> things were running well until 01:08Z today, that's when we again 
> stopped receiving NEXRAD2 data.  I did a restart of LDM once I got in 
> this morning and NEXRAD2 is coming in. 
> 
> The TOC Level II radar cluster feeds us the NEXRAD2 data (IP: 
> 140.90.85.102).
> 
> I have attached the log files from that system; they only seem to have 
> the standard logging enabled and an LDM restart is in their cron for 
> twice a day (this has to do with their cluster availability).
>
> I didn't see anything around the 01:08Z (10/06/05) time that would 
> indicate why we stopped receiving NEXRAD2 data.

Did the outage correspond to a restart of the LDM on 140.90.85.102?

If you send USR2 signals to the downstream LDM process that's receiving
the NEXRAD2 data-products from 140.90.85.102, then you can put the
process into debug mode and see where it's spending its time.  Just
don't leave it in that mode!  :-)

> Here is a snippet from our ldmd.log at the time we stopped receiving data:
> ---
> Oct 06 01:08:25 b2n1 140.90.85.102[1437928] INFO:    10539 
> 20051006010822.446 NEXRAD2 386060  L2-BZIP2/KMPX/20051006010419/386/60
> Oct 06 01:08:25 b2n1 pqact[815190] INFO:    10539 20051006010822.446 
> NEXRAD2 386060  L2-BZIP2/KMPX/20051006010419/386/60

ASIDE: The above entries indicate that a data-product that has the time
2005-10-06 01:04:19 encoded in it's product-identifier was created at
2005-10-06 01:08:22.446 -- a difference of over 4 minutes.  I hope this is
OK.

> Oct 06 01:08:25 b2n1 pqact[815190] INFO:                file: -close    
> /dcomdev/us007003/ldmdata/obs/upperair/nexrad_level2/KMPX/KMPX_20051006_010419.bz2
> Oct 06 01:08:25 b2n1 pqact[889074] INFO:     3666 20051006010825.900 
> FNEXRAD 555  UJXX82 KWBC 060104
> Oct 06 01:08:25 b2n1 pqact[889074] INFO:                file: 
> /dcomdev/us007003/ldmdata/obs/upperair/sonde/20051006.sonde
> Oct 06 01:08:25 b2n1 pqact[889074] INFO:                pipe: 
> bin/decod_dcusnd  -v 2  -t 600  -d 
> /dcomdev/us007003/decoder_logs/decod_dcusnd.log        
> tables/stns/sonde.land.tbl      tables/stns/sonde.ship.tbl      
> /dcomdev/us007003/bufrtab.002
> Oct 06 01:08:25 b2n1 pqact[889074] INFO:       84 20051006010825.900 
> FNEXRAD 556  UANT01 KWBC 060105
> Oct 06 01:08:25 b2n1 pqact[889074] INFO:                file: 
> /dcomdev/us007003/ldmdata/obs/upperair/aircraft/20051006.aircraft
> Oct 06 01:08:25 b2n1 pqact[889074] INFO:                pipe: 
> bin/decod_dcacft  -v 2  -t 300  -d 
> /dcomdev/us007003/decoder_logs/decod_dcacft.log        
> tables/stns/pirep.tbl   tables/stns/airep.tbl   
> /dcomdev/us007003/bufrtab.004
> Oct 06 01:08:25 b2n1 pqact[889074] INFO:      194 20051006010825.900 
> FNEXRAD 557  UPPK01 OPKC 060000
> Oct 06 01:08:25 b2n1 pqact[889074] INFO:                file: 
> /dcomdev/us007003/ldmdata/obs/upperair/sonde/20051006.sonde
> Oct 06 01:08:25 b2n1 pqact[889074] INFO:                pipe: 
> bin/decod_dcusnd  -v 2  -t 600  -d 
> /dcomdev/us007003/decoder_logs/decod_dcusnd.log        
> tables/stns/sonde.land.tbl      tables/stns/sonde.ship.tbl      
> /dcomdev/us007003/bufrtab.002
> Oct 06 01:08:25 b2n1 pqact[889074] INFO:      144 20051006010825.900 
> FNEXRAD 558  SAGR33 LGAT 060100
> Oct 06 01:08:25 b2n1 140.90.85.102[1437928] INFO:    22130 
> 20051006010822.509 NEXRAD2 559026  L2-BZIP2/KLWX/20051006005943/559/26

ASIDE: The last entry indicates that a data-product that has the time
2005-10-06 00:59:43 encoded in it's product-identifier was created at
2005-10-06 01:08:22.509 -- a difference of almost 9 minutes.  I hope
this is OK.

> Oct 06 01:08:25 b2n1 pqact[889074] INFO:                file: 
> /dcomdev/us007003/ldmdata/obs/surface/metar/20051006.metar
> Oct 06 01:08:25 b2n1 pqact[889074] INFO:                pipe: 
> bin/decod_dcmetr  -v 2  -t 300  -d 
> /dcomdev/us007003/decoder_logs/decod_dcmetr.log        
> /dcomdev/us007003/bufrtab.000   tables/stns/metar.tbl
> Oct 06 01:08:25 b2n1 pqing[208898] INFO:      556 20051006010825.964 
> FNEXRAD 579  SAUS70 KWBC 060100 RRE
> Oct 06 01:08:25 b2n1 pqing[208898] INFO:      136 20051006010825.964 
> FNEXRAD 580  SAUS80 KWBC 060100 RRE
> Oct 06 01:08:25 b2n1 pqing[208898] INFO:      140 20051006010825.968 
> FNEXRAD 581  SPUS80 KWBC 060104
> Oct 06 01:08:25 b2n1 pqing[208898] INFO:      149 20051006010825.969 
> FNEXRAD 582  SASD31 OEJD 060100 RRA
> Oct 06 01:08:25 b2n1 pqing[208898] INFO:       72 20051006010825.971 
> FNEXRAD 583  ULCM01 FKKD 060000
> Oct 06 01:08:25 b2n1 pqing[208898] INFO:       82 20051006010825.972 
> FNEXRAD 584  SAAG00 SACO 060100
> Oct 06 01:08:25 b2n1 pqing[208898] INFO:     1177 20051006010825.974 
> FNEXRAD 585  SAIY40 LIIB 060100
> ----
> 
> The last entry related to NEXRAD2 is this entry into the queue: "Oct 06 
> 01:08:25 b2n1 140.90.85.102[1437928] INFO:    22130 20051006010822.509 
> NEXRAD2 559026  L2-BZIP2/KLWX/20051006005943/559/26", pqact never acts 
> on the entry...
> 
> I was poking around the email archives and saw where you suggested to a 
> user that 4 instances of pqact is sometimes a good idea when working 
> with the CRAFT data,
> 
> pqact -f CRAFT -p BZIP2/K[A-D] etc/pqact.craft
> pqact -f CRAFT -p BZIP2/K[E-K] etc/pqact.craft
> pqact -f CRAFT -p BZIP2/K[L-R] etc/pqact.craft
> pqact -f CRAFT -p BZIP2/K[S-Z] etc/pqact.craft
> 
> Do you think adding these entries to our ldmd.conf would make any 
> difference?

Processing data-products using multiple, disjoint pqact(1) processes can 
improve throughput over a single pqact(1) process if the single process
has reached the limit on the number of open file-descriptors and has to
constantly close file-descriptors in order to open a new ones.  You can
determine the limit on the number of open file descriptors via the
commands "ulimit -a" or "getconf OPEN_MAX".

Also, be sure that the pqact(1) configuration-file contains only
relevant entries because the pqact(1) program tests every data-product
against every entry.

Using a faster computer can also help -- as will having sufficient
memory so that the relevant processes remain in memory rather than being
swapped to disk (the top(1) utility can help determine this).

> Thanks again for the help,
> 
> Justin
[snip]

Regards,
Steve Emmerson


NOTE: All email exchanges with Unidata User Support are recorded in the Unidata inquiry tracking system and then made publicly available through the web. If you do not want to have your interactions made available in this way, you must let us know in each email you send to us.