[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: "pbuf_flush: time elapsed" problem



Justin,

>Date: Fri, 16 Sep 2005 09:48:18 -0400
>From: Justin Cooke <address@hidden>
>Organization: NOAA
>To: address@hidden
>Subject: Re: "pbuf_flush: time elapsed" problem

The above message contained the following:

> I added the logging to the script that we are seeing the most 
> "pbuf_flush Timed out" from.  Yesterday we also broke apart the pqact 
> processing, previously we only had one pqact for all incoming data.  We 
> created a second entry in ldmd.conf for just CRAFT data and the existing 
> entry was modified to act on all data type except CRAFT but we still saw 
> the same errors overnight.  Below is another log entry:
> 
> ---
> Sep 16 03:02:21 b2n1 eldm4[585824]:  7504142 20050916030208.060    FSL5 
> 000  FSL.CompressedNetCDF.MADIS.mesonet4.20050916_0100.gz
> Sep 16 03:02:21 b2n1 pqact[909556]:  7504142 20050916030208.060    FSL5 
> 000  FSL.CompressedNetCDF.MADIS.mesonet4.20050916_0100.gz
> Sep 16 03:02:21 b2n1 pqact[909556]:                pipe: -close 
> /usr/ldm/ncep_bin/run_dcmeso.sh 
> /dcomdev/us007003/ldmdata/obs/surface/mesonet 20050916_0100.gz
> Sep 16 03:04:21 b2n1 pqact[909556]: pbuf_flush 2: time elapsed 120.000055
> Sep 16 03:04:21 b2n1 pqact[909556]: pbuf_flush (2) Timed out
> Sep 16 03:04:21 b2n1 pqact[909556]: pipe_put: 
> -close/usr/ldm/ncep_bin/run_dcmeso.sh/dcomdev/us007003/ldmdata/obs/surface/mesonet20050916_0100.gz
>  
> write error
> ---
> 
> But there is nothing in the dcmeso.log file related to this time:
> 
> ---
> END ---- PID= 1151078  ---- Thu Sep 15 23:02:08 EDT 2005 
> ----------------------------
> exit 0
> + exit 0
> echo "START -- PID=" $$ " ----" `date` "----------------------------"
> + date
> + echo START -- PID= 684064  ---- Thu Sep 15 23:04:21 EDT 2005 
> ----------------------------
> START -- PID= 684064  ---- Thu Sep 15 23:04:21 EDT 2005 
> ----------------------------
>                                                                               
>                                                                               
>           
>                      
> gzip -d > ${1}/$$.${2}
> + gzip -d
> + 1> /dcomdev/us007003/ldmdata/obs/surface/mesonet/684064.20050916_0200.gz
> madisfilename=${1}/`echo ${2} | cut -c1-13`
> ----
> 
> The END time of the previous run_dcmeso.sh instance is 03:02:08Z and the 
> next doesn't start until 03:04:21Z which is right after the "Timed out" 
> error but the file being acted on is 20050916_0200.gz where the "Timed 
> out" was for 20050916_0100.gz.  The "START" line is never added when the 
> the pipe: -close for 03:02:21 is run, it's as if the log is never 
> touched, looking like the script never ran.
> 
> any ideas?

Not yet.  I'm going to ponder this today.

Incidentally, you might want to invoke the newlog(1) utility via
crontab(1) to keep the logfile "dcmeso.log" from growing indefinitely.

> We are running this version of LDM on AIX 5.2, is there any other system 
> information that you would find useful?

I'll let you know.

> I have attached run_dcmeso.sh, it is slightly different from the script 
> that I was talking about yesterday but it is doing the same major steps, 
> unzipping a file and sending it off to a decoder.
> 
> Thanks for the help,
> 
> Justin
[snip]
> #!/bin/sh -vx
> #
> #  This script is EXECed directly by DBNet in order to run the
> #  dcmeso decoder on the data file given in the first argument.
> #
> #    Usage: ./run_dcmeso.sh <mesonet_filename>
> #
> #  Once this is done, the data file itself is then compressed
> #  within its native directory for more efficient short-term
> #  storage.
> #
> 
> # The gzip line must be the first, noncomment line in this script
> # so that stdin is processed correctly
> exec >> /usr/ldm/logs/dcmeso.log 2>&1
> echo "START -- PID=" $$ " ----" `date` "----------------------------"
> 
> gzip -d > ${1}/$$.${2}
> madisfilename=${1}/`echo ${2} | cut -c1-13`
> hhmm=`date -u +%H%M` 
> decoderfilename=${madisfilename}.${hhmm}
> mv ${1}/$$.${2} ${decoderfilename}
> 
> . /ioddev/dbndev/.profile
> 
> #export MADIS_STATIC=/nwprod/lib/sorc/madis/static
> export MADIS_STATIC=$DCDROOT/lib/sorc/madis-2.5/static
> export MADIS_DATA=/dcomdev/us007003/ldmdata
> 
> ln -sf ${decoderfilename} ${madisfilename}
> 
> nice $DCDROOT/bin/decod_dcmeso -v 2 -d 
> /dcomdev/us007003/decoder_logs/decod_dcmeso.log  ${decoderfilename} 
> /dcomdev/us007003/bufrtab.255
> 
> rm -f ${madisfilename}
> 
> #
> #  Compress the decoder input file within its native directory,
> #  in order to conserve disk space for these large files!
> #
> 
> gzip ${decoderfilename}
> 
> #
> #  Explicitly set the script return code to 0, in order to prevent
> #  the "compress" return code from becoming the script return code
> #  (and thereby prevent DBNet from re-running the script for this
> #  particular data file if there is a problem with the compress!)
> #
> 
> echo "END ---- PID=" $$ " ----" `date` "----------------------------"
> exit 0

Regards,
Steve Emmerson