
Re: 20040802: LDM pqact child Core Dump Problem



>Kevin,
>
>The child of pqact is most probably a decoder or script that you
>are piping to from pqact. If you have a core file, the command "file
>core" will tell you what process created it, or check your decoder logs
>and see if you can match the process ID with the child process.
>You can cycle pqact into verbose mode (using the -USR2 signal)
>to get a line-by-line listing of pqact actions if you need to see each
>action being processed to track down the child.
>
>One common problem with decoders is having more than one instance
>writing to the output file, which can occur if your
>system is falling behind. In this case, you would also probably notice
>pbuf messages in your LDM logs. A second -USR2 signal to pqact will
>cycle logging into debug mode. You would see "delay" messages there
>indicating how long it takes to process products in the queue.
>If the delay time is climbing, it also would signal that your
>products are backing up.
>
>Steve Chiswell
>Unidata User Support
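
In practice, the steps Steve describes might look roughly like the
following from the LDM account (the pqact PID is a placeholder, and the
log path assumes the default ~ldm/logs/ldmd.log location):

        # Identify which program wrote the core file (run this in the
        # directory where the core was found)
        file core

        # Find pqact's process ID, then cycle its logging with USR2:
        # the first signal turns on verbose logging (one line per action),
        # the second turns on debug logging (adds the "delay" messages)
        ps -ef | grep pqact
        kill -USR2 <pqact_pid>
        kill -USR2 <pqact_pid>

        # Watch for pbuf and delay messages in the LDM log
        grep -E 'pbuf|delay' ~ldm/logs/ldmd.log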

Steve...

I've been able to follow the PIDs to find the offending program, dcacft.
This is the one that keeps failing.  My pqact.conf entry looks like:

DDS|IDS (^U[ABDR].... ....|^XRXX84 KAWN|^YIXX84 KAWN) ([0-3][0-9])([0-2][0-9])
        PIPE    /usr/GEMPAK5.6/bin/linux/dcacft
        -e GEMTBL=/usr/GEMPAK5.6/gempak/tables
        /arpsdata/ldm1/ingest/gempak/pirep/YYMMDDHH_pirep.gem
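
A backtrace from one of the dcacft core files might also show where it
is dying.  A rough sketch, assuming gdb is installed and the core was
left in the decoder's working directory:

        # Load the decoder binary together with the core it produced,
        # then print the stack at the time of the crash
        gdb /usr/GEMPAK5.6/bin/linux/dcacft core
        (gdb) bt
        (gdb) quit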

It looks like our version of GEMPAK is rather old, so the first step is to
get our local admin person to upgrade it.

You mention "pbuf" messages.  Are these the messages that say "pbuf_flush"?
I've always had *LOTS* of them, so I assumed they were normal.

I just noticed that I have a bunch of

        pbuf_flush (##) write: Broken pipe

that seem to occur around the same time "dcacft" fails.
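
Something like this should line the two up by timestamp (again assuming
the default ~ldm/logs/ldmd.log location):

        # Pull the broken-pipe lines and the signal-11 lines together so
        # the timestamps can be compared
        grep -E 'pbuf_flush|terminated by signal' ~ldm/logs/ldmd.log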

Thanks for your assist.

        == kwthomas ==

>
>On Mon, 2004-08-02 at 12:02, Kevin W. Thomas wrote:
>> Hi...
>> 
>> Recently, while looking over some ETA and NEXRAD files received via LDM, I
>> noticed that there were periods when data was lost.  After doing lots of
>> checking around, with the help of a local System Administrator, I discovered
>> that the data gaps are strongly correlated with the message:
>> 
>>      pqact[pid]: child ##### terminated by signal 11
>> 
>> Signal 11 is "segmentation violation".
>> 
>> I have a second LDM system running a similar ingest configuration.  Checking
>> its log files shows the same problem.
>> 
>> Both systems are Intel, though I don't know which CPUs.  Both run RedHat 9.x.
>> The first has logged 29 seg faults today, with the second logging 16.  There
>> are no failure times common to the two logs.
>> 
>> I checked another Intel machine, unknown version of RedHat, probably 7.x or
>> 9.x, that had been running LDM a few months ago.  It logged 30 seg faults on
>> the last full day of operation on that system.
>> 
>> Everything is running LDM 6.0.14.
>> 
>> Any ideas would be greatly appreciated.
>> 
>>      Kevin W. Thomas
>>      Center for Analysis and Prediction of Storms
>>      University of Oklahoma
>>      Norman, Oklahoma
>>      Email:  address@hidden