
20050214: LDM product queue corruption



Gabe,

>Date: Mon, 14 Feb 2005 15:21:52 -0500 (EST)
>From: Gabe Langbauer <address@hidden>
>Organization: Ohio State University
>To: Steve Emmerson <address@hidden>
>Subject: Re: 20050214: LDM product queue corruption 

The above message contained the following:

> The original log is attached, note there is no ldmping issue on this log,
> it seems to die with a rpc.ldmd error...and there is a mention of rtstats.
> I don't know if those are the stats from "do stats"  Everytime subsequent 
> time I issued the start command I got this log (although times were
> different):
> 
> Feb 12 23:24:21 twister ldmping[10477]: SVC_UNAVAIL   0.000601    0
> localhost    RPC: Program not registered
> Feb 12 23:24:21 twister pqcheck[10481]: Starting Up (10472)
> Feb 12 23:24:21 twister pqcheck[10481]: The writer-counter of the
> product-queue is 0
> Feb 12 23:24:21 twister pqcheck[10481]: Exiting

The above are OK.  The "ldmping" entry is from the ldmadmin(1) script
testing to see if an LDM is already running.  The pqcheck(1) entries are
from the same script checking to see that the product-queue is OK.

> I agree, mighty suspicious indeed.  Logs above

The end of the logfile contained this:

    Feb 12 22:58:54 twister rpc.ldmd[791]: child 793 terminated by signal 25

Process 793 was a pqact(1) process:

    $ fgrep '[793]' ldmd.log.4
    Feb 12 07:02:16 twister pqact[793]: child 569 exited with status 1 
    Feb 12 07:58:21 twister pqact[793]: child 16497 exited with status 1
    Feb 12 21:12:23 twister pqact[793]: child 11341 exited with status 1
    Feb 12 22:30:00 twister pqact[793]: pbuf_flush (3) write: Broken pipe 

and was, undoubtedly, started via an EXEC entry in the LDM
configuration-file, etc/ldmd.conf.

The LDM server exits when an EXEC-ed child process terminates abnormally
due to a seriously bad signal (e.g., SIGSEGV).

Oddly, on my system, signal 25 is SIGCONT and should not cause the
pqact(1) process to terminate.  What is it on your system?
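
Signal numbers are not the same on every operating system, so the
quickest way to answer that is to ask the shell on the machine in
question. A minimal sketch, assuming a POSIX-style shell whose kill(1)
accepts "-l" followed by a signal number (the variable name is just for
illustration):

```shell
#!/bin/sh
# Look up the name that this system assigns to signal number 25.
# "kill -l <number>" prints the signal name without the "SIG" prefix.
signame=$(kill -l 25)
echo "signal 25 is SIG${signame}"
```

Running this on the host where rpc.ldmd logged the message should show
which signal actually terminated process 793.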

One can work around this behavior by wrapping EXEC-ed programs in a
shell script that ensures that the LDM never sees their abnormal
termination, e.g.,

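    $ cat util/execWrapper
    #!/bin/sh
    # Rerun the given command whenever it terminates, so that the LDM
    # never sees the child exit.
    while true
    do
        "$@"
        logger -p local0.notice "Restarting: $*"
    done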

(The above is off-the-top-of-my-head and might need modification.)

The relevant EXEC entry is then replaced with 

    EXEC        "execWrapper prog a1 a2"

(assuming the script is in the "util/" subdirectory and is executable).
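
If it would help to know why the wrapped program stopped each time, the
wrapper could also log the exit status and pause before restarting. The
following is only a sketch: the function names describe_status and
supervise, the 5-second pause, and the use of the local0 facility are my
assumptions, and it relies on the common Bourne-shell convention that a
status above 128 means "terminated by signal (status - 128)", which not
every shell follows.

```shell
#!/bin/sh
# Hypothetical variant of util/execWrapper (names and delay are assumptions).

# Turn a shell exit status into a human-readable description.
# Convention assumed here: a status above 128 means death by signal.
describe_status() {
    if [ "$1" -gt 128 ]; then
        echo "terminated by signal $(($1 - 128))"
    else
        echo "exited with status $1"
    fi
}

# Run the given command forever, logging each termination via syslog.
supervise() {
    while true
    do
        "$@"
        status=$?
        logger -p local0.notice "$1 $(describe_status "$status"); restarting"
        sleep 5    # avoid a tight restart loop if the program fails at once
    done
}

# Supervise only when a command was given on the command line.
if [ "$#" -gt 0 ]; then
    supervise "$@"
fi
```

The EXEC entry would be written the same way as above, with the name of
this script in place of execWrapper. The pause is there so that a
program that dies immediately on startup cannot flood the log.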

Regards,
Steve Emmerson