Re: 20011204: LDM: pqbinstats & system crash



Tom McDermott wrote:
> 
> On Tue, 4 Dec 2001, Anne Wilson wrote:
> 
> 
> > I believe pqbinstats reads the queue, so that might explain the queue
> > corruption.  It is not uncommon to see the runaway rpc.ldmd processes
> > once the ldm gets in such a confused state.  At that point, killing them
> > by hand like you did may be the only option.
> 
> No, the queue corruption was the result of the system crashing.  The
> system was ingesting normally (I have a watch window on my workstation so
> I know) at the time of the crash.  The log said it couldn't sync the disc,
> hence the corruption.  It has been a number of years since I've had this
> queue corruption problem.  But as I recall, the pqact processes at least
> terminated, and dozens of rpc.ldmd processes were not spawned as a result
> of connection attempts.  Perhaps this behavior has changed now.  Ideally,
> once it determines that the pq is corrupt, the entire system should shut
> down, since there is no point in continuing.  How difficult that would be
> to detect from the programmer's point of view, I don't know.
> 
> So my message was really directed more toward the post-crash behavior of
> ldm than toward the cause of the crash.
> 
Tom,

Yes, you're right - pqbinstats would not have corrupted the queue.  That
was a half-baked thought on my part.

In looking back over the archives and searching my memory, I know of
only a few cases of runaway processes within the past few years.  One
was yours, almost a year ago, 12/14/2000, where ldm processes overran
your machine.  In that case it was not clear that they were all rpc.ldmd
processes.  Another was Gilbert's machine, and seemed to correlate with
a particular site trying to connect.  Indeed, one of our own machines
had this problem when a remote host running an unsupported OS was trying
to connect.  Another instance was a site that had upgraded to a version
that required a new queue, but had not upgraded the queue.  This last
one was the only one that clearly involved the queue; indeed, it seems
that the other two didn't.  And there have been instances where the
queue was corrupted but the ldm didn't spawn processes like this.

The only really major changes to the ldm recently were the queue
algorithms that were improved about a year ago.  I don't think that
would affect the spawning of children, but I could be wrong about that.

I'm not sure how hard it would be to detect a corrupted queue from
within the code.  Since the problem doesn't appear to occur very often,
I will leave it at making a note of it.  If it recurs, I'll reexamine
this position.
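
As a rough illustration of how runaway rpc.ldmd processes might be
spotted from outside the ldm, here is a minimal Python sketch.  It is
not part of the LDM distribution: the function names and the threshold
of 20 are assumptions made up for this example, and it only inspects
`ps -eo pid,comm` output rather than the queue itself.

```python
import subprocess


def count_processes(ps_output, name="rpc.ldmd"):
    """Count rows of `ps -eo pid,comm` style output whose command is `name`."""
    count = 0
    for line in ps_output.splitlines():
        fields = line.split(None, 1)
        # Skip the header and any malformed row: a real row starts with a PID.
        if len(fields) != 2 or not fields[0].isdigit():
            continue
        # Some ps variants print a full path; compare the base name only.
        command = fields[1].strip().rsplit("/", 1)[-1]
        if command == name:
            count += 1
    return count


def is_runaway(ps_output, name="rpc.ldmd", threshold=20):
    """Return True when more than `threshold` (an assumed limit) copies run."""
    return count_processes(ps_output, name) > threshold


def check_now(name="rpc.ldmd", threshold=20):
    """Take a live snapshot with ps and report (count, runaway flag)."""
    result = subprocess.run(
        ["ps", "-eo", "pid,comm"],
        capture_output=True, text=True, check=True,
    )
    count = count_processes(result.stdout, name)
    return count, count > threshold
```

Calling check_now() from cron would at least flag the condition early,
though killing the processes by hand may still be the only remedy.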

Anne


> Tom
> -----------------------------------------------------------------------------
> Tom McDermott                           Email: address@hidden
> Systems Administrator                   Phone: (716) 395-5718
> Earth Sciences Dept.                    Fax: (716) 395-2416
> SUNY College at Brockport

-- 
***************************************************
Anne Wilson                     UCAR Unidata Program            
address@hidden                 P.O. Box 3000
                                  Boulder, CO  80307
----------------------------------------------------
Unidata WWW server       http://www.unidata.ucar.edu/
****************************************************