
incident with LDM on shemp, Friday, May 12, 21:49:00 (fwd)




===============================================================================
Robb Kambic                                Unidata Program Center
Software Engineer III                      Univ. Corp for Atmospheric Research
address@hidden             WWW: http://www.unidata.ucar.edu/
===============================================================================

---------- Forwarded message ----------
Date: Mon, 15 May 2000 12:56:41 -0600
From: Russ Rew <address@hidden>
To: address@hidden
     address@hidden
Subject: incident with LDM on shemp, Friday, May 12, 21:49:00

Chiz,

We restarted the LDM on shemp this morning by shutting it down
normally and then rebooting shemp, because someone noticed and
reported that an ADDE server on shemp wasn't responding.  I copied all
the log files and Mike captured output from top and ps before
rebooting; we also copied the old product queue just in case it would
be useful.  All this is available on shemp in the directory
/local/ldm/logs/incident/.  

At this point I'm not studying this too closely, because it looks like
it may just be fallout from a router problem that occurred on
Friday and took a while to fix.  I'm afraid I went up to the "Spring
Fling 2000" on the Mesa and missed most of this, but I probably should
have checked shemp's LDM over the weekend.

Apparently this morning before shutting down shemp's LDM there were
301 rpc.ldmd processes running (see ps.out), lots of other associated
processes, and the load average was about 272 (see top.out).  The
pqmon.log.1 showed that products stopped going into the queue between
21:48:45 and 21:49:01 on Friday:

 May 12 21:48:45 pqmon: 116649   128   43223  1875448600    159996     3832     2   9747720 6891
 May 12 21:49:01 pqmon: 116705   128   43167  1876524032    159996     3832     2   9747720 6906
 May 12 21:49:16 pqmon: 116705   128   43167  1876524032    159996     3832     2   9747720 6921
 May 12 21:49:31 pqmon: 116705   128   43167  1876524032    159996     3832     2   9747720 6936
 ...

and ldmd.log.2 shows symptoms of network problems during the hour
before that (lots of "pq_sequence failed: I/O error (errno = 5)" and
"nullproc(<hostname>): RPC: Unable to receive" and RECLASS messages).

Robb says it's expected that the LDM starts up a lot of extra
processes when the network is flaky as it was on Friday, so unless you
see something else funny about this, I'm assuming it has nothing to do
with the new pq code.
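
For anyone going through the copied logs, the stall visible in the
pqmon excerpt above can be located mechanically.  The following is
only a rough sketch, not part of the LDM: find_stall and parse_pqmon
are hypothetical helper names, and the script assumes nothing about
what each column means.  It only assumes that the last field keeps
counting up (as it does in the excerpt) and that a stall shows up as
every other field staying unchanged from one sample to the next.  It
also tolerates status lines that were wrapped onto a second line.

```python
import re

# Matches lines such as "May 12 21:48:45 pqmon: 116649 128 ..."
LINE_RE = re.compile(r"^\s*(\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+pqmon:\s+(.*)$")


def parse_pqmon(lines):
    """Return a list of (timestamp, [int fields]) from pqmon status lines.

    A line consisting only of numbers is treated as the wrapped
    continuation of the previous status line.
    """
    records = []
    for line in lines:
        m = LINE_RE.match(line)
        tokens = line.split()
        if m:
            records.append((m.group(1), [int(x) for x in m.group(2).split()]))
        elif records and tokens and all(t.isdigit() for t in tokens):
            records[-1][1].extend(int(t) for t in tokens)
    return records


def find_stall(lines, ignore_last=1):
    """Return the timestamp of the first sample after which nothing changes.

    The last `ignore_last` fields are skipped in the comparison
    (assumption: they keep changing even when the queue is idle).
    Returns None if the final pair of samples still differs.
    """
    samples = parse_pqmon(lines)
    stall = None
    for (t_prev, prev), (_, cur) in zip(samples, samples[1:]):
        if prev[:len(prev) - ignore_last] == cur[:len(cur) - ignore_last]:
            if stall is None:
                stall = t_prev
        else:
            stall = None
    return stall


sample = [
    "May 12 21:48:45 pqmon: 116649 128 43223 1875448600 159996 3832 2 9747720 6891",
    "May 12 21:49:01 pqmon: 116705 128 43167 1876524032 159996 3832 2 9747720 6906",
    "May 12 21:49:16 pqmon: 116705 128 43167 1876524032 159996 3832 2 9747720 6921",
    "May 12 21:49:31 pqmon: 116705 128 43167 1876524032 159996 3832 2 9747720 6936",
]
print(find_stall(sample))  # prints "May 12 21:49:01"
```

On the four samples quoted above this reports 21:49:01 as the last
sample in which anything other than the final field changed.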

--Russ

P.S.  There's a problem with starting up pqmon from an exec line in
ldmd.conf.  Instead of sleeping for 30-second intervals between
status lines when started with

   exec "pqmon -i 30 -l /usr/local/ldm/logs/pqmon.log"

it outputs the product queue status every time a product comes in,
because it gets woken up by the process group signal that a new
product is available.  I'll have to fix this, but for now I just
killed the pqmon started in the LDM's process group and started up a
new one outside the rpc.ldmd's process group, appending to pqmon.log ...
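
One general way to keep a reporting interval intact when a signal
cuts a sleep short is to sleep against a deadline and go back to
sleep for whatever time remains.  The sketch below only illustrates
that pattern in Python; it is not the pqmon code (pqmon is written
in C) and not necessarily how the fix will be done.  sleep_until and
report_every are hypothetical names, and the sleep and clock
arguments exist only so the loop can be exercised without real
waiting.

```python
import time


def sleep_until(deadline, sleep=time.sleep, clock=time.monotonic):
    """Sleep until `deadline`, going back to sleep after any early wake-up."""
    while True:
        remaining = deadline - clock()
        if remaining <= 0:
            return
        # May return early (for example, when interrupted); loop re-checks.
        sleep(remaining)


def report_every(interval, report, count,
                 sleep=time.sleep, clock=time.monotonic):
    """Call `report` `count` times, `interval` seconds apart."""
    next_due = clock()
    for _ in range(count):
        report()
        next_due += interval
        sleep_until(next_due, sleep, clock)


class EarlyWakeClock:
    """Fake clock whose sleep never lasts more than 7 seconds,
    standing in for a sleep that keeps getting interrupted."""

    def __init__(self):
        self.now = 0.0
        self.calls = 0

    def clock(self):
        return self.now

    def sleep(self, secs):
        self.calls += 1
        self.now += min(secs, 7.0)


fake = EarlyWakeClock()
stamps = []
report_every(30, lambda: stamps.append(fake.now), 3,
             sleep=fake.sleep, clock=fake.clock)
print(stamps)  # prints [0.0, 30.0, 60.0]
```

Even though every simulated sleep is cut short, the reports still
land 30 seconds apart because each wake-up is checked against the
deadline before anything is printed.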