
Re: 20011204: LDM: pqbinstats & system crash



Unidata Support wrote:
> 
> ------- Forwarded Message
> 
> >To: <address@hidden>
> >From: Tom McDermott <address@hidden>
> >Subject: LDM: pqbinstats & system crash
> >Organization: UCAR/Unidata
> >Keywords: 200111301311.fAUDBmN10047
> 
> Hi,
> 
> I don't think this is the sort of problem that lends itself to a solution,
> but I thought I would report it anyway.  My server is a Sun SPARCstation 10
> running Solaris 7 and LDM 5.1.4.  At 3:34 AM EST today something happened
> that seems to have been triggered by the pqbinstats program.  From the
> system log:
> 
> Nov 30 03:34:43 vortex unix: BAD TRAP: type=2 rp=fc0997c4 addr=0 mmu_fsr=0 rw=0
> Nov 30 03:34:44 vortex unix: pqbinstats:
> Nov 30 03:34:44 vortex unix: Illegal instruction
> Nov 30 03:34:44 vortex unix: pid=20212, pc=0xf00647fc, sp=0xfc099810, psr=0x408010c5, context=144
> Nov 30 03:34:44 vortex unix: g1-g7: 78727300, 0, f8ba76d8, 640, fc099b80, 1, f733e9a0
> Nov 30 03:34:44 vortex unix: Begin traceback... sp = fc099810
> Nov 30 03:34:44 vortex unix: Called from f008fda4, fp=fc099878, args=f8ba76d8 fc099a38 fc099b80 fc099ee0 fc099b80 0
> Nov 30 03:34:44 vortex unix: Called from f0090148, fp=fc0998d8, args=fc0999c0 fc099a38 f8ba76d8 0 0 4000000
> Nov 30 03:34:44 vortex unix: Called from f0066e94, fp=fc099b80, args=0 efffec70 0 0 0 1f22c
> Nov 30 03:34:44 vortex unix: Called from 13444, fp=efffef10, args=f 38520 198 0 3c06afe0 66
> Nov 30 03:34:44 vortex unix: End traceback...
> Nov 30 03:34:46 vortex unix: panic:
> Nov 30 03:34:46 vortex unix: Illegal instruction
> Nov 30 03:34:46 vortex unix:
> Nov 30 03:34:46 vortex unix: syncing file systems...
> Nov 30 03:34:46 vortex unix:  18
> Nov 30 03:34:46 vortex unix:  5
> Nov 30 03:34:46 vortex unix:  4
> Nov 30 03:34:46 vortex last message repeated 19 times
> Nov 30 03:34:46 vortex unix:  cannot sync -- giving up
> 
> This by itself wouldn't have been too bad, but as the last message might
> lead you to suspect, when the system rebooted, the product queue was
> corrupt.  But instead of the LDM stopping, the rpc.ldmd server and
> pqact processes continued to run, and more server processes were spawned
> as downstream sites kept trying to connect.  This led to a situation
> where the rpc.ldmd processes almost completely chewed up the CPU:
> 
> last pid:  7035;  load averages: 94.12, 92.54, 87.27              07:06:25
> 188 processes: 90 sleeping, 92 running, 3 zombie, 3 on cpu
> CPU states:  0.0% idle, 95.7% user,  4.3% kernel,  0.0% iowait,  0.0% swap
> Memory: 512M real, 338M free, 107M swap in use, 1065M swap free
> 
>   PID USERNAME THR PRI NICE  SIZE   RES STATE    TIME    CPU COMMAND
>   550 ldm        1  58    0  301M 2696K run     15:55  1.75% pqsurf
>  5735 ldm        1  59    0  293M 2692K run      1:44  1.64% rpc.ldmd
>  5436 ldm        1  59    0  293M 2672K run      2:37  1.44% rpc.ldmd
>  5076 ldm        1  49    0  293M 2672K run      2:33  1.36% rpc.ldmd
>   552 ldm        1  49    0  293M 2328K run     15:42  1.34% rpc.ldmd
>   549 ldm        1  49    0  293M 3280K run     16:11  1.31% pqact
>  4989 ldm        1  59    0  293M 2672K run      2:55  1.27% rpc.ldmd
>  1780 ldm        1  59    0  293M 2684K run      6:56  1.23% rpc.ldmd
>  1419 ldm        1  58    0  293M 2672K run      8:50  1.22% rpc.ldmd
>  4487 ldm        1  59    0  293M 2672K run      3:46  1.17% rpc.ldmd
>  6188 ldm        1  59    0  293M 2684K run      0:55  1.16% rpc.ldmd
>  2542 ldm        1  59    0  293M 2680K run      5:50  1.14% rpc.ldmd
>  1049 ldm        1  49    0  293M 2692K run     11:06  1.13% rpc.ldmd
>  4802 ldm        1  59    0  293M 2680K run      3:24  1.12% rpc.ldmd
>  5892 ldm        1  54    0  293M 2672K run      1:09  1.11% rpc.ldmd
>  6827 ldm        1  49    0  293M 2676K run      0:07  1.11% rpc.ldmd
>  5159 ldm        1  49    0  293M 2672K run      2:22  1.10% rpc.ldmd
>  6420 ldm        1  59    0  293M 2680K run      0:38  1.10% rpc.ldmd
> 
> But after manually killing the rpc.ldmd processes (ldmadmin stop didn't
> work), I remade the queues, and all is well again.
> 
> Tom
> -----------------------------------------------------------------------------
> Tom McDermott                           Email: address@hidden
> Systems Administrator                   Phone: (716) 395-5718
> Earth Sciences Dept.                    Fax: (716) 395-2416
> SUNY College at Brockport
> 
> ------- End of Forwarded Message

Hi there, Tom,

In two years I have not heard of pqbinstats crashing.  If you have a
core file, we can see where it crashed and what it was doing, which may
or may not lead us to a conclusion about why it happened.
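
For example, if Solaris wrote a core file (it usually lands in the
process's working directory, probably the LDM home directory here),
the native adb debugger can show the stack at the point of the crash;
gdb or dbx would do the same job if you have them installed.  The
paths below are just assumptions about your layout:

    cd ~ldm
    adb bin/pqbinstats core

Then type $c at adb's prompt to print the C stack traceback, and $q
to quit.  Sending us that traceback output would be a good start.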

One possibility is a bad disk block.  It could be that the reboot
detected and repaired it; would that appear in your logs?  If this
happens again, you could run fsck to scan for bad blocks.  If there
are bad blocks underneath the LDM installation, a reinstallation
would be prudent.
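
For example (the device name here is only an illustration; check
/etc/vfstab for the slice that actually holds the LDM, and run fsck
only on an unmounted file system or from single-user mode):

    umount /data
    fsck -y /dev/rdsk/c0t0d0s5

The -y flag answers yes to fsck's repair prompts; any blocks it
cannot read back would point at bad media.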

I believe pqbinstats reads the queue, so a crash while it had the
queue open (note the failed sync at panic time) might explain the
queue corruption.  It is not uncommon to see runaway rpc.ldmd
processes once the LDM gets into such a confused state.  At that
point, killing them by hand like you did may be the only option.
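
For future reference, the recovery you did by hand goes roughly like
this (a sketch: delqueue and mkqueue are the LDM 5.x ldmadmin
sub-commands for rebuilding the queue, and pkill is available as of
Solaris 7):

    ldmadmin stop                # try a clean shutdown first
    pkill -u ldm rpc.ldmd        # then kill any leftover servers by hand
    ldmadmin delqueue            # delete the corrupt product queue
    ldmadmin mkqueue             # make a fresh one
    ldmadmin start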

Anne
-- 
***************************************************
Anne Wilson                     UCAR Unidata Program            
address@hidden                 P.O. Box 3000
                                  Boulder, CO  80307
----------------------------------------------------
Unidata WWW server       http://www.unidata.ucar.edu/
****************************************************