
Re: 20000509: LDM dumping core (fwd)



> ---------- Forwarded message ----------
> Date: Tue, 09 May 2000 12:43:21 -0600
> From: Unidata Support <address@hidden>
> Reply-To: Jason J. Levit <address@hidden>
> To: address@hidden
> Subject: 20000509: LDM dumping core
>
> >To: address@hidden
> >From: "Jason J. Levit" <address@hidden>
> >Subject: LDM dumping core
> >Organization: Center for Analysis and Prediction of Storms, University of Oklahoma
> >Keywords: 200005091626.e49GPx422939
>
>   Hello,
>
>   I'm running LDM 5.0.9 on an SGI Origin 2000 system.  Recently, we've
> been experiencing outages with the software every few hours - some LDM
> processes receive an "interrupt" - and then LDM dumps core and dies.
> Would there be a likely reason for this behavior that we can fix?  I've
> attached a log file from one of the recent crashes.  We've tried to
> diagnose the problem but to no avail.
>
>   Thanks for any help!!!
>
>   Jason Levit
>
> --
> ----------------------------------------------------------------------------
> Jason J. Levit, N9MLA                         Research Associate,
> address@hidden                   Center for Analysis and Prediction of Storms
> Room 1014                                   University of Oklahoma
> 405/325-3503                               http://www.caps.ou.edu/
> Attachment: ldmd.log.1
>
> May 09 14:42:01 5Q:tornado rpc.ldmd[403559]: Starting Up (built: Apr  7 2000 13:59:22)
> May 09 14:42:02 5Q:tornado stokes[416515]: run_requester: Starting Up: stokes.metr.ou.edu
> May 09 14:42:02 5Q:tornado stokes[416515]: run_requester: 20000509134202.001 TS_ENDT {{UNIDATA,  ".*"}}
> May 09 14:42:02 5Q:tornado 172.24.10.2[414853]: run_requester: Starting Up: 172.24.10.2
> May 09 14:42:02 5Q:tornado 172.24.10.2[414853]: run_requester: 20000509134202.006 TS_ENDT {{NMC3,  ".*"}}
> May 09 14:42:02 5Q:tornado pqexpire[385034]: Starting Up
> May 09 14:42:02 5Q:tornado 172.31.10.10[407581]: run_requester: Starting Up: 172.31.10.10
> May 09 14:42:02 5Q:tornado 172.31.10.10[407581]: run_requester: 20000509134202.014 TS_ENDT {{NMC3,  ".*"}}
> May 09 14:42:02 5Q:tornado 172.31.10.18[414959]: run_requester: Starting Up: 172.31.10.18
> May 09 14:42:02 5Q:tornado 172.31.10.18[414959]: run_requester: 20000509134202.018 TS_ENDT {{NMC3,  ".*"}}
> May 09 14:42:02 5Q:tornado 172.24.240.2[416607]: run_requester: Starting Up: 172.24.240.2
> May 09 14:42:02 5Q:tornado 172.24.240.2[416607]: run_requester: 20000509134202.023 TS_ENDT {{NMC3,  ".*"}}
> May 09 14:42:02 5Q:tornado 172.24.10.34[416769]: run_requester: Starting Up: 172.24.10.34
> May 09 14:42:02 5Q:tornado 172.24.10.34[416769]: run_requester: 20000509134202.028 TS_ENDT {{NMC3,  ".*"}}
> May 09 14:42:02 5Q:tornado 172.24.10.66[411211]: run_requester: Starting Up: 172.24.10.66
> May 09 14:42:02 5Q:tornado 172.24.10.66[411211]: run_requester: 20000509134202.033 TS_ENDT {{NMC3,  ".*"}}
> May 09 14:42:02 5Q:tornado 172.31.10.10[407581]: FEEDME(172.31.10.10): OK
> May 09 14:42:02 5Q:tornado 172.24.240.2[416607]: FEEDME(172.24.240.2): OK
> May 09 14:42:02 5Q:tornado 172.31.10.18[414959]: FEEDME(172.31.10.18): OK
> May 09 14:42:02 5Q:tornado pqact[416893]: Starting Up
> May 09 14:42:02 5Q:tornado 172.24.10.2[414853]: FEEDME(172.24.10.2): OK
> May 09 14:42:02 5Q:tornado 172.24.10.34[416769]: FEEDME(172.24.10.34): OK
> May 09 14:42:02 5Q:tornado 172.24.10.66[411211]: FEEDME(172.24.10.66): OK
> May 09 14:42:02 5Q:tornado stokes[416515]: FEEDME(stokes.metr.ou.edu): OK
> May 09 14:42:03 3Q:tornado pqact[416893]: pbuf_flush (5) write: Broken pipe
> May 09 14:42:03 3Q:tornado pqact[416893]: pbuf_flush (6) write: Broken pipe
> May 09 14:42:03 5Q:tornado pqact[416893]: child 411698 exited with status 127
> May 09 14:42:03 5Q:tornado pqact[416893]: child 411263 exited with status 127
> May 09 14:42:03 5Q:tornado localhost[401572]: Connection from localhost
> May 09 14:42:03 5Q:tornado localhost[401572]: Connection reset by peer
> May 09 14:42:03 5Q:tornado localhost[401572]: Exiting
> May 09 14:42:24 5Q:tornado orion[417304]: Connection from orion.nssl.noaa.gov
> May 09 14:42:24 5Q:tornado orion(feed)[417304]: Starting Up: 20000509145557.631 TS_ENDT {{NMC3,  ".*"}}
> May 09 14:42:24 5Q:tornado orion(feed)[417304]: topo:  orion.nssl.noaa.gov NMC3
> May 09 14:42:36 3Q:tornado pqact[416893]: pbuf_flush (9) write: Broken pipe
> May 09 14:42:36 3Q:tornado pqact[416893]: pipe_dbufput: ldmConnect-e.ZONES0005data/weather write error
> May 09 14:42:36 3Q:tornado pqact[416893]: pipe_prodput: trying again
> May 09 14:42:36 5Q:tornado pqact[416893]: child 416971 exited with status 127
> May 09 14:42:36 5Q:tornado pqact[416893]: child 417258 exited with status 127
> May 09 14:42:36 5Q:tornado pqact[416893]: child 400256 exited with status 127
> May 09 14:42:45 5Q:tornado pqact[416893]: child 416556 exited with status 127
> May 09 14:42:45 5Q:tornado pqact[416893]: child 408101 exited with status 127
> May 09 14:43:11 5Q:tornado pqact[416893]: child 417123 exited with status 127
> May 09 14:43:11 5Q:tornado pqact[416893]: child 413421 exited with status 127
> May 09 14:43:26 5Q:tornado pqact[416893]: child 417133 exited with status 127
> May 09 14:43:26 5Q:tornado pqact[416893]: child 416810 exited with status 127
> May 09 14:43:33 5Q:tornado pqact[416893]: child 415912 exited with status 127
> May 09 14:43:33 5Q:tornado pqact[416893]: child 411493 exited with status 127
> May 09 14:43:38 3Q:tornado pqact[416893]: pbuf_flush (9) write: Broken pipe
> May 09 14:43:38 3Q:tornado pqact[416893]: pipe_dbufput: ldmConnect-e.ZONES0005data/weather write error
> May 09 14:43:38 3Q:tornado pqact[416893]: pipe_prodput: trying again
> May 09 14:43:38 5Q:tornado pqact[416893]: child 401986 exited with status 127
> May 09 14:43:38 5Q:tornado pqact[416893]: child 417221 exited with status 127
> May 09 14:43:38 5Q:tornado pqact[416893]: child 417362 exited with status 127
> May 09 14:43:39 5Q:tornado pqact[416893]: child 416921 exited with status 127
> May 09 14:43:39 5Q:tornado pqact[416893]: child 416847 exited with status 127
> May 09 14:43:41 5Q:tornado pqact[416893]: child 416714 exited with status 127
> May 09 14:43:41 5Q:tornado pqact[416893]: child 416454 exited with status 127
> May 09 14:44:15 3Q:tornado pqact[416893]: pbuf_flush (12) write: Broken pipe
> May 09 14:44:15 3Q:tornado pqact[416893]: pipe_dbufput: ldmConnect-e.ZONES0005data/weather write error
> May 09 14:44:15 3Q:tornado pqact[416893]: pipe_prodput: trying again
> May 09 14:44:15 5Q:tornado pqact[416893]: child 416559 exited with status 127
> May 09 14:44:15 5Q:tornado pqact[416893]: child 416106 exited with status 127
> May 09 14:44:15 5Q:tornado pqact[416893]: child 417147 exited with status 127
> May 09 14:44:17 5Q:tornado pqact[416893]: child 417240 exited with status 127
> May 09 14:44:17 5Q:tornado pqact[416893]: child 416273 exited with status 127
> May 09 14:44:17 5Q:tornado pqact[416893]: child 416940 exited with status 127
> May 09 14:44:17 5Q:tornado pqact[416893]: child 417069 exited with status 127
> May 09 14:44:21 5Q:tornado pqact[416893]: child 417193 exited with status 127
> May 09 14:44:21 5Q:tornado pqact[416893]: child 416563 exited with status 127
> May 09 14:44:35 5Q:tornado stokes[416515]: Growing data by 3522560
> May 09 14:44:55 5Q:tornado stokes[416515]: Growing data by 3522560
> May 09 14:44:59 5Q:tornado pqact[416893]: child 416684 exited with status 127
> May 09 14:44:59 5Q:tornado pqact[416893]: child 417523 exited with status 127
> May 09 14:44:59 5Q:tornado stokes[416515]: Growing data by 3522560
> May 09 14:45:05 5Q:tornado stokes[416515]: Growing data by 3522560
> May 09 14:45:15 5Q:tornado stokes[416515]: Growing data by 3522560
> May 09 14:45:18 5Q:tornado pqact[416893]: child 416727 exited with status 127
> May 09 14:45:18 5Q:tornado pqact[416893]: child 415324 exited with status 127
> May 09 14:45:23 5Q:tornado stokes[416515]: Growing data by 3522560
> May 09 14:45:43 5Q:tornado last message repeated 2 times
> May 09 14:45:53 5Q:tornado 172.31.10.10[407581]: Growing data by 3522560
> May 09 14:46:04 5Q:tornado stokes[416515]: Growing data by 3522560
> May 09 14:46:26 5Q:tornado last message repeated 2 times
> May 09 14:46:33 5Q:tornado stokes[416515]: Growing data by 3522560
> May 09 14:46:38 5Q:tornado stokes[416515]: Growing data by 3522560
> May 09 14:46:40 5Q:tornado pqact[416893]: child 416597 exited with status 127
> May 09 14:46:40 5Q:tornado pqact[416893]: child 417422 exited with status 127
> May 09 14:46:55 5Q:tornado stokes[416515]: Growing data by 3522560
> May 09 14:47:19 5Q:tornado stokes[416515]: Growing data by 3522560
> May 09 14:47:46 5Q:tornado 172.24.10.34[416769]: Growing data by 3522560
> May 09 14:48:39 5Q:tornado stokes[416515]: Growing data by 3522560
> May 09 14:49:26 5Q:tornado pqact[416893]: child 417254 exited with status 127
> May 09 14:49:26 5Q:tornado pqact[416893]: child 415951 exited with status 127
> May 09 14:50:09 5Q:tornado stokes[416515]: Growing data by 3522560
> May 09 14:51:04 5Q:tornado 172.24.10.34[416769]: Growing data by 3522560
> May 09 14:51:54 5Q:tornado stokes[416515]: Growing data by 3522560
> May 09 14:52:05 5Q:tornado pqexpire[385034]: > Recycled  31232.351 kb/hr (  1963.668 prods per hour)
> May 09 14:52:09 5Q:tornado pqact[416893]: child 395305 exited with status 127
> May 09 14:52:09 5Q:tornado pqact[416893]: child 417376 exited with status 127
> May 09 14:53:38 5Q:tornado stokes[416515]: Growing data by 3522560
> May 09 14:55:26 5Q:tornado last message repeated 2 times
> May 09 14:56:30 5Q:tornado stokes[416515]: Growing data by 3522560
> May 09 14:56:48 5Q:tornado pqact[416893]: child 416561 exited with status 127
> May 09 14:56:48 5Q:tornado pqact[416893]: child 417307 exited with status 127
> May 09 14:57:01 5Q:tornado stokes[416515]: Growing data by 3522560
> May 09 14:57:11 5Q:tornado pqexpire[385034]: > Recycled  32670.249 kb/hr (  2028.700 prods per hour)
> May 09 14:57:14 5Q:tornado pqact[416893]: child 408745 exited with status 127
> May 09 14:57:14 5Q:tornado pqact[416893]: child 411542 exited with status 127
> May 09 14:57:19 5Q:tornado pqact[416893]: child 415434 exited with status 127
> May 09 14:57:19 5Q:tornado pqact[416893]: child 416671 exited with status 127
> May 09 14:57:25 5Q:tornado pqact[416893]: child 417242 exited with status 127
> May 09 14:57:25 5Q:tornado pqact[416893]: child 407247 exited with status 127
> May 09 14:57:30 5Q:tornado pqact[416893]: child 410660 exited with status 127
> May 09 14:57:30 5Q:tornado pqact[416893]: child 413995 exited with status 127
> May 09 14:58:24 5Q:tornado stokes[416515]: Growing data by 3522560
> May 09 14:59:16 5Q:tornado 172.24.10.2[414853]: Growing data by 3522560
> May 09 14:59:16 5Q:tornado 172.31.10.10[407581]: Growing data by 3522560
> May 09 14:59:22 5Q:tornado rpc.ldmd[403559]: child 416515 terminated by signal 8
> May 09 14:59:22 5Q:tornado rpc.ldmd[403559]: Killing (SIGINT) process group
> May 09 14:59:22 5Q:tornado rpc.ldmd[403559]: Interrupt
> May 09 14:59:22 5Q:tornado rpc.ldmd[403559]: Exiting
> May 09 14:59:22 5Q:tornado pqexpire[385034]: Interrupt
> May 09 14:59:22 5Q:tornado DCSYNOP[416675]: Interrupt Signal
> May 09 14:59:22 5Q:tornado DCSYNOP[416074]: Interrupt Signal
> May 09 14:59:22 5Q:tornado DCUAIR[416058]: Interrupt Signal
> May 09 14:59:22 5Q:tornado DCHRLY[415135]: Interrupt Signal
> May 09 14:59:22 5Q:tornado orion(feed)[417304]: Interrupt
> May 09 14:59:22 5Q:tornado orion(feed)[417304]: Exiting
> May 09 14:59:22 5Q:tornado pqact[416893]: Interrupt
> May 09 14:59:22 5Q:tornado pqact[416893]: Exiting
> May 09 14:59:22 5Q:tornado rpc.ldmd[403559]: Terminating process group
> May 09 14:59:22 5Q:tornado rpc.ldmd[403559]: child 414853 terminated by signal 9
> May 09 14:59:54 5Q:tornado 172.24.240.2[416607]: Interrupt
> May 09 14:59:54 5Q:tornado 172.24.240.2[416607]: Exiting
> May 09 14:59:54 5Q:tornado 172.31.10.18[414959]: Growing data by 3522560
> May 09 14:59:54 5Q:tornado rpc.ldmd[403559]: child 385034 terminated by signal 11
> May 09 14:59:54 5Q:tornado rpc.ldmd[403559]: Killing (SIGINT) process group
> May 09 14:59:54 5Q:tornado rpc.ldmd[403559]: child 407581 terminated by signal 4
> May 09 14:59:54 5Q:tornado rpc.ldmd[403559]: Killing (SIGINT) process group
> May 09 15:00:49 5Q:tornado 172.24.10.66[411211]: Growing data by 3522560
> May 09 15:00:49 5Q:tornado rpc.ldmd[403559]: child 414959 terminated by signal 11
> May 09 15:00:49 5Q:tornado rpc.ldmd[403559]: Killing (SIGINT) process group
> May 09 15:00:49 5Q:tornado 172.24.10.34[416769]: Growing data by 3522560
> May 09 15:00:49 5Q:tornado rpc.ldmd[403559]: child 411211 terminated by signal 9
> May 09 15:00:49 5Q:tornado rpc.ldmd[403559]: child 416769 terminated by signal 9
>

Hi Jason,

I'm not sure what you mean by 'outages' with the software.  Do you mean
simply that it crashes?

The log entry "child <XXX> exited with status 127" means that an exec
failed.  It looks like pqact is trying to start some processes and
failing.  This could cause problems that percolate outward and produce
the cascade of 'terminated' entries at the end of your log.
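
To illustrate where a status of 127 can come from, here is a minimal
Python sketch of the usual Unix fork/exec/wait pattern.  It assumes,
rather than quotes from the LDM sources, the common convention that a
child whose exec fails exits with status 127; the function name
spawn_and_wait and the command no-such-decoder-xyz are hypothetical.

```python
import os


def spawn_and_wait(argv):
    """Fork, exec argv in the child, and return how the child ended.

    Returns the exit status for a normal exit, or minus the signal number
    if the child was killed by a signal.
    """
    pid = os.fork()
    if pid == 0:
        # Child: try to replace this process with the requested command.
        try:
            os.execvp(argv[0], argv)
        except OSError:
            # Exec failed (not found, not executable, ...).  By convention
            # the child reports this as exit status 127.  os._exit exits at
            # once, without running inherited cleanup handlers or flushing
            # inherited buffers.
            os._exit(127)
    # Parent: wait for the child and decode its status word.
    _, status = os.waitpid(pid, 0)
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    # e.g. "terminated by signal 8" (SIGFPE) would come back as -8.
    return -os.WTERMSIG(status)


if __name__ == "__main__":
    print(spawn_and_wait(["true"]))                 # command exists
    print(spawn_and_wait(["no-such-decoder-xyz"]))  # exec fails
```

Run as a script, it should print 0 and then 127.  A child killed by a
signal, like the "terminated by signal" entries in your log, is reported
as a negative number instead.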

An exec failure is often due to a path, ownership, permissions, or disk
space problem, or to a problem with the command being exec'ed itself.
Make sure that all directories needed by the decoders exist and are
writable by 'ldm', that there's enough space on the disk, and that you're
running as user 'ldm'.

There are a few related questions and answers in our support database.
See http://www.unidata.ucar.edu/cgi-bin/mfs/65/2965?18#mfs for a
discussion of how to put pqact in quiet, verbose, and debug modes to get
more information.  See http://www.unidata.ucar.edu/cgi-bin/mfs/65/3581?79#mfs
for a discussion of running pqact entries individually by hand in an
effort to isolate the problem.
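
When running an entry by hand, it also helps to feed the command some
data on standard input and check its exit status: a write to a child
that has already exited fails with "Broken pipe", much like the
pbuf_flush entries in your log.  The sketch below runs a command line
through /bin/sh using Python's subprocess module.  This is an assumption
about how to reproduce an entry by hand, not a description of how pqact
itself launches decoders, and the function name run_pipe_action and the
sample commands are hypothetical.

```python
import subprocess


def run_pipe_action(command, product):
    """Run `command` through /bin/sh, feed it `product` on stdin, and
    return (exit_status, stderr_text).

    A negative exit_status means the command was killed by that signal.
    """
    result = subprocess.run(
        command,
        shell=True,                # let /bin/sh parse the line, as if typed by hand
        input=product,             # bytes written to the command's stdin
        stdout=subprocess.DEVNULL, # discard normal output
        stderr=subprocess.PIPE,    # keep error messages for inspection
    )
    # communicate() ignores broken-pipe errors on stdin, so a command that
    # exits early still yields a status here instead of raising.
    return result.returncode, result.stderr.decode(errors="replace")


if __name__ == "__main__":
    sample = b"x" * 100_000  # stand-in for a product; larger than a pipe buffer
    for cmd in ("cat", "no-such-decoder-xyz"):
        status, err = run_pipe_action(cmd, sample)
        # /bin/sh reports "command not found" as exit status 127.
        print(cmd, status, err.strip())
```

A status of 127 together with a "not found" message from the shell points
back at the path checks above.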

Try these and see if you can fix the "child exited with status 127"
problem, then see if the other problems go away.  We're here if you still
have problems.  Good luck.

Anne

--
***************************************************
Anne Wilson                     UCAR Unidata Program
address@hidden                  P.O. Box 3000
                                  Boulder, CO  80307
----------------------------------------------------
Unidata WWW server       http://www.unidata.ucar.edu/
****************************************************