Re: 20010123: Strange LDM freezes



Unidata Support wrote:
> 
> ------- Forwarded Message
> 
> >To: address@hidden
> >From: Pete Stamus <address@hidden>
> >Subject: Strange LDM freezes
> >Organization: UCAR/Unidata
> >Keywords: 200101230551.f0N5pce13012
> 
> Hi guys.  I've looked through everything I have from Anne's class,
> and searched the web pages, but can't find anything on this.  So....
> 
> We're running LDM 5.1.2 on an Intel box running SunOS 5.7 (a system
> we got from Alden).  I have it set up so pqing reads the NOAAPORT feed
> from a port and fills the queue, then the LDM goes from there--saving
> data and feeding another LDM on a different machine.
> 
> At intervals of 4-7 days or so, usually in the evening, the LDM
> freezes.  What I mean is that the LDM will be sitting there but not
> doing anything...no data gets saved, and no data gets sent to the
> other LDM on the other machine.  An 'ldmadmin stop' command says that
> it's shutting down the LDM, but the 'pqing' and 'rpc.ldmd' processes
> don't go away...I usually end up (after trying several less powerful
> things) having to do a 'kill -9'.  There's nothing that I see in the
> logs; here is the complete log (ldmd.log.1) from the last cycle...I
> started it on 18 Jan and it froze on 23 Jan.  Process 28912 was the
> rpc process that would not go away gracefully, and I had to 'kill -9' it.
> 
> Jan 18 23:50:04 noaaport rpc.ldmd[28912]: Starting Up (built: Nov 29 2000 15:21:47)
> Jan 18 23:50:04 noaaport pqbinstats[28913]: Starting Up (28912)
> Jan 18 23:50:04 noaaport pqact[28914]: Starting Up
> Jan 18 23:50:06 noaaport localhost[28923]: Connection from localhost
> Jan 18 23:50:06 noaaport localhost[28923]: Connection reset by peer
> Jan 18 23:50:06 noaaport localhost[28923]: Exiting
> Jan 18 23:50:20 noaaport fen00[28929]: Connection from fen00.colorado-research.com
> Jan 18 23:50:20 noaaport fen00(feed)[28929]: Starting Up: 20010118234947.086 TS_ENDT {{ANY,  ".*"}}
> Jan 18 23:50:20 noaaport fen00(feed)[28929]: topo:  fen00.colorado-research.com ANY
> Jan 23 05:02:56 noaaport rpc.ldmd[28912]: Exiting
> Jan 23 05:02:56 noaaport rpc.ldmd[28912]: Terminating process group
> Jan 23 05:02:56 noaaport fen00(feed)[28929]: Exiting
> Jan 23 05:02:57 noaaport pqact[28914]: Exiting
> Jan 23 05:02:57 noaaport pqbinstats[28913]: Exiting
> Jan 23 05:04:19 noaaport rpc.ldmd[28912]: _NOT_ ReReading configuration file /home/ldm/etc/ldmd.conf
> 
> After getting rid of whatever processes are left, and restarting the
> LDM, things work fine...for 4-7 days until the next freeze.
> 
> I don't know where to look on this.  It seems like a resource is getting
> used up someplace, but that's just a guess based on the fact that this
> happens at fairly regular intervals.  Disk space was fine when this
> happened, by the way.  Any suggestions as to where to look/start, would
> be helpful and appreciated!
> 
> Thanks.
> ps
> -------------------------------------------------------------------------
> Pete Stamus                          | Phone: (303) 415-9701 x224
> Colorado Research Associates (CoRA)* | Fax:   (303) 415-9702
> 3380 Mitchell Lane                   | email: address@hidden
> Boulder, Colorado 80301  USA         | *( CoRA is a division of NWRA )
> -------------------------------------------------------------------------
>    You can't trust your eyes when your imagination is out of focus.
>                                                       -- Mark Twain
> -------------------------------------------------------------------------
> 
> ------- End of Forwarded Message

Hi Pete,

This is a tough one.

We had another ingest site that was having a similar problem.  Although
we're not positive, one possibility there was that pqing was getting
hung up, perhaps by some bad cables that introduced a lot of noise and
subsequently bad products.  In theory pqing should just crash while
everything else continues to run, but in practise such an event can
apparently bring everything to a halt.

How are you invoking pqing? e.g., are you invoking it from ldmd.conf or
from the command line?  The answer to this question is of particular
interest to me.  Since I didn't see anything about pqing in your log,
I'm guessing that you're invoking it from the command line.

We need some more information.  Here are some things to try:

If you have the disk space, run pqing and possibly rpc.ldmd in verbose
mode.  Since the LDM will run for some days before hanging, this will
make for very large log files.  But, hopefully, it will give some
indication of the problem when it occurs again.  If verbose mode doesn't
do it, there's always debug mode.  To save space you can always rotate
the logs more frequently using 'ldmadmin newlog' - you could put this in
a cron job.
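
To make the cron idea concrete, here is a sketch of a crontab entry for
the ldm user.  The six-hour schedule and the relative path bin/ldmadmin
(which relies on cron starting jobs in the ldm user's home directory)
are assumptions on my part; adjust both to your installation.

```crontab
# Rotate the LDM logs every six hours (schedule is an assumption).
# The hours are listed explicitly because older crons do not accept
# step syntax such as */6.
0 0,6,12,18 * * * bin/ldmadmin newlog
```

Keep in mind that rotating more often means each rotated log covers less
time, so make sure that however many rotated logs your setup keeps still
reach back to the start of a freeze.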

When it happens again, check the system logs.  The system logs on
Solaris are in /var/adm/messages.
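
As a rough sketch of how to narrow that file down to the time of a
freeze, the helper below prints only the lines stamped within a given
hour.  The function name syslog_window is made up for illustration; it
relies only on grep and on the "Mon DD HH:MM:SS" timestamps that syslog
writes (the same format as in your ldmd.log).

```shell
#!/bin/sh
# syslog_window FILE "Mon DD" HH
# Print the lines of a syslog-style file whose timestamp falls within
# the given day and hour.  (Hypothetical helper, not part of the LDM.)
syslog_window() {
    file="$1"
    day="$2"
    hour="$3"
    # Anchor at the start of the line so only the timestamp is matched.
    grep "^$day $hour:" "$file"
}
```

For example, syslog_window /var/adm/messages "Jan 23" 05 would show the
hour in which you shut the LDM down; the hours before that, back to when
data stopped flowing, are probably the interesting ones.  Note that
syslog pads single-digit days with a space, e.g. "Jan  3".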

When this occurs, how are the other processes on the machine?  Is the
machine responsive?  What does 'top' say?  What's the system load?  Are
the ldm processes sleeping/running/zombie?  Check pqing and the parent
rpc.ldmd in particular (in your last episode that's the one that you
couldn't kill).
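
If 'top' isn't installed, something like the following sketch captures
the same basics with standard tools.  The function name ldm_snapshot and
the default user name "ldm" are assumptions; the ps columns used (pid,
ppid, s, etime, args) are standard -o keywords.

```shell
#!/bin/sh
# ldm_snapshot [USER]
# Print a timestamp, the system load, and the state of every process
# owned by USER (default "ldm", an assumption about your account name).
ldm_snapshot() {
    user="${1:-ldm}"
    echo "=== LDM process snapshot ==="
    date
    # uptime reports the 1, 5 and 15 minute load averages.
    uptime
    # The S column is the process state.  On Solaris: O = on a
    # processor, R = runnable, S = sleeping, T = stopped, Z = zombie.
    ps -u "$user" -o pid,ppid,s,etime,args
}
```

Running this from cron every few minutes and appending the output to a
file would leave you with a record of what the processes were doing
just before and during the next freeze.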

Maybe there's a problem with the I/O - try 'iostat'.

Just to see what it reports, you could try 'ldmadmin check' and
'ldmadmin queuecheck'.

If this happens at some convenient time (doesn't it always happen that
way?), you could contact me - I'd be willing and interested to log in and
look around.

I hope this helps.  Let me know what you find.

Anne
-- 
***************************************************
Anne Wilson                     UCAR Unidata Program            
address@hidden                 P.O. Box 3000
                                  Boulder, CO  80307
----------------------------------------------------
Unidata WWW server       http://www.unidata.ucar.edu/
****************************************************