
Re: aeolus problems - LDM dying



Russ Rew wrote:
> 
> 
> After thinking about it more, it turns out if the signature (an array
> of 4 ints) was all zeros, n would be zero and the assertion "n>0" would
> be violated, even though n was an unsigned int.
> 
> Although it's not supposed to be possible to get an all-zero signature
> (it's the result of an MD5 digest of a product), it also seems likely
> that a memory failure might be manifested as reading all zeros for a
> memory fetch, or that a disk corruption might have the symptom of
> zeroing out some bytes on the disk where signatures were stored.
> 
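
To make Russ's point concrete, here is a rough sketch in Python (the LDM
itself is in C, and this is not its code).  How n is really derived from
the signature isn't shown in this thread, so the XOR fold and the names
fold_signature and assertion_holds are assumptions for illustration only.
The point is just that an unsigned value can never be negative but can
still be zero, which is enough to violate "n>0".

```python
# Hypothetical sketch, not the LDM source: the fold below is a stand-in
# for however n is actually computed from the 4-int signature.
MASK32 = 0xFFFFFFFF  # emulate a 32-bit unsigned int


def fold_signature(words):
    """XOR four 32-bit words into a single unsigned 32-bit value."""
    if len(words) != 4:
        raise ValueError("a signature is expected to be 4 ints")
    n = 0
    for w in words:
        n ^= w & MASK32
    return n


def assertion_holds(words):
    """Mirror the check n > 0; False means the assertion would fire."""
    n = fold_signature(words)
    # n is never negative here, but an all-zero signature still gives n == 0.
    # (This particular fold can also hit 0 for some non-zero inputs.)
    return n > 0


normal = assertion_holds([0x12345678, 0x9ABCDEF0, 0x0, 0x1])
zeroed = assertion_holds([0, 0, 0, 0])
```

With an all-zero signature, assertion_holds returns False, which matches
the failure described above.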

So, this leaves us in an unresolved state.  From the system logs we saw
that aeolus had a CPU panic and rebooted itself at 07:56 local time,
and an hour later it corrected a memory error.  But the assertion
violations reported in the ldm logs, the ones that caused the crashes,
occurred hours after that.

I also can't explain the bad latencies that were logged for only a few
products:

ldmd.log.3:Feb 05 17:41:04 aeolus motherlode[1329]: skipped: 20020205160304.032 (2280.714 seconds)
ldmd.log.3:Feb 05 18:03:47 aeolus motherlode[1329]: skipped: 20020205164524.036 (1102.943 seconds)
ldmd.log.2:Feb 05 22:23:24 aeolus motherlode[3932]: skipped: 20020205211554.685 (449.618 seconds)
ldmd.log.1:Feb 05 22:59:31 aeolus motherlode[4249]: skipped: 20020205215159.267 (451.825 seconds)

In two out of four crashes that I am aware of, these skipped products
occurred immediately before the assertion failure.  In a third crash two
products were skipped well before the crash, and in the fourth crash
there were no such skips.  I guess the bad latencies are unrelated to
the crashes, and must just reflect some problem in the connection during
that 5+ hour time period.  Still, it seems odd that just a few products
would have such bad latencies.
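
If we want to line these skips up against the crash times across all the
ldmd.log files, something like the following might help.  It is only a
sketch: the pattern is inferred from the four lines quoted above, and
parse_skipped is a hypothetical helper, not part of the LDM.  It pulls
out the product timestamp and the seconds value without interpreting
either one.

```python
import re
from datetime import datetime

# Layout inferred from the four "skipped" lines in this message only.
SKIPPED = re.compile(
    r"^(?P<file>\S+?):(?P<logged>\w{3} \d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) (?P<proc>\w+)\[(?P<pid>\d+)\]: skipped: "
    r"(?P<product>\d{14}\.\d{3}) \((?P<seconds>[\d.]+) seconds\)$"
)


def parse_skipped(line):
    """Return the fields of one 'skipped' entry, or None if it doesn't match."""
    # Collapse runs of whitespace so entries wrapped by a mailer still match.
    m = SKIPPED.match(" ".join(line.split()))
    if m is None:
        return None
    return {
        "file": m["file"],
        "logged": m["logged"],  # kept as text: the log line carries no year
        "host": m["host"],
        "pid": int(m["pid"]),
        "product_time": datetime.strptime(m["product"], "%Y%m%d%H%M%S.%f"),
        "seconds": float(m["seconds"]),
    }


sample = (
    "ldmd.log.3:Feb 05 17:41:04 aeolus motherlode[1329]: skipped:\n"
    "20020205160304.032 (2280.714 seconds)"
)
record = parse_skipped(sample)
```

Sorting the resulting records by "logged" or "seconds" would show quickly
whether the skips cluster around any of the crashes.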

So, we can't say for sure what went wrong.  I suggest we watch aeolus
for the rest of the day and if it behaves properly then send out a note
to the effect that downstream sites could reconnect, although perhaps
with a caveat...

Anne
-- 
***************************************************
Anne Wilson                     UCAR Unidata Program            
address@hidden                 P.O. Box 3000
                                  Boulder, CO  80307
----------------------------------------------------
Unidata WWW server       http://www.unidata.ucar.edu/
****************************************************