
Re: Aeolus Problems



Larry Riddle wrote:
> 
> I don't know if it was the change to thelma or the fact that aeolus is not
> running in debug ("Heisenberg") mode, but the ldm on aeolus.ucsd.edu has
> been shutting itself down two or three times a day, all weekend long.  I
> haven't touched any of the log files; there may be some useful info there.
> 
> For the next time it shuts down, can someone tell me what needs to be done
> to start it up again in debug mode?  When the ghost of Heisenberg is
> watching aeolus, we don't seem to have any trouble.
> 
> Larry
> 
>       ---===---=-=-=-=-=-=-=-=-=-=-=====[\/]=====-=-=-=-=-=-=-=-=-=-=---===---
>     -----===(*  Climate's what we expect, but weather's what we get.  *)===-----
>    Larry Riddle : Climate Research Division : Scripps Institution of Oceanography
>        University of California, San Diego : La Jolla, California  92093-0224
>        Phone: (858) 534-1869 : Fax: (858) 534-8561 : E-Mail: address@hidden

Hi Larry,

I'm sorry to hear about these problems this weekend!

I looked around aeolus and found no messages reported in the ldm logs. 
However, there were problems in the system log, /var/adm/messages. 
Here's the most recent:


Apr  8 01:42:38 aeolus vmunix: trap: invalid memory write access from kernel mode
Apr  8 01:42:38 aeolus vmunix:
Apr  8 01:42:38 aeolus vmunix:     faulting virtual address:     0x0000000000000018
Apr  8 01:42:38 aeolus vmunix:     pc of faulting instruction:   0xfffffc00003e28e0
Apr  8 01:42:38 aeolus vmunix:     ra contents at time of fault: 0xfffffc00003e2898
Apr  8 01:42:38 aeolus vmunix:     sp contents at time of fault: 0xffffffff930bf900
Apr  8 01:42:38 aeolus vmunix:
Apr  8 01:42:38 aeolus vmunix: panic (cpu 0): kernel memory fault
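
If you or your system administrator want to pull this out yourself,
commands along these lines should do it (the ldm log path is just my
assumption of a typical setup, so adjust it to match your installation):

# tail -50 /var/adm/messages      # system log, where the panics show up
# tail -50 ~ldm/logs/ldmd.log     # ldm log (path is an assumption)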

Going back through the system log, there are several panic messages, five
of them within just the past four days:

# grep panic /var/adm/messages
Jan  3 14:19:51 aeolus vmunix: panic (cpu 0): kernel memory fault
Jan  7 09:39:45 aeolus vmunix: panic (cpu 0): kernel memory fault
Jan 19 14:55:42 aeolus vmunix: panic (cpu 0): vm_page_activate: already active
Feb  5 07:56:31 aeolus vmunix: panic (cpu 0): vm_page_activate: already active
Mar 12 14:18:00 aeolus vmunix: panic (cpu 0): ialloc: dup alloc
Apr  4 18:48:46 aeolus vmunix: panic (cpu 0): vm_page_activate: already active
Apr  5 19:31:38 aeolus vmunix: panic (cpu 0): vm_page_activate: already active
Apr  6 07:34:42 aeolus vmunix: panic (cpu 0): kernel memory fault
Apr  7 20:34:16 aeolus vmunix: panic (cpu 0): vm_page_activate: already active
Apr  8 01:42:38 aeolus vmunix: panic (cpu 0): kernel memory fault

Mike says this indicates a memory problem, which might also explain the
assertion errors you experienced earlier.  Indeed, each of the last two
panics occurred within six minutes of the final log entries from an
active ldm process that subsequently died.
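
If you want to check that correlation yourself, lining up the panic
times against the tail of the ldm log is enough; something like this
(again, the ldm log path is an assumption):

# grep panic /var/adm/messages | tail -2    # times of the two most recent panics
# tail -30 ~ldm/logs/ldmd.log               # last entries ldm wrote before it died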

Mike advised that you remove the memory chips and reseat them to see if
the problem goes away; it could just be a bad connection.  If the
problem recurs, apparently the next step is to swap the chips into
different slots and see whether the problem stays in the same location
or moves with the chips.  If it stays in the same location, the problem
is in the slot rather than the chip, although generally the problem is
in the chip.

I would show this to your system administrator.  Because I don't think
this is an ldm problem, I did not put the ldm in debug mode.  If you
would still like to know how to do this, let me know and I'll send it in
a separate email.

Anne
-- 
***************************************************
Anne Wilson                     UCAR Unidata Program            
address@hidden                 P.O. Box 3000
                                  Boulder, CO  80307
----------------------------------------------------
Unidata WWW server       http://www.unidata.ucar.edu/
****************************************************