
[Fwd: [Fwd: aeolus problems - LDM dying]]



anne wrote:
> 
> anne wrote:
> >
> > Hi Russ and Mike,
> >
> > Larry Riddle's LDM, on aeolus, an OSF1 alpha, is having problems.  It
> > keeps shutting down with the same error message reported in the log:
> >
> > ldmd.log.1:Feb 05 22:51:59 aeolus motherlode[4249]: run_requester:
> > Starting Up: motherlode.ucar.edu
> > ldmd.log.1:Feb 05 22:59:29 aeolus motherlode[4249]: run_requester:
> > 20020205215159.112 TS_ENDT {{FSL2|UNIDATA,  ".*"},{NNEXRAD,
> > ".*"},{DIFAX,  ".*"}}
> > ldmd.log.1:Feb 05 22:59:30 aeolus motherlode[4249]:
> > FEEDME(motherlode.ucar.edu): OK
> > ldmd.log.1:Feb 05 22:59:31 aeolus motherlode[4249]: RECLASS:
> > 20020205215931.091 TS_ENDT {{FSL2|UNIDATA,       ".*"},{NNEXRAD,
> > ".*"},{DIFAX,    ".*"}}
> > ldmd.log.1:Feb 05 22:59:31 aeolus motherlode[4249]: skipped:
> > 20020205215159.267 (451.825 seconds)
> > ldmd.log.1:Feb 05 22:59:32 aeolus motherlode[4249]: assertion "n > 0"
> > failed: file "pq.c", line 2172
> > -----
> > ldmd.log.2:Feb 05 22:15:54 aeolus motherlode[3932]: run_requester:
> > Starting Up: motherlode.ucar.edu
> > ldmd.log.2:Feb 05 22:23:23 aeolus motherlode[3932]: run_requester:
> > 20020205211554.650 TS_ENDT {{FSL2|UNIDATA,  ".*"},{NNEXRAD,
> > ".*"},{DIFAX,  ".*"}}
> > ldmd.log.2:Feb 05 22:23:23 aeolus motherlode[3932]:
> > FEEDME(motherlode.ucar.edu): OK
> > ldmd.log.2:Feb 05 22:23:24 aeolus motherlode[3932]: RECLASS:
> > 20020205212324.304 TS_ENDT {{FSL2|UNIDATA,       ".*"},{NNEXRAD,
> > ".*"},{DIFAX,    ".*"}}
> > ldmd.log.2:Feb 05 22:23:24 aeolus motherlode[3932]: skipped:
> > 20020205211554.685 (449.618 seconds)
> > ldmd.log.2:Feb 05 22:23:25 aeolus motherlode[3932]: assertion "n > 0"
> > failed: file "pq.c", line 2172
> > -----
> > ldmd.log.3:Feb 05 17:00:29 aeolus motherlode[1329]: run_requester:
> > Starting Up: motherlode.ucar.edu
> > ldmd.log.3:Feb 05 17:00:29 aeolus motherlode[1329]: run_requester:
> > 20020205160029.865 TS_ENDT {{FSL2|UNIDATA,  ".*"},{NNEXRAD,
> > ".*"},{DIFAX,  ".*"}}
> > ldmd.log.3:Feb 05 17:00:30 aeolus motherlode[1329]:
> > FEEDME(motherlode.ucar.edu): OK
> > ldmd.log.3:Feb 05 17:41:04 aeolus motherlode[1329]: RECLASS:
> > 20020205164104.746 TS_ENDT {{FSL2|UNIDATA,       ".*"},{NNEXRAD,
> > ".*"},{DIFAX,    ".*"}}
> > ldmd.log.3:Feb 05 17:41:04 aeolus motherlode[1329]: skipped:
> > 20020205160304.032 (2280.714 seconds)
> > ldmd.log.3:Feb 05 18:03:47 aeolus motherlode[1329]: RECLASS:
> > 20020205170346.979 TS_ENDT {{FSL2|UNIDATA,       ".*"},{NNEXRAD,
> > ".*"},{DIFAX,    ".*"}}
> > ldmd.log.3:Feb 05 18:03:47 aeolus motherlode[1329]: skipped:
> > 20020205164524.036 (1102.943 seconds)
> > ldmd.log.3:Feb 05 20:59:38 aeolus motherlode[1329]: assertion "n > 0"
> > failed: file "pq.c", line 2172
> >
> > The function that is failing is this:
> > /*
> >  * Hash function for signature.
> >  */
> > static size_t
> > sx_hash(size_t nchains, const signaturet sig)
> > {
> >   size_t h;
> >   int i;
> >   unsigned int n;
> >
> >   n = 0;
> >   for(i=0; i<4; i++)
> >     n = 256*n + sig[i];
> >   assert(n > 0);
> >   h = n % nchains;
> >   return h;
> > }
> >
> > Perhaps the signatures are being corrupted?
> >
> > It's interesting that the latencies on these skipped products are
> > terrible, ranging from about 450 to 2280 seconds.  ldmpings from
> > motherlode to aeolus aren't very good either, with some elapsed
> > times over 100 milliseconds:
> >
> > motherlode.ucar.edu% ldmping -i2 aeolus.ucsd.edu
> > Feb 06 01:08:44      State    Elapsed Port   Remote_Host   rpc_stat
> > ... (aeolus LDM started here)
> > Feb 06 01:09:40 RESPONDING   0.092502  388   aeolus.ucsd.edu
> > Feb 06 01:09:42 RESPONDING   0.065875  388   aeolus.ucsd.edu
> > Feb 06 01:09:44 RESPONDING   0.038995  388   aeolus.ucsd.edu
> > Feb 06 01:09:46 RESPONDING   0.039381  388   aeolus.ucsd.edu
> > Feb 06 01:09:48 RESPONDING   0.038904  388   aeolus.ucsd.edu
> > Feb 06 01:09:51 RESPONDING   0.039140  388   aeolus.ucsd.edu
> > Feb 06 01:09:53 RESPONDING   0.047059  388   aeolus.ucsd.edu
> > Feb 06 01:09:55 RESPONDING   0.039036  388   aeolus.ucsd.edu
> > Feb 06 01:09:57 RESPONDING   0.039950  388   aeolus.ucsd.edu
> > Feb 06 01:09:59 RESPONDING   0.040719  388   aeolus.ucsd.edu
> > Feb 06 01:10:01 RESPONDING   0.104465  388   aeolus.ucsd.edu
> > Feb 06 01:10:03 RESPONDING   0.050099  388   aeolus.ucsd.edu
> > Feb 06 01:10:05 RESPONDING   0.118380  388   aeolus.ucsd.edu
> > Feb 06 01:10:07 RESPONDING   0.039413  388   aeolus.ucsd.edu
> > Feb 06 01:10:09 RESPONDING   0.050446  388   aeolus.ucsd.edu
> > Feb 06 01:10:11 RESPONDING   0.044901  388   aeolus.ucsd.edu
> > Feb 06 01:10:13 RESPONDING   0.041743  388   aeolus.ucsd.edu
> > Feb 06 01:10:15 RESPONDING   0.039329  388   aeolus.ucsd.edu
> > Feb 06 01:10:17 RESPONDING   0.044745  388   aeolus.ucsd.edu
> > Feb 06 01:10:19 RESPONDING   0.040108  388   aeolus.ucsd.edu
> > Feb 06 01:10:21 RESPONDING   0.050392  388   aeolus.ucsd.edu
> > Feb 06 01:10:23 RESPONDING   0.040905  388   aeolus.ucsd.edu
> > Feb 06 01:10:25 RESPONDING   0.039391  388   aeolus.ucsd.edu
> > Feb 06 01:10:27 RESPONDING   0.058450  388   aeolus.ucsd.edu
> >
> > The queue seems ok:
> >
> > aeolus.ucsd.edu> pqmon -i2
> > Feb 06 01:19:37 pqmon: Starting Up (5892)
> > Feb 06 01:19:37 pqmon: nprods nfree  nempty      nbytes  maxprods  maxfree  minempty    maxext  age
> > Feb 06 01:19:37 pqmon: 108327     1   74777   749998248    158714       12     24390      3928 20276
> > Feb 06 01:19:39 pqmon: 108321     1   74783   749993144    158714       12     24390      9032 20271
> > Feb 06 01:19:41 pqmon: 108318     1   74786   750001632    158714       12     24390       544 20267
> > Feb 06 01:19:43 pqmon: 108321     1   74783   749998984    158714       12     24390      3192 20267
> > Feb 06 01:19:45 pqmon: 108328     1   74776   749999056    158714       12     24390      3120 20265
> > Feb 06 01:19:47 pqmon: 108334     1   74770   749997760    158714       12     24390      4416 20266
> > Feb 06 01:19:49 pqmon: 108360     1   74744   750001048    158714       12     24390      1128 20265
> > Feb 06 01:19:51 pqmon: 108372     1   74732   749995496    158714       12     24390      6680 20265
> > Feb 06 01:19:53 pqmon: 108383     1   74721   749997800    158714       12     24390      4376 20262
> > Feb 06 01:19:55 pqmon: 108415     1   74689   749996816    158714       12     24390      5360 20263
> > Feb 06 01:19:55 pqmon: Interrupt
> > Feb 06 01:19:55 pqmon: Exiting
> >
> > I do see some messages in the system log that make me suspicious - these
> > are for Mike:
> >
> > Feb  4 11:40:03 aeolus vmunix:  RFS3_WRITE, client address =
> > 132.239.94.91, errno 22
> > Feb  5 07:56:31 aeolus vmunix: panic (cpu 0): vm_page_activate: already
> > active
> > Feb  5 07:56:31 aeolus vmunix: syncing disks... 237 122 30 done
> > Feb  5 07:56:31 aeolus vmunix: DUMP.prom: dev SCSI 0 6 0 0 300 0
> > FLAMG-IO, block 722079
> > Feb  5 07:56:31 aeolus vmunix: DUMP.prom: dev SCSI 0 6 0 0 300 0
> > FLAMG-IO, block 722079
> > Feb  5 07:56:31 aeolus vmunix: Alpha boot: available memory from
> > 0xbc4000 to 0xe000000
> > Feb  5 07:56:31 aeolus vmunix: Compaq Tru64 UNIX V5.0A (Rev. 1094); Thu
> > Nov 29 07:51:09 PST 2001
> > ...
> > Feb  5 07:57:58 aeolus vmunix: fta0: Link Unavailable.
> > Feb  5 07:58:51 aeolus vmunix: Mouse/Tablet has failed to reset.
> > Feb  5 07:59:19 aeolus last message repeated 2 times
> > Feb  5 08:59:16 aeolus vmunix: Memory error corrected by system
> > Feb  5 08:59:16 aeolus vmunix:  biu_stat        = 0000000000000240
> > Feb  5 08:59:16 aeolus vmunix:  biu_addr        = 00000001d4000018
> > Feb  5 08:59:16 aeolus vmunix:  dc_stat         = 0000000000000007
> > Feb  5 08:59:16 aeolus vmunix:  fill_syndrome   = 0000000000000000
> > Feb  5 08:59:16 aeolus vmunix:  fill_addr       = 0000000000065350
> > Feb  5 08:59:16 aeolus vmunix:  bc_tag          = 003c090000005428
> > Feb  5 08:59:16 aeolus vmunix:  ident           = 0
> >
> > Do you have any ideas about this?
> >
> > My next step will be to rebuild the queue.  I'll save the old queue just
> > in case it might be useful.
> >
> > Anne
> 
> --
> ***************************************************
> Anne Wilson                     UCAR Unidata Program
> address@hidden                  P.O. Box 3000
>                                   Boulder, CO  80307
> ----------------------------------------------------
> Unidata WWW server       http://www.unidata.ucar.edu/
> ****************************************************
