
Re: 20020906: thelma not too good



Hi all,

The uptime.log is even weirder today.  Look at this snippet:

 12:06pm  up 1 day(s), 21:17,  6 users,  load average: 9.14, 10.77, 11.99
 12:07pm  up 1 day(s), 21:18,  6 users,  load average: 7.82, 10.04, 11.64
 12:08pm  up 1 day(s), 21:19,  6 users,  load average: 5.48, 8.91, 11.14
 12:09pm  up 1 day(s), 21:20,  6 users,  load average: 4.11, 7.91, 10.64
 12:10pm  up 1 day(s), 21:21,  6 users,  load average: 3.32, 7.02, 10.14
 12:11pm  up 1 day(s), 21:22,  6 users,  load average: 4.17, 6.61, 9.80
 12:12pm  up 1 day(s), 21:23,  6 users,  load average: 4.37, 6.23, 9.45
 12:13pm  up 1 day(s), 21:24,  6 users,  load average: 4.88, 6.07, 9.20
 12:14pm  up 1 day(s), 21:25,  6 users,  load average: 3.68, 5.58, 8.82
 12:15pm  up 1 day(s), 21:26,  6 users,  load average: 4.66, 5.47, 8.57
 12:16pm  up 1 day(s), 21:27,  6 users,  load average: 4.48, 5.29, 8.31
 12:17pm  up 1 day(s), 21:28,  6 users,  load average: 3.37, 4.86, 7.96
 12:18pm  up 1 day(s), 21:29,  6 users,  load average: 4.57, 4.86, 7.75
 12:19pm  up 1 day(s), 21:30,  6 users,  load average: 5.41, 5.16, 7.68
 12:20pm  up 1 day(s), 21:31,  6 users,  load average: 3.70, 4.70, 7.36
 12:21pm  up 1 day(s), 21:32,  6 users,  load average: 3.75, 4.52, 7.12
 12:22pm  up 1 day(s), 21:33,  6 users,  load average: 2.58, 4.08, 6.80
 12:23pm  up 1 day(s), 21:34,  6 users,  load average: 12.65, 6.56, 7.50
 12:24pm  up 1 day(s), 21:35,  6 users,  load average: 15.59, 8.51, 8.13
 12:25pm  up 1 day(s), 21:36,  6 users,  load average: 17.77, 10.44, 8.84
 12:26pm  up 1 day(s), 21:37,  6 users,  load average: 18.57, 11.98, 9.50

I can't correlate that 12:23 jump with anything in the LDM logs or the
system logs.  (/var/adm/messages is practically empty.)
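
For reference, something like the following ought to pull out the LDM log
window around that jump for a side-by-side look (the ldmd.log path here is
a guess; adjust it to wherever thelma actually keeps its LDM logs):

/local/ldm% grep '12:2[2-7]:' ~ldm/logs/ldmd.log

That just isolates 12:22-12:27 so the LDM activity can be read next to the
uptime snippet above.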

And, here's a traceroute from thelma to Penn State:

/local/ldm% traceroute ldm.meteo.psu.edu
traceroute: Warning: Multiple interfaces found; using 192.52.106.21 @ ge0
traceroute to ldm.meteo.psu.edu (128.118.28.12), 30 hops max, 40 byte packets
 1  vbnsr-dmzfnet (192.52.106.10)  0.698 ms  0.690 ms  0.434 ms
 2  mlra-n2 (128.117.2.253)  0.382 ms  0.375 ms  0.594 ms
 3  gin-n243-72 (128.117.243.73)  0.849 ms  0.735 ms  0.565 ms
 4  frgp-gw-1 (128.117.243.34)  1.543 ms  2.415 ms  1.700 ms
 5  198.32.11.105 (198.32.11.105)  2.239 ms  1.709 ms  1.509 ms
 6  kscy-dnvr.abilene.ucaid.edu (198.32.8.14)  12.183 ms  12.184 ms  12.815 ms
 7  ipls-kscy.abilene.ucaid.edu (198.32.8.6)  22.066 ms  21.362 ms  21.394 ms
 8  clev-ipls.abilene.ucaid.edu (198.32.8.26)  27.925 ms  27.939 ms  27.706 ms
 9  abilene.psc.net (192.88.115.122)  31.138 ms  30.860 ms  31.129 ms
10  bar-beast.psc.net (192.88.115.17)  31.111 ms  30.987 ms  31.156 ms
11  psu-i2.psc.net (192.88.115.98)  57.862 ms  42.568 ms  73.063 ms
12  * * *
13  * * *
14  * * *
15  * * *
16  * * *
17  * * *
18  * * *
19  * * *
20  * * *
21  * * *
22  * * *
23  * * *
24  * * *
25  * * *
26  * * *
27  * * *
28  * * *
29  * * *
30  * * *

From the LDM log, they're definitely losing CONDUIT products.
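
For a rough packet-loss number on that path (the timeouts past psu-i2 in
the traceroute don't say much by themselves), a quick ping burst from
thelma toward Penn State might help; the Solaris-style syntax, packet
size, and count below are just a suggestion:

/local/ldm% ping -s ldm.meteo.psu.edu 56 100

That sends 100 56-byte probes and prints loss and round-trip statistics at
the end, which would show whether the CONDUIT drops line up with raw
packet loss on the path.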

It probably would be helpful to get 5.2.1 in place on thelma, and to get
rtstats from Harry and Art.  I think I'll install 5.2.1 on milton this
weekend and let it run a bit to make sure it's in a usable state.
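
If it helps when the 5.2.1 installs happen: as far as I know, turning on
rtstats is just an exec entry in ldmd.conf along these lines (the collector
hostname is our usual one; the exact flags for 5.2.1 should be double-checked
against the release):

exec    "rtstats -h rtstats.unidata.ucar.edu"

followed by a stop/start of the LDM so the new entry is picked up.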

Anne


Tom Yoksas wrote:
> 
> >From:  anne <address@hidden>
> >Organization:  UCAR/Unidata
> >Keywords:  200209070333.g873XUj09291
> 
> Anne and Jeff,
> 
> >While thelma looked pretty good about 6:30 today, with a load average
> >around 5, now it's not looking so good.  The load average was about 14,
> >and it was sluggish in responding.
> 
> Nuts.
> 
> >There are only 71 rpc.ldmds at the moment, less than the 72 that I
> >thought we were able to handle easily before the reboot.  There are lots
> >of reclasses to atm, plus some to sunset.aos.wisc.edu.
> 
> >(What's 'aos'?).
> 
> This appears to be f5.aos.wisc.edu.  They are reporting realtime stats,
> and their latencies don't look good.  Seems to me that they should
> be feeding from SSEC, no?
> 
> >And connections are being dropped.
> 
> So, when the load average goes above some level, data stops getting
> delivered reliably and reclass messages ensue.
> 
> >I started a cron job to run uptime every minute to track the load
> >average.  The resulting log is in ~logs/uptime.log.
> 
> The contents of this file are very interesting.  The load average comes
> and goes.  We now need to correlate that with CONDUIT data volume (or
> anything else).
> 
> It seems to me that we need to jump on getting 5.2.1 ready so we can
> get both Washington and Penn State to upgrade to it and run rtstats.
> This should help us understand what is happening at these sites.
> 
> The overnight rtstats from atm and f5.aos are really interesting.
> atm looks OK except for NNEXRAD, and f5 looks bad.  I don't know
> what to make of this!
> 
> Tom
> --
> +-----------------------------------------------------------------------------+
> * Tom Yoksas                                             UCAR Unidata Program *
> * (303) 497-8642 (last resort)                                  P.O. Box 3000 *
> * address@hidden                                             Boulder, CO 80307 *
> * Unidata WWW Service                             http://www.unidata.ucar.edu/*
> +-----------------------------------------------------------------------------+

-- 
***************************************************
Anne Wilson                     UCAR Unidata Program            
address@hidden                  P.O. Box 3000
                                  Boulder, CO  80307
----------------------------------------------------
Unidata WWW server       http://www.unidata.ucar.edu/
****************************************************