
20011219: ldmfail failing and sending emails every 20 minutes



>From: "James R. Frysinger" <address@hidden>
>Organization: College of Charleston
>Keywords: 200111061842.fA6Igt112242 LDM binary install

Jim,

>Yikes, I'm getting bombarded by messages coming once every 20 minutes.
>
>After 3-4 days of playing with McIDAS I exerted some self-discipline 
>and left it alone today. Then this evening I opened up my email and 
>found a very full inbox. They all say:
>   No primary or failover found in /export/home/ldm/etc/ldmd.conf
>   From: sysadmin frysingj <address@hidden> (oper)
>   To: address@hidden
>   ldmfail: Dec 20 01:00:00 UTC
>except that the times change in 20 min increments, of course. This 
>onslaught started at 10:40:35 local time this morning. All subsequent 
>messages have :00 for the seconds. So that kind of narrows down the 
>time of the glitch.

OK, this was caused by my not being familiar with how ldmfail works
(what it is looking for in ldmd.conf, ldmd.pluto.met..., ldmd.cirp.met...).

ldmfail reads ldmd.conf and tries to match the -p or -f value from its
command line against the THIRD parameter on a request line, i.e., the
host has to appear on the same line as the 'request' keyword.  I had
set up your ldmd.conf files in an alternate manner:

request UNIDATA ".*"
        pluto.met.fsu.edu
#       cirp.met.utah.edu

I did this because we set up our ldmd.conf files like this on more than
one machine that we administer.  The only problem is that we _never_
run ldmfail on those machines (no need to), so the layout had never
been exercised against it!

So, I ssh'd in and changed your ~ldm/etc/ldmd.conf, ldmd.pluto.met.fsu.edu,
and ldmd.cirp.met.utah.edu files to have request lines that read like:

request UNIDATA ".*" pluto.met.fsu.edu
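
To illustrate the difference between the two layouts, here is a rough
Python sketch of the matching rule as described above.  It is NOT
ldmfail's actual code: the function name hosts_on_request_lines and the
simple whitespace splitting are assumptions made only for illustration.

```python
def hosts_on_request_lines(conf_text):
    """Return the third parameter of every single-line request entry.

    Hypothetical helper: mimics the rule "the host must be the third
    parameter on the request line", not the real ldmfail parser.
    """
    hosts = []
    for line in conf_text.splitlines():
        fields = line.split()
        # Skip blank lines, comment lines, and non-request entries.
        if not fields or fields[0].startswith("#") or fields[0] != "request":
            continue
        # Expected fields: request, feedtype, pattern, host
        if len(fields) >= 4:
            hosts.append(fields[3])
    return hosts


# The alternate (multi-line) layout: the host sits on its own line.
multi_line = 'request UNIDATA ".*"\n        pluto.met.fsu.edu\n#       cirp.met.utah.edu\n'
# The layout now in place: the host is on the request line itself.
single_line = 'request UNIDATA ".*" pluto.met.fsu.edu\n'

print(hosts_on_request_lines(multi_line))   # []
print(hosts_on_request_lines(single_line))  # ['pluto.met.fsu.edu']
```

With the multi-line layout no host is ever found on the request line,
which is consistent with the "No primary or failover found" message.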

I then tested bin/ldmfail from ~ldm "by hand" and am satisfied that it
will work correctly now (i.e., not bombard you with emails).

>I ssh'd in and did some checking. Did ldmadmin check first and got
>   weather[2] ldmadmin check
>   LDM status report from the logs for the last 42 hours.
>
>   Currently weather is running 92 percent idle
>   load average: 0.04, 0.04, 0.05
>   Running version number 5.1.4.
>   LDM was restarted 1 time(s)
>        Last LDM restart at Dec 19 15:40:33
>
>   Critical LDM problems that need immediate attention:
>
>   Potential LDM Problems:
>   Non-zero Status message occurred 2 time(s).
>        Last one at:  Dec 19 15:40:35
>'   Breaking connection' message occurred 50 time(s).
>        Last one at:  Dec 19 21:37:29
>        For atm it happened 46 time(s).
>        For cirp it happened 1 time(s).
>        For pluto it happened 2 time(s).
>        For striker it happened 1 time(s).
>'   RPC: Timed out' message occurred 3 time(s).
>        Last one at:  Dec 19 16:44:24
>        For FEEDME(atm.geo.nsf.gov): it happened 3 time(s).
>'   NULLPROC error' message occurred 2 time(s).
>        Last one at:  Dec 19 02:31:40
>        For atm.geo.nsf.gov it happened 2 time(s).
>
>Did ldmadmin watch and the stream was coming in, apparently OK. Found 
>that /etc/ldmd.conf has a time stamp of Dec 19 10:40, consistent with 
>some sort of a failure. Doing ll on logs gives (in part)
>       1941 Dec 19 20:32 ldmd.log
>   122161 Dec 19 20:31 ldmd.log.1
>   133660 Dec 19 10:40 ldmd.log.2
>   302063 Dec 18 23:56 ldmd.log.3
>   298775 Dec 17 23:56 ldmd.log.4
>So the logs rolled at Dec 19 10:40 as well. The next log rollover you 
>see there (Dec 19 20:31) is because of my actions. I scanned the log 
>ending Dec 19 10:40 and it looked like the system was having some 
>trouble. So I did
>   ldmadmin stop
>   ldmadmin clean
>   ldmadmin start
>Alas, that has not stopped the thrice-hourly messages. I did cat on 
>ldmd.conf and the two feeder files for it and all seems normal; nothing 
>got garbled in a failover. I have made archive copies in place of the 
>two logs preceding those two log rollovers, as well as a "normal" one 
>from the day before. Again, the first rollover was done for some reason 
>by the system and the second is because I stopped, cleaned, and started 
>LDM.
>
>I also did ldmadmin queuecheck _just_prior_ to my stop/clean/start and 
>things were A-OK; no messages resulted.
>
>That's about all that I can think of doing, Tom. Where do I go from 
>here?

Case solved; my bad; sorry.  I also deleted all of the emails from the
ldm mailbox.

>BTW, my folks both have colds, my wife has a cold, and our daughter in 
>Raleigh just came down with a cold. That's 4 out of 5 people that were 
>going to be in Ohio for Christmas. I'm the only remaining healthy human 
>being that was slated for that confab so we've just cancelled the trip! 
>Nobody feels like traveling and visiting, so I'll be hanging around 
>here.

Major bummer!  Have a nice Christmas at home at least.

Cheers!

Tom