
20050214: LDM product queue corruption



Gabe,

>Date: Mon, 14 Feb 2005 13:47:01 -0500 (EST)
>From: Gabe Langbauer <address@hidden>
>To: Steve Emmerson <address@hidden>
>Subject: Re: 20050214: LDM product queue corruption 

The above message contained the following:

> I am unsure where this ldmping initiated from.  My ldm crontab is as
> follows:
> 35 * * * * /usr/local/ldm/bin/ldmadmin dostats
> 0 0 * * * /usr/local/ldm/bin/ldmadmin newlog

"ldmadmin dostats", eh?  That command is no longer useful.  I don't
think it could affect a running LDM, but, just to be sure, do the
following:

    1.  Remove the "ldmadmin dostats" command from the LDM user's
        crontab(1) file.

    2.  Have the following entry enabled in the LDM
        configuration-file, etc/ldmd.conf:

            exec        "pqbinstats"

        The pqbinstats(1) program saves statistics on the LDM system in
        *.stats files in the LDM user's "logs" subdirectory.

    3.  Add the following entry to the scour(1) configuration-file,
        etc/scour.conf:

            ~ldm/logs   1       *.stats

        This ensures that the number of *.stats files won't increase
        indefinitely.
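
Step 1 can be done by hand with "crontab -e".  As a minimal sketch of a
scripted alternative, the function below filters "ldmadmin dostats" lines
out of crontab text read on standard input.  The name strip_dostats is
hypothetical, not part of the LDM; preview the result before installing it.

```shell
# Hypothetical helper (not an LDM command): drop every crontab line that
# runs "ldmadmin dostats" and pass all other lines through unchanged.
strip_dostats() {
    grep -v 'ldmadmin[[:space:]][[:space:]]*dostats'
}

# Run as the LDM user.  Preview the filtered crontab first:
#   crontab -l | strip_dostats
# Install it once the preview looks right:
#   crontab -l | strip_dostats | crontab -
```

Because the function only filters text, it can be tried on a saved copy of
the crontab without touching the live one.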

> #  Check for incoming data and failover if upstream site is dead
> #10,30,50 * * * * /usr/local/ldm/bin/ldmfail -p stokes.metr.ou.edu -f
> pluto.met.fsu.edu > /dev/null 2>&1 /dev/null
>   
> #  Scour the data directories
> 0 * * * * /usr/local/ldm/bin/ldmadmin scour > /dev/null 2>&1
>   
> #  Rotate and remove the decoder logs - the trailing digit
> #  tells the script how many days of logs to keep
> #
> 0 0 * * * /usr/local/ldm/bin/newlog data/logs/dcacars.log 1
> 0 0 * * * /usr/local/ldm/bin/newlog data/logs/dcamos.log 1
> 0 0 * * * /usr/local/ldm/bin/newlog data/logs/dcmmos.log 1
> 0 0 * * * /usr/local/ldm/bin/newlog data/logs/dcnmos.log 1
> 0 0 * * * /usr/local/ldm/bin/newlog data/logs/dcnldn.log 1
> 0 0 * * * /usr/local/ldm/bin/newlog data/logs/dcncprof.log 1
> 0 0 * * * /usr/local/ldm/bin/newlog data/logs/dctrop.log 1
> 0 0 * * * /usr/local/ldm/bin/newlog data/logs/dcwatch.log 1
> 0 0 * * * /usr/local/ldm/bin/newlog data/logs/dcffg.log 1
> 0 0 * * * /usr/local/ldm/bin/newlog data/logs/dcstorm.log 1
> 0 0 * * * /usr/local/ldm/bin/newlog data/logs/dcgrib.log 1
> 0 0 * * * /usr/local/ldm/bin/newlog data/logs/dchrly.log 1
> 0 0 * * * /usr/local/ldm/bin/newlog data/logs/dcsynop_sb.log 1
> 0 0 * * * /usr/local/ldm/bin/newlog data/logs/dcsynop_syn.log 1
> 0 0 * * * /usr/local/ldm/bin/newlog data/logs/dcuair.log 1
> 
> So, the only things going on here are rotating logs and some stats.  A
> check of my gempak crontab (ldm and gempak are virtually the only things
> running on the machine) shows nothing occurring at ~20:43 except
> scripts that are called at the same time every hour or possibly
> ngm.csh, which is called at 20:00.  ngm.csh is simply a script that calls
> other ngm scripts to create gempak products.  We source the Gemenviron and
> set the display, then use the 'date' command to get the current time, then
> run gempak.  Nowhere is there any mention of ldm, nor do I believe it would
> have permissions to make a call such as ldmping.

You've got to find out where that ldmping(1) came from to ensure that
whatever's causing it isn't interfering in other ways with the LDM.
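
One way to start looking is to search the scripts and crontab files on the
machine for the string "ldmping".  The sketch below is only an
illustration: the function name find_ldmping is hypothetical, and the
directories in the usage comment are guesses that will differ from system
to system.

```shell
# Hypothetical helper: list every regular file under the given paths
# whose contents mention "ldmping".  Unreadable files are skipped quietly.
find_ldmping() {
    find "$@" -type f -exec grep -l 'ldmping' {} + 2>/dev/null
}

# Example (paths are assumptions; reading cron spool files usually
# requires root):
#   find_ldmping /var/spool/cron /usr/local/ldm/bin /home/gempak
```

A match only shows where ldmping is mentioned; the log entries around
20:43 are still needed to tell which script actually ran.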

> Another interesting development occurred this weekend.  I was able to
> "capture ldm in the act".  LDM crashed around 00:15 UTC  and I
> realized that it was down.

Can you send me the log entries for that time?

> I ssh'd in and issued the command ldmadmin
> clean.

Did you ensure that the LDM system wasn't running?  Doing an "ldmadmin
clean" when the LDM is running will cause the *.pid file to be removed
and could result in "orphaned" LDM processes.
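
A minimal sketch of such a check follows.  It assumes that LDM server
processes have "ldmd" somewhere in their command name (for example
rpc.ldmd); verify that against your own "ps" output.  The names ldm_pids
and safe_clean are hypothetical, not LDM commands.

```shell
# Print the PIDs of processes whose command name contains "ldmd".
# Reads "PID COMMAND" lines on standard input, so it can be tried on
# sample text as well as on real ps(1) output.
ldm_pids() {
    awk '$2 ~ /ldmd/ { print $1 }'
}

# Hypothetical wrapper: refuse to run "ldmadmin clean" while any such
# process still exists.
safe_clean() {
    pids=$(ps -e -o pid= -o comm= | ldm_pids)
    if [ -n "$pids" ]; then
        echo "LDM still running (PIDs:" $pids"); stop it first" >&2
        return 1
    fi
    ldmadmin clean
}
```

Child processes with other names (decoders started from the LDM, for
instance) are not caught by this name match and would need to be checked
separately.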

> This command successfully completed.  I then issued the command
> ldmadmin start, which appeared to work correctly.  However, when I
> issued ldmadmin watch I was given the message "there is no ldm running on
> this machine"

That's mighty suspicious.  Can you send me the log entries for that
time?

> I tried this same sequence a couple more times and I
> delqueued and mkqueued and physically removed (via rm) the pid file and so
> forth.  LDM, however, refused to start.  Immediately at 01:00 UTC I issued
> the same command ldmadmin clean && ldmadmin start as I had done several
> times during the previous hour.  Magically, this time it worked.  This
> leads me to believe that there was some program running at that time that
> immediately corrupted the ldm.  But I'm unsure what could be responsible.

Any chance of my logging onto the system in question to examine the LDM
setup?

Regards,
Steve Emmerson