
[Fwd: Re: problem with ldmadmin queuecheck]




-------- Original Message --------
Subject: Re: problem with ldmadmin queuecheck
Date: Mon, 23 Sep 2002 10:52:24 -0600
From: Anne Wilson <address@hidden>
Organization: UCAR
To: Jim Cowie <address@hidden>
CC: address@hidden
References: <address@hidden>

Hi Jim,

As you found, ldmadmin queuecheck doesn't find all problems with the
queue.

For exactly the reasons you described, we added a new option to pqcat,
the -s option, which is available in LDM V5.2.  We wanted to be able
to determine a priori whether a queue was corrupt.  'pqcat -s' scans
through a product queue, counting the number of products it finds, and
then compares that count to the number of products the queue thinks it
should have.  If the numbers agree it returns 0; otherwise it returns
1.  This will not work while the LDM is running, since the number of
products in the queue is constantly changing, but it does provide a
level of queue sanity checking when the LDM is stopped.  Note that it
does not examine the products themselves.
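
If you want to try it by hand before deciding whether to rebuild, you
can run pqcat yourself with the LDM stopped and test the exit status.
A minimal sketch (the -q option, which names the queue file
explicitly, and the $LDMHOME/data/ldm.pq path are assumptions; adjust
for your setup):

    ldmadmin stop
    # scan the queue, discarding log output
    pqcat -s -l /dev/null -q $LDMHOME/data/ldm.pq
    if test $? = 0 ; then
      echo "Product counts agree, queue looks sane."
    else
      echo "Product counts disagree, queue may be corrupt."
    fi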

Here's a snippet from a boot script that uses this:

                echo "Starting LDM ($LDMBIN/rpc.ldmd) via boot script."
                if [ ! -f $LDMHOME/data/ldm.pq ] ; then
                  echo "$LDMHOME/data/ldm.pq does not exist, making new
queue."
                  /bin/su - ldm -c "$LDMBIN/ldmadmin mkqueue"
                else
                  # queue exists, test queue "sanity"
                  pqcat -s -l /dev/null
                  if test $? != 0
                  then
                    echo "Queue appears corrupt, deleting and
rebuilding."
                    /bin/su - ldm -c "$LDMBIN/ldmadmin delqueue"
                    /bin/su - ldm -c "$LDMBIN/ldmadmin mkqueue"
                  else
                    echo "Using existing queue."
                  fi
                fi
                # In case of unclean shutdown, remove pid and lck files
                /bin/su - ldm -c "$LDMBIN/ldmadmin clean"
                /bin/su - ldm -c "$LDMBIN/ldmadmin start"

This saves us from rebuilding a queue unnecessarily.  However, it's
not free - it takes a bit of time, depending on the queue size.  For
our 7GB queue we've timed it at about six minutes.  In that case, if
we do our own clean shutdown before a reboot and are thus sure that
the queue is good, we bypass the sanity check and simply restart the
LDM by hand.
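
If you want the boot script itself to honor a clean shutdown, one
approach (just a sketch, not from our scripts - the marker file name
is made up for illustration) is to have your shutdown procedure leave
a flag file that the boot script checks before running the scan:

                # At shutdown, after a clean stop, leave a marker
                # (hypothetical name) saying the queue can be trusted:
                /bin/su - ldm -c "$LDMBIN/ldmadmin stop"
                touch $LDMHOME/data/.clean_shutdown

                # At boot, in place of the unconditional pqcat test:
                if [ -f $LDMHOME/data/.clean_shutdown ] ; then
                  rm -f $LDMHOME/data/.clean_shutdown
                  echo "Clean shutdown detected, using existing queue."
                else
                  pqcat -s -l /dev/null
                  if test $? != 0 ; then
                    echo "Queue appears corrupt, deleting and rebuilding."
                    /bin/su - ldm -c "$LDMBIN/ldmadmin delqueue"
                    /bin/su - ldm -c "$LDMBIN/ldmadmin mkqueue"
                  fi
                fi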

Anne
-- 
***************************************************
Anne Wilson                     UCAR Unidata Program            
address@hidden                 P.O. Box 3000
                                  Boulder, CO  80307
----------------------------------------------------
Unidata WWW server       http://www.unidata.ucar.edu/
****************************************************


Jim Cowie wrote:
> 
> I'm trying to come up with a foolproof way of knowing whether the
> LDM product queue is intact after a system reboot. The ldmadmin
> perl script includes a function called "queuecheck" which runs
> pqcat to read through the queue and returns a 1 status if it detects
> a problem. This is great, but it doesn't always seem to work.
> 
> For example, I currently have a corrupted queue because of a sudden
> reboot. My ldm log says:
> 
> Sep 23 16:17:16 vis rpc.ldmd[17335]: Starting Up (built: Jun 21 2002 10:16:41)
> Sep 23 16:17:16 vis ofour[17337]: run_requester: Starting Up: ofour.rap.ucar.edu
> Sep 23 16:17:16 vis pqact[17336]: Starting Up
> Sep 23 16:17:16 vis front[17338]: run_requester: Starting Up: front.rap.ucar.edu
> Sep 23 16:17:18 vis localhost[17346]: Connection from localhost
> Sep 23 16:17:18 vis localhost[17346]: Connection reset by peer
> Sep 23 16:17:18 vis localhost[17346]: Exiting
> Sep 23 16:17:21 vis front[17338]: run_requester: 20020923151716.313 TS_ENDT {{WMO,  ".*"}}
> Sep 23 16:17:21 vis ofour[17337]: run_requester: 20020923151716.239 TS_ENDT {{WMO,  ".*"}}
> Sep 23 16:17:21 vis ofour[17337]: FEEDME(ofour.rap.ucar.edu): OK
> Sep 23 16:17:22 vis front[17338]: FEEDME(front.rap.ucar.edu): OK
> Sep 23 16:17:22 vis front[17338]: assertion "rl->nelems + rl->nfree + rl->nempty == rl->nalloc" failed: file "pq.c", line 1993
> Sep 23 16:17:23 vis ofour[17337]: assertion "rl->nelems + rl->nfree + rl->nempty == rl->nalloc" failed: file "pq.c", line 1993
> Sep 23 16:17:29 vis rpc.ldmd[17335]: child 17337 terminated by signal 6
> Sep 23 16:17:29 vis rpc.ldmd[17335]: Killing (SIGINT) process group
> Sep 23 16:17:29 vis rpc.ldmd[17335]: Interrupt
> Sep 23 16:17:29 vis rpc.ldmd[17335]: Exiting
> Sep 23 16:17:29 vis pqact[17336]: Interrupt
> Sep 23 16:17:29 vis pqact[17336]: Exiting
> Sep 23 16:17:29 vis rpc.ldmd[17335]: Terminating process group
> Sep 23 16:17:29 vis rpc.ldmd[17335]: child 17338 terminated by signal 6
> Sep 23 16:17:29 vis rpc.ldmd[17335]: Killing (SIGINT) process group
> 
> Clearly, there is a problem with the queue. But when I run ldmadmin
> queuecheck, I get a 0 exit status, indicating that the queue is OK.
> I could use brute force and always create a new queue, but I'd like
> to be able to determine whether or not it is corrupted.
> 
> Does anybody know if this is *supposed* to work? I am running ldm 5.1.4
> on Solaris 7 (x86-intel).
> 
> --
> Jim Cowie
> NCAR/RAP
> address@hidden
> 303-497-2831