
20030318: ldm on waldo at stc; McIDAS-XCD scouring moved to 'ldm' account



>From: "Anderson, Alan C. " <address@hidden>
>Organization: St. Cloud State
>Keywords: 200303181640.h2IGeXB2004042 LDM-6 McIDAS-XCD scour

Alan,

>Noticed that our ldm has stopped getting data from papagayo
>as of about 10Z on 17 Mar.  My log files seemed ok up to that 
>time,  then data log stopped.  I have checked with Clint, see
>his response below.
>Any suggestions.

OK.  The messages in Clint's log file confirm/demonstrate the inability
of his LDM to send you data.

>Have stopped and restarted my ldm this morning, but it is still 
>not ingesting.

I logged on and was able to run notifyme to papagayo to verify that
nothing has changed on Clint's side (allows, etc.):

<as 'ldm'>
notifyme -vxl- -f ANY -o 3600 -h papagayo.unl.edu

Data lists came back immediately, proving that Clint's machine is
set up correctly to allow feeds from waldo.

I then ran top and noticed that the load average on waldo was 44.
Since this is extremely unusual, I decided to shut down the LDM
and run some checks on the queue.

/usr/local/ldm% pqcat -s -q data/ldm.pq -l-
Mar 18 18:28:36 pqcat: Starting Up (9152)
Mar 18 18:28:36 pqcat: assertion "IsAlloc(rep)" failed: file "pq.c", line 1907
Abort (core dumped)

This looked as though the queue was corrupted, so I decided to try and
delete and remake it:

/usr/local/ldm% ldmadmin delqueue
/usr/local/ldm/data/ldm.pq: No such file or directory

After verifying that there was still a link between /var/data/ldm and
/usr/local/ldm/data, I looked for a queue:

/usr/local/ldm% cd data
/usr/local/ldm/data% ls -alt
total 22
drwxr-xr-x   5 ldm      data         512 Mar 18 18:28 ./
drwxr-xr-x   2 ldm      data        6656 Mar 18 16:08 logs/
drwxrwxr-x   4 ldm      data         512 Nov  6 21:01 gempak/
drwxrwxr-x   3 ldm      data         512 Sep 25 01:00 surface/
drwxrwxr-x   4 ldm      data         512 Nov 24  1999 ../

So, your problem was that your LDM queue somehow got deleted!
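
A check like the following would have pointed at the missing queue before
running pqcat.  This is only a minimal sketch: check_queue is a hypothetical
helper, not part of the LDM, and the default path is simply the queue path
shown in the delqueue output above.

```shell
# Hypothetical helper: report whether an LDM queue file exists and is non-empty.
# The default path is the one from the session above; adjust it for your site.
check_queue() {
    queue="${1:-/usr/local/ldm/data/ldm.pq}"
    if [ -s "$queue" ]; then
        echo "queue present: $queue"
        return 0
    else
        echo "queue missing or empty: $queue"
        return 1
    fi
}
```

Run as 'ldm', "check_queue" with no argument tests the default path; a
non-zero exit status means the queue needs to be remade.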

I remade the queue and then restarted your LDM:

/usr/local/ldm% ldmadmin mkqueue
/usr/local/ldm% ldmadmin start

Data is once again flowing into waldo.  Now, the question is how the
LDM queue got deleted!?

While I was on waldo, I decided to move the scouring of McIDAS-XCD
produced data files to the 'ldm' account:

<as 'ldm'>
cd util                   <- ~ldm/util is in the PATH for 'ldm'
cp ~mcidas/workdata/mcscour.sh .

<I looked at the contents of mcscour.sh to make sure that all the
environment variables are set correctly, and they are>

I changed the mcscour.sh logging from /home/mcidas/workdata/scour.log
to ~ldm/logs/mcscour.log.  This puts almost all of your LDM-related
log files into ~ldm/logs.  The only one that I didn't move/change
was /home/mcidas/workdata/ROUTEPP.LOG.  This can easily be moved
by editing the MCLOG setting in ~ldm/decoders/batch.k.

Next, I moved McIDAS ADDE server logging from ~mcidas/workdata to
~ldm/logs.  This required that I:

o set up a McIDAS REDIRECTion for SERVER.* in the 'mcidas' account
o change the permissions on /var/data/ldm/logs so that it was
  group writable (mcidas and mcadde are in the same group as ldm)
o move ~mcidas/workdata/SERVER.LOG to ~ldm/logs and change its
  permission to be writable by mcadde
o add a cron entry to 'ldm's crontab to rotate the SERVER.LOG* files
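
For reference, the rotation step could be sketched roughly as below.  This
is not the cron entry that was actually installed on waldo: rotate_log is a
hypothetical helper, and the number of old copies kept and the 664 mode are
assumptions chosen so that a group member such as mcadde can still write
the new file.

```shell
# Hypothetical rotation helper (NOT the actual cron entry used on waldo).
# Shifts LOG.(n-1) -> LOG.n, then LOG -> LOG.1, keeping at most $2 old
# copies, and finally recreates an empty, group-writable LOG.
rotate_log() {
    log="$1"
    keep="${2:-7}"          # assumed default: keep 7 old copies
    i="$keep"
    while [ "$i" -gt 1 ]; do
        prev=$((i - 1))
        [ -f "$log.$prev" ] && mv -f "$log.$prev" "$log.$i"
        i="$prev"
    done
    [ -f "$log" ] && mv -f "$log" "$log.1"
    : > "$log"              # start a fresh, empty log
    chmod 664 "$log"        # assumption: group-writable for mcadde
}
```

Placed in a script under ~ldm/util, a call such as
"rotate_log $HOME/logs/SERVER.LOG 7" could be run from 'ldm's crontab.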

Then, since the dostats action is commented out in 'ldm's crontab
file, I edited ~ldm/etc/ldmd.conf to stop pqbinstats from running.
This prevents the .stats files from being created in ~ldm/logs.
This is necessary because the 'ldmadmin dostats' action, which is
normally run from cron, is what scours the .stats files; without it
they would accumulate indefinitely.
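
To confirm that the change took, one could test whether ldmd.conf still
contains an active pqbinstats entry.  The sketch below assumes the entry
has the form exec "pqbinstats" at the start of a line and that a leading
'#' disables it; pqbinstats_enabled is a hypothetical helper name.

```shell
# Hypothetical check: succeed (exit 0) if the given ldmd.conf still has an
# active, uncommented exec line that starts pqbinstats.
# Assumption: the entry looks like   exec "pqbinstats"
pqbinstats_enabled() {
    conf="${1:-$HOME/etc/ldmd.conf}"
    grep -Eq '^[[:space:]]*exec[[:space:]]+"?pqbinstats' "$conf"
}
```

After the edit described above, "pqbinstats_enabled" run as 'ldm' should
return a non-zero status.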

The last thing I did was run ~ldm/util/mcscour.sh "by hand" as 'ldm'
to make sure that it worked.  It apparently does, since the March 16
.XCD file in /var/data/mcidas and its associated .IDX files were
scoured off.  This leaves that file system with about 3.5 GB of
free space:

% df -k
Filesystem            kbytes    used   avail capacity  Mounted on
/proc                      0       0       0     0%    /proc
/dev/dsk/c0d0s0      7396768 3681199 3641602    51%    /
fd                         0       0       0     0%    /dev/fd
swap                  802576     312  802264     1%    /tmp
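
The "about 3.5 GB" figure comes from the avail column for / in the listing
above (3641602 kbytes).  As a small worked example, assuming 1 GB is taken
as 1048576 kbytes, the conversion can be done with a hypothetical helper:

```shell
# Hypothetical helper: convert a kbytes figure (as printed by df -k) to GB,
# taking 1 GB = 1024 * 1024 kbytes, rounded to one decimal place.
kb_to_gib() {
    awk -v kb="$1" 'BEGIN { printf "%.1f\n", kb / 1048576 }'
}

kb_to_gib 3641602    # avail for / from the df output above; prints 3.5
```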


Recap:

- the LDM was not receiving data because something had deleted the LDM
  queue even though the LDM was still running.  I remade the queue
  and restarted the LDM.  Data is being received and processed
  normally once again

- I moved the XCD scouring to an 'ldm' cron job and moved the log
  file to ~ldm/logs/mcscour.log

- I moved the McIDAS ADDE remote server logging to ~ldm/logs and set up
  a cron entry to rotate the log files

- I stopped pqbinstats from being run at LDM startup

We need to keep an eye on the McIDAS-XCD scouring done by mcscour.sh
to make sure that it continues to work.  
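
One simple way to watch for this is to list data files that are older than
the scouring should allow.  The following is a minimal sketch only:
stale_files is a hypothetical helper, and the '*.XCD' pattern and the
number of days to keep are assumptions that should be matched to the
settings in mcscour.sh.

```shell
# Hypothetical watchdog: print files under a directory that match a name
# pattern and were last modified more than N days ago.  Any output suggests
# that scouring has stopped removing old files.
stale_files() {
    dir="$1"
    pattern="$2"
    days="$3"
    find "$dir" -type f -name "$pattern" -mtime +"$days" -print
}
```

For example, "stale_files /var/data/mcidas '*.XCD' 3" should print nothing
while scouring is working, given a 3-day retention (an assumed value).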

Please let me know if you see anything amiss on waldo.

Tom

>-----Original Message-----
>From: Clint Rowe [mailto:address@hidden
>Sent: Tuesday, March 18, 2003 10:33 AM
>To: Anderson, Alan C. 
>Subject: Re: ldm at papagayo
>
>
>Alan,
>I seem to have all the data and papagayo's been chugging along without any
>problems.  There are some errors regarding waldo in yesterday's log file:
>
>Mar 17 10:10:08 papagayo waldo(feed)[4767]: up6.c:168: HEREIS: RPC: Unable to send; errno = Broken pipe
>Mar 17 10:10:08 papagayo waldo(feed)[4767]: up6.c:369: Product send failure: I/O error
>Mar 17 10:10:16 papagayo rpc.ldmd[28230]: child 4767 exited with status 6
>
>...
>
>Mar 17 10:21:58 papagayo waldo(feed)[28849]: up6.c:168: HEREIS: RPC: Unable to send; errno = Broken pipe
>Mar 17 10:21:58 papagayo waldo(feed)[28849]: up6.c:369: Product send failure: I/O error
>Mar 17 10:22:06 papagayo rpc.ldmd[28230]: child 28849 exited with status 6
>
>...
>
>Mar 17 10:35:22 papagayo waldo(feed)[28847]: up6.c:168: HEREIS: RPC: Unable to send; errno = Broken pipe
>Mar 17 10:35:22 papagayo waldo(feed)[28847]: up6.c:369: Product send failure: I/O error
>Mar 17 10:35:30 papagayo rpc.ldmd[28230]: child 28847 exited with status 6
>
>I think the problem is at your end, as I'm getting data and nobody else has 
>complained.
>
>Let me know if you can't get restarted.
>Clint
>
>
>>
>>Hi Clint
>>
>>We stopped getting data from papagayo yesterday,  Mar. 17 at about  10Z
>>
>>Is there a problem at unl ?
>>
>>Alan Anderson
>>St. Cloud State



NOTE: All email exchanges with Unidata User Support are recorded in the Unidata inquiry tracking system and then made publicly available through the web. If you do not want to have your interactions made available in this way, you must let us know in each email you send to us.