[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

19990319: Post Processing Problems (again...)



>From: "Jennie L. Moody" <address@hidden>
>Organization: UVa
>Keywords: 199903091926.MAA12825 McIDAS ROUTE SYSIMAGE.SAV

Jennie,

>Its a long message.

Groan...

>I was typing it and going off to do
>diagnostics as I typed it up.  You don't have to answer
>much of it, but there is a summarial question at the end:
>
>Sequence of events:
>
>1) Finally got around to rebooting windfall yesterday morning.
>It was spring break, I had a little extra time, and I would get a 
>clean start.  

OK, good.

>Stopped the ldm.  
>
>2) There were some disk errors in the reboot, presumably indicating
>that it was a good thing to reboot, with assitance from local unix
>support ran fsck and then did a "clean" reboot (no errors).  Was led to 
>believe that things would be fine....  

Were any files of substance damaged?

>3) Decided to go ahead and del and remake the product queue and 
>then started the ldm.

OK, but this may not have been needed.

>4) Got preoccupied with other tasks (working on proposal, responding
>to old mail, and then spent the afternoon helping students working 
>on projects who needed data and need to work on mcidas using data 
>from archived case studies. ) In short, did not pay attention to
>real-time data acquisition..... 
>
>5) Before going home did a quick check of the ldm (looked at the
>product queue and /incoming/data/xcd to see that prods were decoding
>etc., and checked the ldm log, *but* failed to check the mcidas.log
>and amazingly didn't look at the webpage to notice there were any problems
>with images not updating. (yes, I am an idiot.)
>
>6) Came in this morning and realized nothing was being updated. (no ppbatch
>running again!).  Looked at the current mcidas.log and found that
>everything looked okay (although every instance of any mcidas decoder action
>(gunrv2, ltwoa3, etc.) had the message of unsetting MCPATH, e.g.:
>
>Mar 19 02:17:22 lwtoa3[7897]: Starting Up
>Mar 19 02:17:22 lwtoa3[7897]: unsetting MCPATH environment variable
>Mar 19 02:17:22 lwtoa3[7897]: changing to directory /incoming/data/mcidasd
>Mar 19 02:17:22 lwtoa3[7897]: decoding "LWTOA3 122 DIALPROD=U9 99078  21600"
>Mar 19 02:17:22 lwtoa3[7897]: PRODUCT CODE=U9          99078       021600
>Mar 19 02:17:34 lwtoa3[7897]:  Done -- AREA= 120
 ...

I think that this is being written because you have verbose logging turned on.
I recommend that you turn this off (remove the -v flags from the ldm-mcidas
decoder invocations in pqact.conf).

>But, when I check the logs, these products are in and seem complete, etc.
>
>windfall: /home/ldma/logs $ ls -al /incoming/data/mcidasd/MD*8
>-rw-rw-rw-   1 ldma     mcidas   4224864 Mar 19 11:31 /incoming/data/mcidasd/M
> DXX0008
>-rw-rw-rw-   1 ldma     mcidas    192408 Mar 19 09:31 /incoming/data/mcidasd/M
> DXX0018
>-rw-rw-rw-   1 ldma     mcidas    160448 Mar 18 21:32 /incoming/data/mcidasd/M
> DXX0028
>windfall: /home/ldma/logs $ ls -al /incoming/data/mcidasd/AREA0120
>-rw-rw-rw-   1 ldma     mcidas   2438976 Mar 18 21:17 /incoming/data/mcidasd/A
> REA0120
>
>So everything look pretty good.  But, no post-process batch files
>are running, and the ROUTEPP.LOG file doesn't exist?

Sounds like the same problem again.

>Nothing is suspended in the ROUTE LISTING, and products are logging.

When you say logging, I assume that you mean updating ROUTE.SYS?

>I have checked the PATH's

>
>windfall: /home/ldma/util $ echo $PATH
>/home/ldma/util:/home/ldma/ldm-mcidas-7.4.0/bin:/home/ldma/bin:/home/ldma/etc:
>...

OK, this is the PATH for your 'ldma' user.

>And in the file /home/ldma/util/batch.k the following is set:
>#             MCLOG  - fully qualified pathname of the file in which 
>#                      BATCH commands will be logged
>#
>#             PATH -   a search PATH that contains the directory of
>#                      McIDAS executables.  NOTE that in this PATH
>#                      the McIDAS bin directory MUST be before the
>#                      directory in which this file, batch.k, exists
>#                      (do not confuse this with the seemingly contrary
>#                      statement about PATHs above)
>
>
>MCDATA=/home/mcidas/workdata
>MCLOG=/home/mcidas/workdata/ROUTEPP.LOG
>MCPATH=${MCDATA}:/home/mcidas/data:/home/mcidas/help
>
>PATH=/home/mcidas/bin:/home/ldma/ldm-mcidas-7.4.0/bin:/usr/openwin/bin:/opt/SU
> NWspro/bin:/usr/ccs/bin:/usr/bin:/usr/local/bin:/usr/ucb

OK, this looks correct.  You could remove the ldm-mcidas directory from
this PATH completely.  There are no programs that will be needed by this
Bourne shell script invocation or any of its children.

>The shell-script is executable by user ldma:
>
>-rwxr-xr--   1 ldma     mcidas      6518 Mar 16 13:48 batch.k*
>The location of the log file should be writable (in fact its totally
>open):
>drwxrwxrwx   3 mcidas   mcidas      5632 Mar 19 10:44 workdata/

It is "totally open" since I changed its access permission the last time
I was on your machine trying to get things to run.  I failed to change
it back to 775.

>I can "make" a LOG file there:
>windfall: /home/ldma $ whoami
>ldma
>windfall: /home/ldma $ touch /home/mcidas/workdata/ROUTEPP.LOG
>
>-rw-rw-r--   1 ldma     mcidas         0 Mar 19 12:02 /home/mcidas/workdata/RO
> UTEPP.LOG

Ok, good.

>Wait a minute, maybe thats the problem?  Why wasn't there already a
>ROUTEPP.LOG file?  I didn't remove it when I stopped the ldm and
>rebooted the system?  

I don't think that ROUTEPP.LOG needs to exist for logging to proceed.
The stdout and stderr redirections in batch.k say to append to this
file.  If the file doesn't exist, it gets created on the first write
attempt.

>So I was trying to remember how the ROUTEPP.LOG file gets scoured and I
>looked at the crontab listing for user ldma, it had the following entry
>(which I just commented out)
>0 18 * * * /usr/bin/rm /home/ldma/logs/*.stats
>0 18 * * * /usr/bin/rm /home/ldma/logs/mcidas.log
>55 20 * * * /home/mcidas/bin/mcscour.sh
>04 0 * * * cp /incoming/data/front/ASUS1* /p2/prophet98/FRONT
>05 0 * * * /usr/bin/rm /incoming/data/front/ASUS1*
>05 0 * * * /usr/bin/rm /incoming/data/front/FSUS2*
>#10 0 * * * /usr/bin/rm /home/mcidas/workdata/ROUTEPP.LOG

OK, I put scouring of ROUTEPP.LOG into mcscour.sh, so the extra invocation
in 'ldma's cron was redundant.

>This might explain why it was gone (maybe?)

It would explain why it would get deleted, but not why it was not getting
recreated.

>But, why didn't it come back?

I don't know.

>How is the log file supposed to get scoured by the
>way (it does grow rather long), this approach must not be correct? 

Through the entry in mcscour.sh:

MCPATH=$MCPATH PATH=$PATH LD_LIBRARY_PATH=$LD_LIBRARY_PATH mcenv << 'EOF'

doqtl.k  1 70 2
qrtmdg.k GRID 5001 5400 1
doqtl.k 71 80 2
doqtl.k 81 90 2
doqtl.k 91 100 2
delwxt.k 1 10
igu.k DEL 132
lwu.k DEL VIRT9001
lwu.k DEL VIRT9002
lwu.k DEL ROUTEPP.LOG
batch.k \"BROADCAST.BAT
exit

EOF

Notice the 'lwu.k DEL ROUTEPP.LOG' invocation.

>I tried (as user ldma) to just make a test log using the commands
>in the batch.k script, and it works fine.
>
>windfall: /home/ldma/util $ MCLOG=/home/mcidas/workdata/test.log              
>               
>windfall: /home/ldma/util $ ls -al /home/mcidas/workdata/test.log
>/home/mcidas/workdata/test.log: No such file or directory
>windfall: /home/ldma/util $ echo $MCLOG                          
>/home/mcidas/workdata/test.log
>windfall: /home/ldma/util $ echo "testing" >> $MCLOG
>windfall: /home/ldma/util $ ls -al /home/mcidas/workdata/test.log
>-rw-rw-r--   1 ldma     mcidas         8 Mar 19 12:20 /home/mcidas/workdata/te
> st.log
>
>windfall: /home/ldma/util $ more /home/mcidas/workdata/test.log 
>testing

OK, looks correct.

>So, I don't get it.  Why can't batch.k make the ROUTEPP.LOG
>file as it should?

I a betting that the first routine being run by 'mcenv' is core dumping
thereby stopping all further processing.  This is supported by the fact
that the 'core' file that I looked at the last time I was on your machine
was from GU.  'mcenv' runs GU after creating the frames for the mini-session.

>Indeed, maybe the question is wrong,
>maybe it should be: is the batch.k file ever even getting run?  
>I am unsure how to check that.   

I suggest adding a information line in batch.k before the mcenv invocation
and after the stdout/stderr redirection:


# Send all messages to the log file and start the process
exec 2>> $MCLOG 1>&2

time=`TZ=UTC date +%y%j.%H%M%S`
echo "Starting $@ at $time"

This message will log when batch.k runs to ROUTEPP.LOG.

>I know that AREA0133 just came in (99078 1717) and it should have
>fired IR9.BAT.   The ROUTEPP.LOG file exists (because I touched
>it a while ago writing this message), and yet nothing got
>logged?  So, it would seem the post-batch processing isn't even running.
>
>DAMN, what am i forgetting?

Probably nothing.  Again, if something that mcenv tries to run dumps
core, then all processing will stop at that point.  The question then
becomes one of why that(thos) routine(s) are dying.

>This stuff is really driving me nuts.  I AM trying to maintain
>(diagnose) things myself, but this is where I end up:
>
>nowhere.

There is obviously something amiss on your system.  At what level things
have gone bad is unknown.  The strange thing I noticed the last time I
was on your system was that lingering shared memory segment.

>read through last mail on ppbatch problems, and realized
>I hadn't looked for what had happenened before when the ppbatch 
>files fail:  core dumps and other junk files corrupting the incoming
>data directory.  
>
>Yup, they're there:
>
>
>windfall: /incoming/data/mcidasd $ file core
>core:           ELF 32-bit MSB core file SPARC Version 1, from 'gu.k'

OK.  GU dying is causing the mcenv invocation to stop and, therefore,
the batch.k processing to die.

>windfall: /incoming/data/mcidasd $ ls -al co*
>-rw-rw-rw-   1 ldma     mcidas   1627920 Mar 19 12:38 core
>
>stopped the ldm.
>found all the files written at the time of the core file:
>windfall: /incoming/data/mcidasd $ ls -al | grep 'Mar 19 12:38'
>drwxrwxr-x   2 ldma     mcidas      6144 Mar 19 12:38 ./
>-rw-rw-rw-   1 ldma     mcidas      4020 Mar 19 12:38 ALLOC.WWW
>-rw-rw-rw-   1 ldma     mcidas   2422256 Mar 19 12:38 AREA0147
>-rw-rw-rw-   1 ldma     mcidas      6528 Mar 19 12:38 FRAMENH.001
>-rw-rw-r--   1 ldma     mcidas     14592 Mar 19 12:38 ROUTE.SYS
>-rw-rw-rw-   1 ldma     mcidas    307200 Mar 19 12:38 SYSIMAGE.SAV
>-rw-rw-rw-   1 ldma     mcidas     45056 Mar 19 12:38 TERMCHAR.001
>-rw-rw-rw-   1 ldma     mcidas   1627920 Mar 19 12:38 core

The FRAMENH.001, SYSIMAGE.SAV, and TERMCHAR.001 files will cause a problem
Did you check to make sure that there are no others in any other directory
in the 'mcidas' user's MCPATH?  To do this, logon as 'mcidas', cd to
workdata, and run:

dmap.k \*.001
dmap.k SYSIMAGE.SAV

>deleted the junk (but not the ROUTE.SYS and AREA0147).
>
>checked for any of these (or any other suspicious looking files)
>in mcidas/workdata too.  none.

Check all MCPATH directories as I indicate above.

>restarted the ldm.
>ran top and watched until batch.k ran.  First pp-batch to fire was 
>MDR, which ran fine, updating the radar.gif on the webpage.

So, GU core dumping is the culprit and the supporting evidence is the
*.001 and SYSIMAGE.SAV files in the /incoming/data/mcidasd directory.

>so, looks like things are working.
>and maybe i'm not a total idiot (though i am still frustrated
>by how much time this sucks up)
>but, there is a big unresolved question:

This stuff is not rocket science.  The fact that these are bombing and
nobody is mucking with files indicates that there is some other problem
at work.  What it is, I can't say; sorry.

>When i rebooted the system yesterday, and then restarted the
>ldm, there must have already been a problem.

I agree.

>I looked more
>carefully at the web page and it had stopped updating products
> *before* I had stopped and restarted the ldm (and done the
>reboot).

Is anybody loggin on as 'mcidas' and running a session?  Does anybody
else's session have the /home/mcidas/workdata directory in its
MCPATH?  I have a sneaking suspicion that the problems are being caused
by someone not playing by the rules or some sort of systematic OS
thing.

>The last good products posted at 15:00 (10am local time)
>but I did the reboot about 11 local time.  So, if there was already 
>a core file and all these bad files (FRAMNH, TERMCHAR) sitting around 
>in /incoming/data/mcidasd I can see how would that have 
>compromised the correct running of batch processes.

Right.

>But, why wasn't 
>there a log file (even if the batch was running and
>failing, doesn't it do any logging?).

See above.

>Actually, I guess its possible
>that the log file would have been there when I restarted the
>system, but then it got scoured last night (0:10 local time) and
>never got remade....If a pp-batch tries to run and fails because
>it has previous errors in its way (a core file and the things
>it leaves around), should it still log something?

Not necessarily.  It depends on what routine is bombing out.  Since it
looks like it is always GU on your system, and since GU is run by mcenv,
then it makes sense that nothing would get logged.

>Unless you can think of something that might be wrong with our setup
>here, I guess I can only watch for the post processes to remain stable....

I can't think of anything off of the top of my head.  I will have to
check with Mike Schmidt to see if there is a OS patch for shated memory
that needs to be installed.

Tom
**************************************************************************** <
Unidata User Support                                    UCAR Unidata Program <
(303)497-8644                                                  P.O. Box 3000 <
address@hidden                                   Boulder, CO 80307 <
---------------------------------------------------------------------------- <
Unidata WWW Service                        http://www.unidata.ucar.edu/      <
**************************************************************************** <