
19990309: Route Post Process Failure



>From: "Jennie L. Moody" <address@hidden>
>Organization: UVa
>Keywords: 199903091926.MAA12825 McIDAS ROUTE SYSIMAGE.SAV

Jennie,

>Hi.  I am having a bad day.  We stopped getting McIDAS data yesterday
>(after I encouraged my class to pay attention to the satellite imagery
>as this storm developed, natch).  Anyway, after trying to diagnose what
>might be wrong I finally found the problem was at our host's end.

You know that you could have used ADDE to look at imagery on other
people's machines ;-)

>o Jeff Wolfe fixed his system, and data started flowing.  

Good.

>I am working at home, but I could see that new Unidata-McIDAS feed
>files were sitting in the /incoming/data/mcidasd directory, so I
>thought things were going to be okay.
>
>o the mcidas.log file showed products were coming in and getting
>  logged 

There is a strange message in ~ldma/logs/mcidas.log, however:

Mar 09 23:32:38 lwtoa3[27414]: Starting Up
Mar 09 23:32:38 lwtoa3[27414]: unsetting MCPATH environment variable
Mar 09 23:32:38 lwtoa3[27414]: changing to directory /incoming/data/mcidasd
Mar 09 23:32:38 lwtoa3[27414]: decoding "LWTOA3 163 DIALPROD=UA 99068 233222"
Mar 09 23:32:38 lwtoa3[27414]: PRODUCT CODE=UA          99068       233222
Mar 09 23:32:40 lwtoa3[27414]:  Done -- AREA= 166
Mar 09 23:32:40 lwtoa3[27414]: Exiting

The line that is unsettling is "unsetting MCPATH environment variable".
I don't know if this is something to worry about, or simply a reflection
of your having verbose logging turned on.

>o a run of route.k LIST showed me that the routing table was being 
>  updated.  

So far, so good.

>However, I found there were core files being dumped by the ldma user
>(who runs the PPBatch files).

You don't say where.

>I went to look at the ROUTEPP.LOG file
>and it didn't exist?

Not good.

>I thought maybe having the McIDAS data just start
>flowing again when my ldm had been running might have caused some
>problem, so I stopped and restarted the ldm. (The clueless turn it off
>and turn it back on response, which I don't like).

Actually, this is not a bad strategy for the clued.

>Since Tom told me
>if you are restarting the ldm right away, there is no need to rebuild
>the queue,

Actually, my comment was not tied to any time frame.  The only time
that you need to remake the queue is when it gets corrupted.  The reason
to _not_ remake the queue unless you have to is that a freshly made queue
implicitly requests the past hour's worth of data from your upstream
feed site.

>I didn't (though I had remade the queue earlier today, when
>I was switching hosts, etc.  And my system had been receiving xcd
>products as I noted earlier.)

OK.

>Anyway:
>
>o I am still getting core dumps 

Where?  Which program(s) dumped the core files?  To find this out,
cd to the directory containing the core file and type:

file core
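
If the core files are scattered over several directories, the same check
can be run in one pass.  The following is only a sketch: the function name
identify_cores is made up for illustration, and the default directory is
simply the one mentioned in this thread.

```shell
# List every regular file named "core" under a directory and ask file(1)
# what it is.  On most systems file(1) names the program that dumped it.
# identify_cores is a hypothetical helper, not part of McIDAS or the LDM.
identify_cores() {
    dir=${1:-/incoming/data}
    find "$dir" -type f -name core -print | while IFS= read -r f; do
        file "$f"
    done
}
```

Usage would be something like "identify_cores /incoming/data" as the user
that owns the files.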

>o the McIDAS products are coming in, but we are not 
>  "post processing" them.  

OK.

>o there is no ROUTEPP.LOG file to look at, so I am stumped

OK.

>Any help would be appreciated.  I would like to have stuff
>back up for class discussion tomorrow.  

I am on your machine as I write this.

>------------
>In a separate item, the only response to my needdata request so far has
>been from someone who archived gempak grids.  Is it insane to think I
>might be able to get these files from gempak back into GRIB or into
>some other form that McIDAS could read (of course I cannot do this here
>since I don't run GEMPAK...

It may not be insane, but it is somewhat demented :-)

>would like to, but I have my hands full
>just trying to keep on top of McIDAS/LDM along with all the things I am
>supposed to do).

I understand.

>-----------
>
>BTW: if it's Tom reading this message,

I am.

>just so you don't think we have
>anything  really messed up on our system (like our clock) after my
>question yesterday, Owen and I are the ones who were messed up, not the
>system, wasn't there some adage about user's always wrong?...

Don wrote the note on my white board:  99% of all errors are user induced!!!

>Anyway, I
>am happy to hear what I/we have done wrong currently to cause our post
>processing to fail...it had been working up until we stopped receiving
>data from PSU, now that we have data flowing, we have no
>post-processing.

OK.

>If you are getting frustrated by my questions, you can perhaps imagine
>how frustrated I am about continually asking for assistance!

I'm not getting frustrated by the questions.  After all, it is my job.

>(By the way, even diagnosing things is a hassle working from home,

Hmm...  I now have a connection from Sugarloaf into the NCAR/UCAR RAS
(Remote Access Service) through a 28.8 modem.  I worked from home for
the first time yesterday afternoon/evening and found the experience
to be very enjoyable.

>cause my line is 30 minutes, timed, and I keep losing my connection.

So it disconnects after 30 minutes even if you are actively using the
line?  This sucks!

>If there is any reason a phone call would help, I am at 804-977-0910.
>Thanks.
>
>>From address@hidden  Tue Mar  9 14:52:11 1999
>
>o I looked around and found that core dumped a bunch of files in
>  /incoming/data that shouldn't be there (ALLOC.WWW, FRAME.001,
>  TERMCHAR.001, SYSIMAGE.SAV) so I deleted them (recalling this was a
>  problem in the past when we had ppbatch failure)

I found the same files there also, so I deleted them.  The weird part
is that ALLOC.WWW never gets created there.  It is almost as though
MCPATH for the ROUTE PP BATCH session had the /incoming/data/mcidasd
directory as its _first_ directory.  If this is the case, then all
bets are off in terms of REDIRECTions, etc.
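
One quick way to test that suspicion is to look at the first entry of the
MCPATH value the batch session actually sees.  This is a minimal sketch;
first_dir and check_mcpath are hypothetical helper names, and the
"bad" directory defaults to the decoder directory from this thread.

```shell
# Print the first directory of a colon-separated search path.
first_dir() {
    printf '%s\n' "${1%%:*}"
}

# Warn (and return 1) if the given directory leads the path.
# $1 = path value (e.g. "$MCPATH"), $2 = directory that should NOT be first.
check_mcpath() {
    path=$1
    bad=${2:-/incoming/data/mcidasd}
    if [ "$(first_dir "$path")" = "$bad" ]; then
        echo "WARNING: $bad is first in MCPATH"
        return 1
    fi
    echo "ok: first MCPATH directory is $(first_dir "$path")"
}
```

For example, adding check_mcpath "$MCPATH" near the top of the script that
starts the batch session would record the answer in its output.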

>o I stopped the ldm, then deleted these files, then found one of them
>  (FRAMENH.001) in workdata, and mv'd it to a new name (I kept it cause
>  I was trying to figure out from its time stamp what the hell was
>  going on)

I saw that one also.

>o I restarted the ldm
>
>o first mcidas file that came in after restarting was an MDR, triggering
>  MDR.BAT
>
>o the batch failed, dumped core and all these nasty files again, and
>  still didn't write anything to ROUTEPP.LOG, which clearly isn't
>  getting made.

OK.

>Just wanted to let whoever might log in (please do!) know that since my
>last message, I got in and tried to fix things up...but I don't have
>notes here from the last time we had pp-batch problems, and I cannot
>recall everything that was a problem.
>
What I did was:

o logon as 'mcidas'
o cd to workdata
o run DMAP:

  dmap.k \*.001

The listing I got showed that these files were in directories other
than /incoming/data/mcidasd:

windfall: /p0/users/mcidas/workdata % dmap.k \*.001
PERM      SIZE LAST CHANGED FILENAME          DIRECTORY
---- --------- ------------ ----------------- ---------
-r--      6528 Mar 09 16:09 FRAMENH.001       /home/mcidas/help
-rw-     58752 Mar 09 12:43 FRAMENH.001.keep? /home/mcidas/workdata
-r--     45056 Mar 09 16:09 TERMCHAR.001      /home/mcidas/help
110336 bytes in 3 files

Not good!

>I surrender, take away my Ph.D.  I do recall that everything gets hosed
>when these temp files get in the PATH for the ldm (or user mcidas), and
>every process after will keep finding them first.

You remember correctly.
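
A sketch of how one might sweep every directory in a search path for these
leftovers, in search order, so the first hit is the one a session would
pick up.  The function name list_stray is hypothetical, and the file names
are just the ones that turned up in this thread, not a complete list of
McIDAS session files.

```shell
# For each directory in a colon-separated path, in order, print any of the
# scratch files named in this thread that are sitting in it.
list_stray() {
    old_ifs=$IFS
    IFS=:
    for d in $1; do
        IFS=$old_ifs
        [ -d "$d" ] || continue
        for f in "$d"/*.001 "$d"/SYSIMAGE.SAV "$d"/ALLOC.WWW "$d"/core; do
            if [ -f "$f" ]; then
                printf '%s\n' "$f"
            fi
        done
    done
    IFS=$old_ifs
}
```

Running list_stray "$MCPATH" as 'ldma' and again as 'mcidas' would show
whether either account sees stray copies ahead of the real ones.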

>But, I don't know
>why they are dumping here in the first place.

I don't either.

>I loathe this
>aggravation, I should have spent this time grading, writing, etc.

I understand your frustration.

>At
>least its snowing, I'm going to go play with my kids in the snow and
>hope that someone out there can help.

Well...  And I am on your machine sniffing out bugs.  Where's the justice
in that ;-)

So, while I was sitting here, I experienced another instance of SYSIMAGE.SAV
being created in the /home/mcidas/workdata directory.  I suspect that this
is either due to bad settings in /home/ldma/util/batch.k or something
much worse.  I really don't suspect anything bad in /home/ldma/util/batch.k
since it has been working right along.  Something worse would require that
the machine be rebooted.

Further probing reveals that your 'ldma' user can't create files in
the /home/mcidas/workdata directory:

windfall: /home/ldma/util $ touch /home/mcidas/data/ROUTEPP.LOG
touch: /home/mcidas/data/ROUTEPP.LOG cannot create

Not being able to create files in that directory would account for the
log file not existing.  I can't say off the top of my head what other
ripple effects it might have.

Strangely enough, the permissions on /home/mcidas/workdata looked like
'ldma' should have been able to write there:

drwxrwxr-x   3 mcidas   mcidas      5632 Mar  9 18:50 workdata/

This shows read/write/execute for owner and group and read/execute for
world.  I quickly checked to make sure that 'ldma' and 'mcidas' are
still in the same group:

windfall: /p0/users/mcidas % id mcidas
uid=101(mcidas) gid=101(mcidas)
windfall: /p0/users/mcidas % id ldma
uid=100(ldma) gid=101(mcidas)

Since they are in the same group, the directory permissions above should
have been sufficient.  In order to try and get things working, I changed
the permissions to 777:

windfall: /p0/users/mcidas % chmod 777 workdata

I could then create /home/mcidas/workdata/ROUTEPP.LOG as 'ldma'.
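
The same test can be wrapped up so it cleans up after itself and can be
rerun after any permission change.  This is only a sketch; can_create_in
and the probe file name are made up for illustration.

```shell
# Report whether the current user can create a file in a directory.
# Creates and then removes a small probe file; returns 1 on failure.
can_create_in() {
    dir=$1
    probe="$dir/.write_probe.$$"
    if ( : > "$probe" ) 2>/dev/null; then
        rm -f "$probe"
        echo "writable: $dir"
    else
        echo "NOT writable: $dir"
        return 1
    fi
}
```

Run as 'ldma', "can_create_in /home/mcidas/workdata" answers the question
directly, whatever ls -ld and id appear to say.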

As I continued to browse around looking for things that might be wrong,
I decided that it would be interesting to see how long this machine has
been up:

windfall: /home/mcidas/workdata % uptime
  7:27pm  up 61 day(s),  8:51,  4 users,  load average: 0.61, 0.21, 0.13

It may well be that some part of the operating system has gone south
and the machine needs rebooting!

One thing that I see that shouldn't have any ill effects is the presence
of these two files:

ls -l /home/ldma/SYSKEY.TAB
-rw-r--r--   1 ldma     mcidas     24000 May  4  1998 /home/ldma/SYSKEY.TAB

ls -l /home/ldma/etc/ROUTE.SYS
-rw-r--r--   1 ldma     mcidas      7168 May  4  1998 /home/ldma/etc/ROUTE.SYS

Why are these files there?

As a last ditch effort (shy of rebooting), I did the following as 'ldma':

o ldmadmin stop
o windfall: /incoming/data/mcidasd $ ipcs
  IPC status from <running system> as of Tue Mar  9 19:18:19 1999
  T         ID      KEY        MODE        OWNER    GROUP
  Message Queues:
  Shared Memory:
  m        200   00000000 --rw-------     ldma   mcidas
  Semaphores:

  windfall: /incoming/data/mcidasd $ cd /home/ldma/.mctmp

  windfall: /home/ldma/.mctmp $ ls -al
  total 12
  drwx------   5 ldma     mcidas       512 Mar  8 22:37 ./
  drwxrwxr-x  13 ldma     mcidas      2048 Mar  9 19:16 ../
  drwx------   2 ldma     mcidas       512 Feb 10 18:10 101/
  drwx------   2 ldma     mcidas       512 Feb 10 18:10 102/
  drwx------   2 ldma     mcidas       512 Feb  2 16:09 200/
  windfall: /home/ldma/.mctmp $ ps -eaf | grep mcenv
   jlm8h    89     1  0   Mar 08 pts/11   0:01 mcenv -k 73 -f 11@700x864 -f 5@480x640 mctext -iw -c !@exec mcimage -igeometry 
    ldma 27853 27301  0 19:19:02 pts/23   0:00 grep mcenv

  windfall: /home/ldma/.mctmp $ ipcrm -m 200

  windfall: /home/ldma/.mctmp $ ipcs
  IPC status from <running system> as of Tue Mar  9 19:19:15 1999
  T         ID      KEY        MODE        OWNER    GROUP
  Message Queues:
  Shared Memory:
  Semaphores:

  windfall: /home/ldma/.mctmp $ ls
  ./    ../   101/  102/  200/
  windfall: /home/ldma/.mctmp $ /bin/rm -rf *
  windfall: /home/ldma/.mctmp $ ls
  ./   ../

o windfall: /home/ldma $ ldmadmin start
  starting the LDM server...
  the LDM server has been started

  windfall: /home/ldma $ ls /incoming/data/mcidasd
  ...

  <looking for *.001, SYSIMAGE.SAV, core, etc.  found none, so I moved on>


As the user 'mcidas':

o cd /home/mcidas/workdata

o route.k LIST

S Pd         Description         Range       Last      Received  Post Process C
- -- ------------------------- --------- ------------ ---------- ------------ -
  CI GOES-8/9 IR Composite       80-89   AREA0086     99069   35     none     3
  CV GOES-8/9 VIS Composite      90-99   AREA0092     99069   37     none     3
  CW GOES-8/9 H2O COMPOSITE      70-79   AREA0071     99069   38 H2OCOMP.BAT  3
  LD NLDN Lightning Flashes      71-71       none        none        none     3
  MA Surface MD data            default  MDXX0009     99069   30 SFC.BAT      3
  N1 GOES-8 IR/TOPO Composite   220-229  AREA0220     99069   35     none     3
  N2 GOES-8 VIS/TOPO Composite  230-239  AREA0232     99069   37     none     3
  N3 GOES-9 IR/TOPO Composite   240-249  AREA0243     99042 1718     none     3
  N4 GOES-9 VIS/TOPO Composite  250-259  AREA0252     99042 1626     none     3
  N5 MDR/TOPO Composite         260-269  AREA0267     99067 1006     none     3
  N6 Mollweide IR/TOPO Composi  270-279  AREA0275     99067 1031     none     3
  N7 GOES-8/9 IR/TOPO Composit  280-289  AREA0282     99069   35     none     3
  N8 GOES-8/9 VIS/TOPO Composi  290-299  AREA0296     99069   37     none     3
  NF Global Initialization Gri  101-106  GRID0101     99068 2237 GLOBAL.BAT   3
  NG Early Domestic Products      1-40   GRID0039     99068 1641 ADDGRID.BAT  3
  R1 Base Reflectivity Tilt 1   300-339      none        none        none     3
  R2 Base Reflectivity Tilt 2   340-379      none        none        none     3
  R3 Base Reflectivity Tilt 3   380-419      none        none        none     3
  R4 Base Reflectivity Tilt 4   420-459      none        none        none     3
  R5 Composite Reflectivity     460-499      none        none        none     3
  R6 Layer Reflect SFC-24 K ft  500-539      none        none        none     3
  R7 Layer Reflect 24-33 K ft   540-579      none        none        none     3
  R8 Layer Reflect 33-60 K ft   580-619      none        none        none     3
  R9 Echo Tops                  620-659      none        none        none     3
  RA Vertical Liquid H2O        660-699      none        none        none     3
  RB 1-hour Surface Rain Total  700-739      none        none        none     3
  RC 3-hour Surface Rain Total  740-779      none        none        none     3
  RD Storm Total Rainfall       780-819      none        none        none     3
  RE Radial Velocity Tilt 1     820-859      none        none        none     3
  RF Radial Velocity Tilt 2     860-899      none        none        none     3
  RG Radial Velocity Tilt 3     900-939      none        none        none     3
  RH Radial Velocity Tilt 4     940-979      none        none        none     3
  RI 248 nm Base Reflectivity   980-1019     none        none        none     3
  RJ Storm-Rel Mean Vel Tilt 1 1020-1059     none        none        none     3
  RK Storm-Rel Mean Vel Tilt 2 1060-1099     none        none        none     3
  RM Mandatory Upper Air MD da  default  MDXX0017     99067  232 MAN.BAT      3
  RS Significant Upper Air MD   default  MDXX0027     99067  233 SIG.BAT      3
  U1 Antarctic IR Composite     190-199  AREA0195     99069   11     none     3
  U2 FSL2 hourly wind profiler  default      none        none        none     3
  U3 Manually Digitized Radar   200-209  AREA0201     99069    6 MDR.BAT      3
s U4 Unidata-Wisconsin hourly   default      none        none    PROFILER.BAT 3
  U5 GOES-9 Western US IR       130-139  AREA0130     99069   27 IR9.BAT      3
  U6 FSL2 6-minute Wind profil  default      none        none        none     3
  U9 GOES-9 Western US VIS      120-129  AREA0122     99069   27 VIS9.BAT     3
  UA Educational Floater I      160-169  AREA0167     99069   31     none     3
  UB GOES-9 Western US H2O      170-179  AREA0172     99069   27 H2O9.BAT     3
  UC Educational Floater II      60-69   AREA0062     99069   33     none     3
  UI GOES-8 North America IR    150-159  AREA0155     99069   35 IR8.BAT      3
  UM Administrative Message     default      none        none        none     1
  UR Research Floater           180-189      none        none        none     3
  US Undecoded SAO Data         default  UNIDATAS     99069   31     none     1
  UV GOES-8 North America VIS   140-149  AREA0146     99069   37 VIS8.BAT     3
  UW GOES-8 North America H2O   210-219  AREA0216     99069   36 H2O8.BAT     3
  UX Mollweide Composite IR     100-109  AREA0100     99068 2231 MOLL.BAT     3
  UY Mollweide Composite H2O    110-119  AREA0113     99068 2236     none     3


At the time of the listing, the DAY was 99069, and the time was about
40 past the hour. This listing shows that the ROUTE PostProcessing is
once again working.  Furthermore, there are no core, *.001, SYSIMAGE.SAV,
etc. files in /incoming/data/mcidasd.  My guess, therefore, was that the
problem was caused by whatever processes had allocated the shared memory
segment with id '200'.  My removing this segment while the LDM was off
and my removing the directories under /home/ldma/.mctmp apparently cleared
up the problem.
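
For next time, here is a sketch of a filter that pulls a user's shared
memory segment IDs out of an ipcs listing.  It assumes the column layout
shown in the listing above (T, ID, KEY, MODE, OWNER, GROUP); other systems
print different columns, so check the output before trusting it.  The name
shm_ids_for_owner is hypothetical.

```shell
# Read ipcs output on stdin and print the IDs of shared memory segments
# (rows whose first column is "m") owned by the user given as $1.
# Assumes columns: T ID KEY MODE OWNER GROUP.
shm_ids_for_owner() {
    awk -v owner="$1" '$1 == "m" && $5 == owner { print $2 }'
}
```

With the LDM stopped and no McIDAS sessions running as that user,
"ipcs | shm_ids_for_owner ldma" lists the candidates.  Each ID should be
looked at before it is handed to ipcrm -m.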

I'll have to think about this one to figure out if the problem was caused
by switching the LDM feed (!?).

Later...

Tom