
20020516: help with ldm (complete message)



>From: "Jennie L. Moody" <address@hidden>
>Organization: UVa
>Keywords: 200205161823.g4GINca18632 LDM

Hi Jennie,

>Well, it was inevitable that eventually I would have to start
>paying attention to things and trying to fix problems.

Yup.  Things were working pretty well a while ago.  I worked with
Tony to cut down on the amount of grib data that gets decoded
with McIDAS-XCD.  This was crucial at the time since windfall
kept running out of disk space (in /p4 ?).

>Our webpage stopped updating yesterday, and it looks like we lost
>our connection to our upstream site.  Today I got on to see
>if I could just restart the ldm.  After realizing that I
>had to make a new password just to get in (I thought
>I knew the old one?), I found that there is plenty
>I have forgotten.  I was thinking I could just
>stop the ldm, ldmadmin stop
>and then restart it
>ldmadmin start.

That is the correct sequence with the exception that you have
to wait until all LDM processes exit before restarting.
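
As a rough sketch of that wait, assuming a POSIX shell and that the
check is run as the user who owns the processes: the function name
wait_for_pids and the 30-second timeout are illustrative and not part
of the LDM.  The PIDs would come from a 'ps -u ldm' listing taken
before the stop.

```shell
# wait_for_pids TIMEOUT PID...
# Poll once per second until none of the given PIDs exists any more.
# Returns 0 when all are gone, 1 if any is still present after
# TIMEOUT seconds.  (Hypothetical helper, not an LDM utility.)
wait_for_pids() {
    timeout=$1
    shift
    elapsed=0
    while :; do
        alive=0
        for pid in "$@"; do
            # kill -0 sends no signal; it only tests whether the PID
            # exists and may be signalled by the current user
            if kill -0 "$pid" 2>/dev/null; then
                alive=1
            fi
        done
        if [ "$alive" -eq 0 ]; then
            return 0
        fi
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
}

# Possible use, with PIDs taken from a prior 'ps -u ldm':
#   ldmadmin stop
#   wait_for_pids 30 469 467 470 471 && ldmadmin start
```

If the function returns 1, some process has refused to exit and a
restart would not be safe yet.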

>But I get the message that there is still a server running:
>
>windfall: /usr/local/ldm/etc $ ps -lf -u ldm
> F S      UID   PID  PPID  C PRI NI     ADDR     SZ    WCHAN    STIME TTY      TIME CMD
> 8 R      ldm   469   467 34  85 22 60d1c1f0  37463          14:10:02 ?       942:46 pqact -d /usr/local/ldm -q /usr/loc
> 8 S      ldm   467     1  0  47 22 608c4010    275 608c4080 14:10:01 ?        0:00 rpc.ldmd -q /usr/local/ldm/data/ldm
> 8 O      ldm   470   467 33  75 22 60d1c8b0  37452          14:10:02 ?       928:12 pqbinstats -d /p4/logs -q /usr/loca
> 8 R      ldm   471   467 34  85 22 60d1b470  37472          14:10:02 ?       962:03 rpc.ldmd -q /usr/local/ldm/data/ldm
> 8 S      ldm 20640 14588  0  51 20 60d1adb0    204 60d1ae20 13:49:00 pts/1    0:00 -ksh

It is strange to see pqact in the list of still active processes.
Did you check to see if there was still available disk space?  More
down below.

>windfall: /usr/local/ldm/etc $ whoami
>ldm
>windfall: /usr/local/ldm/etc $ ldmadmin stop
>stopping the LDM server...
>LDM server stopped
>windfall: /usr/local/ldm/etc $ ps -lf -u ldm
> F S      UID   PID  PPID  C PRI NI     ADDR     SZ    WCHAN    STIME TTY      TIME CMD
> 8 R      ldm   469   467 34  85 22 60d1c1f0  37463          14:10:02 ?       942:59 pqact -d /usr/local/ldm -q /usr/loc
> 8 S      ldm   467     1  0  47 22 608c4010    275 608c4080 14:10:01 ?        0:00 rpc.ldmd -q /usr/local/ldm/data/ldm
> 8 R      ldm   470   467 32  95 22 60d1c8b0  37452          14:10:02 ?       928:24 pqbinstats -d /p4/logs -q /usr/loca
> 8 O      ldm   471   467 34  75 22 60d1b470  37472          14:10:02 ?       962:16 rpc.ldmd -q /usr/local/ldm/data/ldm
> 8 S      ldm 20640 14588  0  51 20 60d1adb0    204 60d1ae20 13:49:00 pts/1    0:00 -ksh
>
>So this didn't seem to do anything, using the dumb approach of
>thinking that some of these processes wouldn't stop if the
>delqueue wasn't run, I tried that (don't ask me why I thought
>this would work...the mental equivalent of pushing buttons)

At this point, I would forcibly kill all processes that refuse
to die; verify that they are no longer running; delete and remake
the queue; and then restart.
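
One possible way to script the kill-and-verify step is sketched below.
It is an assumption-laden variant, not what was actually run on
windfall: it tries SIGTERM first and falls back to SIGKILL only for
processes that are still present after a grace period.  The name
stop_pids and the grace period are made up for illustration.

```shell
# stop_pids GRACE PID...
# Ask each process to exit with SIGTERM, allow up to GRACE seconds,
# then send SIGKILL to any that remain.  (Hypothetical helper.)
stop_pids() {
    grace=$1
    shift
    kill -TERM "$@" 2>/dev/null || true
    waited=0
    while [ "$waited" -lt "$grace" ]; do
        remaining=0
        for pid in "$@"; do
            if kill -0 "$pid" 2>/dev/null; then
                remaining=1
            fi
        done
        if [ "$remaining" -eq 0 ]; then
            return 0
        fi
        sleep 1
        waited=$((waited + 1))
    done
    # SIGKILL cannot be caught or ignored by the target process
    for pid in "$@"; do
        if kill -0 "$pid" 2>/dev/null; then
            kill -KILL "$pid" 2>/dev/null || true
        fi
    done
    return 0
}

# Possible use, followed by the queue rebuild and restart:
#   stop_pids 10 469 467 470 471
#   ps -u ldm            # verify nothing is left
#   ldmadmin delqueue
#   ldmadmin mkqueue
#   ldmadmin start
```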

>windfall: /usr/local/ldm $ ldmadmin stop
>stopping the LDM server...
>LDM server stopped
>windfall: /usr/local/ldm $ ldmadmin delqueue
>May 16 18:13:20 UTC windfall.evsc.Virginia.EDU : delete_pq: A server is running, cannot delete the queue

Right.  The processes that access the queue will have a lock on it,
so you shouldn't be able to delete it.

>So, I don't know what's up.....sadly, I need a refresher course,
>but in the meantime, maybe someone out there can tell me what to
>do, or jump in here....I will happily share the new access info
>for user ldm...

OK, I just logged on.  What I did was:

windfall: /usr/local/ldm $ ps -u ldm
   PID TTY      TIME CMD
   469 ?       966:43 pqact
   467 ?        0:00 rpc.ldmd
   470 ?       951:45 pqbinsta
   471 ?       986:39 rpc.ldmd
 25380 pts/8    0:00 ksh
windfall: /usr/local/ldm $ kill -9 469 467 470 471
windfall: /usr/local/ldm $ ldmadmin delqueue
windfall: /usr/local/ldm $ ldmadmin mkqueue
windfall: /usr/local/ldm $ ldmadmin start
windfall: /usr/local/ldm $ ps -u ldm
   PID TTY      TIME CMD
 25487 ?        0:00 ingetext
 25452 ?        0:00 ingetext
 25467 ?        0:00 startxcd
 25485 ?        0:00 ingebin.
 25468 ?        0:00 dmsfc.k
 25448 ?        0:00 pqbinsta
 25446 ?        0:00 startxcd
 25470 ?        0:00 dmgrid.k
 25449 ?        0:00 rpc.ldmd
 25453 ?        0:00 ingebin.
 25447 ?        0:00 pqact
 25469 ?        0:00 dmraob.k
 25445 ?        0:00 rpc.ldmd
 25380 pts/8    0:00 ksh

windfall: /usr/local/ldm $ ldmadmin watch
(Type ^D or ^C when finished)
May 16 18:47:26 pqutil:    25724 20020516174856.095     HDS 145  YHWB90 KWBG 161700 /mRUC2
May 16 18:47:26 pqutil:    19326 20020516174856.200     HDS 150  YTWB90 KWBG 161700 /mRUC2
May 16 18:47:26 pqutil:    19326 20020516174856.298     HDS 155  YVWB85 KWBG 161700 /mRUC2
May 16 18:47:26 pqutil:    19326 20020516174856.415     HDS 160  YUWB90 KWBG 161700 /mRUC2
May 16 18:47:26 pqutil:    19326 20020516174856.558     HDS 165  YVWB90 KWBG 161700 /mRUC2
May 16 18:47:27 pqutil:     4821 20020516174856.583     HDS 167  SDUS84 KLZK 161744 /pDPALZK
 ...

So, windfall is again feeding from ldm.meteo.psu.edu.  The 'ldmadmin watch'
shows that products are being received as expected, so LDM-related
things (including McIDAS-XCD decoders) are running.  How this relates
to your web page generation of products we can't say, but presumably
they will come back as data gets decoded and cron-initiated scripts run.

Since things were kinda messed up, I took the opportunity to do some further
cleaning up:

ldmadmin stop
<verify that all LDM processes exit>
cd ~ldm/.mctmp
/bin/rm -rf *

This cleans up subdirectories created by LDM-initiated McIDAS processes.
There were a few left in .mctmp that were fairly old (listing done before
the 'ldmadmin stop' above):

windfall: /usr/local/ldm/.mctmp $ ls -alt
total 46
drwx------  23 ldm      mcidas       512 May 16 14:54 ./
drwx------   2 ldm      mcidas       512 May 16 14:46 2902/
drwx------   2 ldm      mcidas       512 May 16 14:46 3300/
drwx------   2 ldm      mcidas       512 May 16 14:46 3601/
drwxr-xr-x  11 ldm      mcidas      1024 May 16 14:46 ../
drwx------   2 ldm      mcidas       512 May 15 13:40 66406/
drwx------   2 ldm      mcidas       512 May 14 07:20 702/
drwx------   2 ldm      mcidas       512 May 14 07:20 801/
drwx------   2 ldm      mcidas       512 May 12 22:10 401/
drwx------   2 ldm      mcidas       512 May 12 22:10 302/
drwx------   2 ldm      mcidas       512 May 12 21:50 202/
drwx------   2 ldm      mcidas       512 Apr 25 06:50 301/
drwx------   2 ldm      mcidas       512 Apr 23 14:50 571600/
drwx------   2 ldm      mcidas       512 Apr 23 14:50 5801/
drwx------   2 ldm      mcidas       512 Apr 23 14:50 778602/
drwx------   2 ldm      mcidas       512 Apr  4 14:26 570300/
drwx------   2 ldm      mcidas       512 Apr  4 14:26 777502/
drwx------   2 ldm      mcidas       512 Mar 19 13:20 4501/
drwx------   2 ldm      mcidas       512 Feb 18 12:37 1090511/
drwx------   2 ldm      mcidas       512 Feb 16 07:37 100716/
drwx------   2 ldm      mcidas       512 Apr 19  2001 683109/
drwx------   2 ldm      mcidas       512 Feb 19  2001 43502/
drwx------   2 ldm      mcidas       512 Feb 19  2001 84901/

The old ones in the list (ones previous to May 16) are the result of
aborted processes.  Cleaning them up is a _good thing_ :-)
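
For a more cautious cleanup, the stale subdirectories can be listed
before anything is removed.  The sketch below assumes a find that
supports -mindepth and -maxdepth (GNU and BSD find do; not every
vendor find does).  The function name list_stale_dirs and the one-day
threshold in the usage line are illustrative.

```shell
# list_stale_dirs DIR DAYS
# Print the immediate subdirectories of DIR that were last modified
# more than DAYS days ago.  Nothing is deleted.  (Hypothetical helper.)
list_stale_dirs() {
    find "$1" -mindepth 1 -maxdepth 1 -type d -mtime +"$2"
}

# Possible use, after 'ldmadmin stop' and once all LDM processes exit:
#   list_stale_dirs ~ldm/.mctmp 1
```

Reviewing the list first avoids removing a subdirectory that belongs to
a McIDAS process that is still running.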

After making sure that there were no shared memory segments still
allocated to 'ldm' (again, McIDAS use), I restarted the LDM:
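
The shared memory check can be sketched with 'ipcs -m', which lists
shared memory segments.  Its column layout differs between systems, so
the hypothetical filter below simply prints any line in which some
field is exactly the user name; treat it as a rough check, not a
parser.

```shell
# shm_lines_for_user USER
# Read 'ipcs -m' output on stdin and print the lines in which some
# whitespace-separated field is exactly USER.  (Hypothetical helper.)
shm_lines_for_user() {
    awk -v user="$1" '{
        for (i = 1; i <= NF; i++) {
            if ($i == user) { print; break }
        }
    }'
}

# Possible use (prints nothing when no line mentions ldm):
#   ipcs -m | shm_lines_for_user ldm
```

A leftover segment found this way could then be removed with ipcrm,
using the segment ID shown in the listing.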

ldmadmin start

Things appear to be running smoothly, and the load on windfall is low:

last pid: 27157;  load averages:  0.26,  0.35,  0.91                   15:04:54
66 processes:  65 sleeping, 1 on cpu
CPU states: 76.2% idle, 18.0% user,  3.8% kernel,  2.0% iowait,  0.0% swap
Memory: 384M real, 5920K free, 58M swap in use, 1266M swap free

>thanks in advance, Tom or Anne or whomever....

Tom here...

>(by the way, this isn't any really time-sensitive
>issue, no operational or quasi-operational work
>going on here)

No problem.  This was a quick fix.

Talk to you later...

Tom

>From address@hidden Thu May 16 18:51:27 2002

Thanks so much Tom!

My instinct was to just kill the processes, so I don't
know why I didn't, just confusion I guess.

Jennie