
Re: 20020822: ldmd won't stay running



John C Nordlie wrote:
> 
> Ok, I've hacked the ldmd.conf and ldmadmin files to include
> the logfile override.  I also rebooted the machine and zero'd
> the log file.  Here is the output of one attempt to start
> the ingestor with 'ldmadmin start':
> 

Hi John,

Thanks!  This is helpful.  I see a few things going on:


> Aug 23 15:52:15 rpc.ldmd[179]: Starting Up (built: Jun 12 2002 15:26:16)
> Aug 23 15:52:15 amelia[183]: run_requester: Starting Up:
> amelia.geol.iastate.edu
> Aug 23 15:52:15 amelia[183]: run_requester: 20020823145215.937 TS_ENDT
> {{HDS,  ".*"},{MCIDAS,  ".*"},{IDS|DDPLUS,  ".*"}}
> Aug 23 15:52:15 remus[185]: run_requester: Starting Up: remus.rwic.und.edu
> Aug 23 15:52:15 remus[185]: run_requester: 20020823145215.940 TS_ENDT
> {{NLDN,  ".*"}}
> Aug 23 15:52:15 129.15.194.231[187]: run_requester: Starting Up:
> 129.15.194.231
> Aug 23 15:52:15 129.15.194.232[188]: run_requester: Starting Up:
> 129.15.194.232
> Aug 23 15:52:15 129.15.194.232[188]: run_requester: 20020823145215.944
> TS_ENDT {{ANY,  ".*"}}
> Aug 23 15:52:15 129.15.194.233[189]: run_requester: Starting Up:
> 129.15.194.233
> Aug 23 15:52:15 129.15.194.233[189]: run_requester: 20020823145215.946
> TS_ENDT {{ANY,  ".*"}}
> Aug 23 15:52:15 129.15.194.234[190]: run_requester: Starting Up:
> 129.15.194.234
> Aug 23 15:52:15 129.15.194.234[190]: run_requester: 20020823145215.947
> TS_ENDT {{ANY,  ".*"}}
> Aug 23 15:52:15 129.15.194.236[192]: run_requester: Starting Up:
> 129.15.194.236
> Aug 23 15:52:15 129.15.194.236[192]: run_requester: 20020823145215.950
> TS_ENDT {{ANY,  ".*"}}
> Aug 23 15:52:15 129.15.194.237[193]: run_requester: Starting Up:
> 129.15.194.237
> Aug 23 15:52:15 129.15.194.237[193]: run_requester: 20020823145215.951
> TS_ENDT {{ANY,  ".*"}}
> Aug 23 15:52:15 129.15.194.238[194]: run_requester: Starting Up:
> 129.15.194.238
> Aug 23 15:52:15 129.15.194.238[194]: run_requester: 20020823145215.953
> TS_ENDT {{ANY,  ".*"}}
> Aug 23 15:52:15 aeolus[184]: run_requester: Starting Up: aeolus.ucsd.edu
> Aug 23 15:52:15 aeolus[184]: run_requester: 20020823145215.960 TS_ENDT
> {{NNEXRAD,  "/p......"},{FNEXRAD,
> "/p...(BIS|MBX|MVX|ABR|FSD|UDX|DLH|MPX)"}}
> Aug 23 15:52:15 dns2[186]: run_requester: Starting Up: dns2.cmc.ec.gc.ca
> Aug 23 15:52:15 dns2[186]: run_requester: 20020823145215.962 TS_ENDT
> {{GEM,  ".*"}}
> Aug 23 15:52:15 129.15.194.231[187]: run_requester: 20020823145215.943
> TS_ENDT {{ANY,  ".*"}}
> Aug 23 15:52:15 129.15.194.235[191]: run_requester: Starting Up:
> 129.15.194.235
> Aug 23 15:52:15 129.15.194.235[191]: run_requester: 20020823145215.966
> TS_ENDT {{ANY,  ".*"}}
> Aug 23 15:52:15 remus[185]: FEEDME(remus.rwic.und.edu): OK
> Aug 23 15:52:15 pqact[181]: Starting Up
> Aug 23 15:52:15 pqbinstats[180]: Starting Up (179)
> Aug 23 15:52:16 pqsurf[182]: Starting Up (179)
> Aug 23 15:52:16 pqsurf[182]: pq_open failed:
> /usr/local/ldm/data/pqsurf.pq: No such file or directory


It looks like you're trying to run pqsurf without a pqsurf queue: pq_open
fails because /usr/local/ldm/data/pqsurf.pq does not exist.  The LDM is
coded to exit when one of its children exits.  Below we can see pqsurf
(PID 182) exit with status 1, after which the parent rpc.ldmd (PID 179)
exits and terminates the whole process group.
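
If it helps to pick the culprit out of a longer log, here is a rough
Python sketch that maps the PID in a "child N exited with status S" line
back to the program name.  It is not part of the LDM: the function name
and the regular expressions are my own assumptions about the log format
shown in this message, and it expects wrapped lines to have been rejoined
first.

```python
import re

# Assumed syslog-style layout, as in the log excerpt in this message:
#   "Aug 23 15:52:16 pqsurf[182]: pq_open failed: ..."
ENTRY = re.compile(
    r"^\w{3}\s+\d+\s+[\d:]{8}\s+(?P<name>\S+)\[(?P<pid>\d+)\]:\s*(?P<msg>.*)$"
)
# Message logged by the parent when it reaps a child.
CHILD_EXIT = re.compile(r"child (\d+) exited with status (\d+)")


def find_failed_children(lines):
    """Return (name, pid, status) for each child reported with a nonzero status."""
    names = {}     # pid -> program name, taken from the "name[pid]:" tag
    failures = []  # (pid, status) pairs in the order they were logged
    for line in lines:
        m = ENTRY.match(line.strip())
        if not m:
            continue  # skip lines that do not look like log entries
        names.setdefault(int(m["pid"]), m["name"])
        exited = CHILD_EXIT.search(m["msg"])
        if exited and int(exited.group(2)) != 0:
            failures.append((int(exited.group(1)), int(exited.group(2))))
    # Resolve names after the full scan, so ordering in the log does not matter.
    return [(names.get(pid, "unknown"), pid, status) for pid, status in failures]


sample = [
    "Aug 23 15:52:15 rpc.ldmd[179]: Starting Up (built: Jun 12 2002 15:26:16)",
    "Aug 23 15:52:16 pqsurf[182]: Starting Up (179)",
    "Aug 23 15:52:16 pqsurf[182]: pq_open failed: /usr/local/ldm/data/pqsurf.pq: No such file or directory",
    "Aug 23 15:52:16 rpc.ldmd[179]: child 182 exited with status 1",
]
print(find_failed_children(sample))
```

Run against your full log file, this should point at whichever child
brought the parent down, which is handy if more than one thing is wrong.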



> Aug 23 15:52:16 pqsurf[182]: Exiting
> Aug 23 15:52:16 rpc.ldmd[179]: Exiting
> Aug 23 15:52:16 remus[185]: Exiting
> Aug 23 15:52:16 pqbinstats[180]: Exiting
> Aug 23 15:52:16 pqsurf[182]: waitpid: No child processes
> Aug 23 15:52:16 pqsurf[182]: Number of products 0
> Aug 23 15:52:16 pqsurf[182]: Number of observations 0
> Aug 23 15:52:16 pqsurf[182]: Number of dups 0
> Aug 23 15:52:16 rpc.ldmd[179]: Terminating process group
> Aug 23 15:52:16 rpc.ldmd[179]: child 182 exited with status 1
> Aug 23 15:52:16 129.15.194.233[189]: FEEDME(129.15.194.233): reclass:
> 20020823145215.946 TS_ENDT {{NEXRD2,  ".*"}}
> Aug 23 15:52:16 129.15.194.234[190]: FEEDME(129.15.194.234): reclass:
> 20020823145215.947 TS_ENDT {{NEXRD2,  ".*"}}
> Aug 23 15:52:16 129.15.194.236[192]: FEEDME(129.15.194.236): reclass:
> 20020823145215.950 TS_ENDT {{NEXRD2,  ".*"}}
> Aug 23 15:52:16 pqact[181]: Exiting
> Aug 23 15:52:16 129.15.194.235[191]: FEEDME(129.15.194.235): reclass:
> 20020823145215.966 TS_ENDT {{NEXRD2,  ".*"}}
> Aug 23 15:52:16 129.15.194.238[194]: FEEDME(129.15.194.238): reclass:
> 20020823145215.953 TS_ENDT {{NEXRD2,  ".*"}}
> Aug 23 15:52:16 129.15.194.233[189]: FEEDME(129.15.194.233): OK
> Aug 23 15:52:16 129.15.194.233[189]: Exiting
> Aug 23 15:52:16 129.15.194.234[190]: FEEDME(129.15.194.234): OK
> Aug 23 15:52:16 129.15.194.234[190]: Exiting
> Aug 23 15:52:16 129.15.194.236[192]: FEEDME(129.15.194.236): OK
> Aug 23 15:52:16 129.15.194.236[192]: Exiting
> Aug 23 15:52:16 129.15.194.238[194]: FEEDME(129.15.194.238): OK
> Aug 23 15:52:16 129.15.194.238[194]: Exiting
> Aug 23 15:52:16 129.15.194.235[191]: FEEDME(129.15.194.235): OK
> Aug 23 15:52:16 129.15.194.235[191]: Exiting
> [195] 020823/1052 [DC 3]  Starting up.
> [195] 020823/1052 [DC 5]  Normal termination.
> [195] 020823/1052 [DC 2]  Number of bulletins read and processed: 0
> [195] 020823/1052 [DC 6]  Shutting down.
> Aug 23 15:52:18 129.15.194.231[187]: FEEDME(129.15.194.231): reclass:
> 20020823145215.943 TS_ENDT {{NEXRD2,  ".*"}}
> Aug 23 15:52:18 amelia[183]: FEEDME(amelia.geol.iastate.edu): OK
> Aug 23 15:52:18 amelia[183]: Exiting
> Aug 23 15:52:27 129.15.194.231[187]: FEEDME(129.15.194.231): OK
> Aug 23 15:52:27 129.15.194.231[187]: Exiting
> Aug 23 15:52:36 129.15.194.237[193]: FEEDME(129.15.194.237): reclass:
> 20020823145215.951 TS_ENDT {{NEXRD2,  ".*"}}
> Aug 23 15:52:41 dns2[186]: FEEDME(dns2.cmc.ec.gc.ca): can't contact
> portmapper: RPC: Timed out
> Aug 23 15:52:41 aeolus[184]: FEEDME(aeolus.ucsd.edu): can't contact
> portmapper: RPC: Timed out

I wonder about these timeouts.  I don't know whether they are a separate
problem or a side effect of the process group being terminated, although
I suspect the latter.  I would try an ldmping to these sites
(dns2.cmc.ec.gc.ca and aeolus.ucsd.edu) to confirm that the RPC call
doesn't time out.
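
To collect the hosts worth an ldmping, something like the following
Python sketch could be used.  Again, this is only an illustration: the
function name and the pattern are my assumptions based on the FEEDME
messages in this log, and wrapped lines need to be rejoined beforehand.

```python
import re

# Matches e.g. "FEEDME(dns2.cmc.ec.gc.ca): can't contact portmapper: RPC: Timed out"
# and          "FEEDME(129.15.194.232): select: RPC: Timed out"
TIMEOUT = re.compile(r"FEEDME\((?P<host>[^)]+)\):.*RPC: Timed out")


def hosts_with_rpc_timeouts(lines):
    """Return upstream hosts whose FEEDME request logged an RPC timeout."""
    hosts = []
    for line in lines:
        m = TIMEOUT.search(line)
        if m and m["host"] not in hosts:  # keep first-seen order, no repeats
            hosts.append(m["host"])
    return hosts


timeout_sample = [
    "Aug 23 15:52:18 amelia[183]: FEEDME(amelia.geol.iastate.edu): OK",
    "Aug 23 15:52:41 dns2[186]: FEEDME(dns2.cmc.ec.gc.ca): can't contact portmapper: RPC: Timed out",
    "Aug 23 15:52:41 aeolus[184]: FEEDME(aeolus.ucsd.edu): can't contact portmapper: RPC: Timed out",
    "Aug 23 15:53:15 129.15.194.232[188]: FEEDME(129.15.194.232): select: RPC: Timed out",
]
for host in hosts_with_rpc_timeouts(timeout_sample):
    print(host)
```

Each host it prints is a candidate for an ldmping once the LDM is
staying up.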



> Aug 23 15:53:11 dns2[186]: Exiting
> Aug 23 15:53:11 aeolus[184]: Exiting
> Aug 23 15:53:15 129.15.194.232[188]: FEEDME(129.15.194.232): select: RPC:
> Timed out
> Aug 23 15:53:19 129.15.194.237[193]: h_clnt_call: 129.15.194.237: FEEDME:
> time elapsed  43.245131
> Aug 23 15:53:19 129.15.194.237[193]: FEEDME(129.15.194.237): OK
> Aug 23 15:53:19 129.15.194.237[193]: Exiting
> Aug 23 15:53:45 129.15.194.232[188]: Exiting
> 

So, please fix the pqsurf problem and let me know what happens. 

This sounds different from what you described earlier, when the LDM would
run for a few hours and then quit.  Perhaps something else is also going
on...

Anne
-- 
***************************************************
Anne Wilson                     UCAR Unidata Program            
address@hidden                 P.O. Box 3000
                                  Boulder, CO  80307
----------------------------------------------------
Unidata WWW server       http://www.unidata.ucar.edu/
****************************************************