[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: LDM mystery



Bob,

The top-level LDM server appears to be hung, currently.  In order to
discover where, I've asked Sarah to have root use the strace(1) utility.
If that fails, then I think we'll need to install a version of the LDM
that has debugging enabled and execute it in an environment that
supports core dumps from programs that are owned by root and setuid.

Regards,
Steve Emmerson

------Original Message

>Date: Wed, 31 Aug 2005 17:01:11 +0000
>From: Bob Lipschutz <address@hidden>
>Organization: NOAA/NWS/FSL
>To: Steve Emmerson <address@hidden>
>Subject: LDM mystery

Thanks for working with us on our ldm problem.  Unfortunately,
we seem to be running out of things to try, so we're very
open to any suggestions.  

To date, we've tried installing various versions of ldm with
various kernels, with no change in the result.  We see this
same behavior with several different hosts, although others
have been working for us without any problems.

Here's what we are seeing as of 8/31 16:36Z.... output from
netstat, ps, ldmping, rpcinfo, and ldm server and client logs.

Note that the one zombie process did not get reaped from yesterday...
this suggests to me that the parent process is not getting back to
service its children.  Also, when we see any connections in SYN_RECV
state, the server no longer accepts new connections (although
existing connections remain up and funtioning).  The condition
appears in both the server and client logs, with a nullproc_6
timeout on the server and "No IS_ALIVE reply" timeout on the
client.

Thanks again!

    Bob Lipschutz
    303-497-6636

eldmf1:/usr/local/ldm/etc> netstat
tcp        0      0 eldmf1.fsl.noaa.gov:388     kelpie.fsl.noaa.gov:52277   
SYN_RECV
tcp        0      0 eldmf1.fsl.noaa.gov:388     kelpie.fsl.noaa.gov:52281   
SYN_RECV
tcp        0      0 eldmf1.fsl.noaa.gov:388     kelpie.fsl.noaa.gov:52268   
SYN_RECV
tcp        0      0 eldmf1.fsl.noaa.gov:388     kelpie.fsl.noaa.gov:52258   
SYN_RECV
tcp        0      0 eldmf1.fsl.noaa.gov:388     kelpie.fsl.noaa.gov:52267   
SYN_RECV
tcp        0      0 eldmf1.fsl.noaa.gov:388     kelpie.fsl.noaa.gov:52257   
SYN_RECV
tcp        0      0 eldmf1.fsl.noaa.gov:388     kelpie.fsl.noaa.gov:52300   
SYN_RECV
tcp        0      0 eldmf1.fsl.noaa.gov:388     kelpie.fsl.noaa.gov:52270   
SYN_RECV
tcp        0      0 eldmf1.fsl.noaa.gov:388     yampa.fsl.noaa.gov:43990    
SYN_RECV
tcp        0      0 eldmf1.fsl.noaa.gov:388     kelpie.fsl.noaa.gov:52298   
SYN_RECV
tcp        0      0 eldmf1.fsl.noaa.gov:388     kelpie.fsl.noaa.gov:52379   
SYN_RECV
tcp        0      0 eldmf1.fsl.noaa.gov:388     yampa.fsl.noaa.gov:43991    
SYN_RECV
tcp        0      0 eldmf1.fsl.noaa.gov:388     kelpie.fsl.noaa.gov:52259   
SYN_RECV
tcp        0      0 eldmf1.fsl.noaa.gov:388     kelpie.fsl.noaa.gov:52389   
SYN_RECV
tcp        0      0 eldmf1.fsl.noaa.gov:388     kelpie.fsl.noaa.gov:52279   
SYN_RECV
tcp        0      0 eldmf1.fsl.noaa.gov:388     kelpie.fsl.noaa.gov:52382   
SYN_RECV
tcp        0      0 eldmf1.fsl.noaa.gov:388     kelpie.fsl.noaa.gov:52278   
SYN_RECV
tcp        0      0 eldmf1.fsl.noaa.gov:388     kelpie.fsl.noaa.gov:52269   
SYN_RECV
tcp        0      0 eldmf1.fsl.noaa.gov:388     kelpie.fsl.noaa.gov:52381   
SYN_RECV
tcp        0      0 eldmf1.fsl.noaa.gov:388     kelpie.fsl.noaa.gov:52405   
SYN_RECV
tcp        0      0 eldmf1.fsl.noaa.gov:388     yampa.fsl.noaa.gov:43992    
SYN_RECV
tcp        0      0 eldmf1.fsl.noaa.gov:388     kelpie.fsl.noaa.gov:52280   
SYN_RECV
tcp        0      0 eldmf1.fsl.noaa.gov:388     kelpie.fsl.noaa.gov:52390   
SYN_RECV
tcp        0      0 eldmf1.fsl.noaa.gov:388     kelpie.fsl.noaa.gov:52397   
SYN_RECV
tcp        0      0 eldmf1.fsl.noaa.gov:388     kelpie.fsl.noaa.gov:52260   
SYN_RECV
tcp        0      0 eldmf1.fsl.noaa.gov:388     kelpie.fsl.noaa.gov:52380   
SYN_RECV
tcp        0      0 eldmf1.fsl.noaa.gov:388     yampa.fsl.noaa.gov:43989    
SYN_RECV
tcp        0      0 eldmf1.fsl.noaa.gov:388     yampa.fsl.noaa.gov:40063    
ESTABLISHED
tcp        0     28 eldmf1.fsl.noaa.gov:38204   prism.fsl.noaa.gov:388      
ESTABLISHED
tcp        0      0 eldmf1.fsl.noaa.gov:388     yampa.fsl.noaa.gov:40066    
ESTABLISHED
tcp        0      0 eldmf1.fsl.noaa.gov:388     yampa.fsl.noaa.gov:40065    
ESTABLISHED
tcp        0      0 eldmf1.fsl.noaa.gov:388     yampa.fsl.noaa.gov:40064    
ESTABLISHED
<snip>


eldmf1:/usr/local/ldm/etc> ps -efl | grep rpc.ldm
5 S rtoper   23317     1  0  75   0 -   763 322475 Aug30 ?        00:00:00 
rpc.ldmd -v -q /usr/local/ldm/data/ldm.pq /usr/local/ldm/etc/ldmd.conf
1 S rtoper   23319 23317  0  75   0 -  4749 -      Aug30 ?        00:02:03 
rpc.ldmd -v -q /usr/local/ldm/data/ldm.pq /usr/local/ldm/etc/ldmd.conf
1 S rtoper   23325 23317  0  76   0 -   927 rt_sig Aug30 ?        00:00:05 
rpc.ldmd -v -q /usr/local/ldm/data/ldm.pq /usr/local/ldm/etc/ldmd.conf
1 S rtoper   23326 23317  0  76   0 -   927 rt_sig Aug30 ?        00:00:11 
rpc.ldmd -v -q /usr/local/ldm/data/ldm.pq /usr/local/ldm/etc/ldmd.conf
1 S rtoper   23327 23317  0  76   0 -   927 rt_sig Aug30 ?        00:00:05 
rpc.ldmd -v -q /usr/local/ldm/data/ldm.pq /usr/local/ldm/etc/ldmd.conf
1 S rtoper   23328 23317  0  76   0 -   927 rt_sig Aug30 ?        00:00:04 
rpc.ldmd -v -q /usr/local/ldm/data/ldm.pq /usr/local/ldm/etc/ldmd.conf
1 S rtoper   23331 23317  0  76   0 -   927 rt_sig Aug30 ?        00:00:03 
rpc.ldmd -v -q /usr/local/ldm/data/ldm.pq /usr/local/ldm/etc/ldmd.conf
1 S rtoper   23332 23317  0  76   0 -   927 rt_sig Aug30 ?        00:00:12 
rpc.ldmd -v -q /usr/local/ldm/data/ldm.pq /usr/local/ldm/etc/ldmd.conf
1 S rtoper   23333 23317  0  76   0 -   927 rt_sig Aug30 ?        00:00:03 
rpc.ldmd -v -q /usr/local/ldm/data/ldm.pq /usr/local/ldm/etc/ldmd.conf
1 S rtoper   23334 23317  0  76   0 -   927 rt_sig Aug30 ?        00:00:02 
rpc.ldmd -v -q /usr/local/ldm/data/ldm.pq /usr/local/ldm/etc/ldmd.conf
1 S rtoper   23335 23317  0  76   0 -   927 rt_sig Aug30 ?        00:00:03 
rpc.ldmd -v -q /usr/local/ldm/data/ldm.pq /usr/local/ldm/etc/ldmd.conf
1 S rtoper   23336 23317  0  76   0 -   927 rt_sig Aug30 ?        00:00:02 
rpc.ldmd -v -q /usr/local/ldm/data/ldm.pq /usr/local/ldm/etc/ldmd.conf
1 S rtoper   23337 23317  0  75   0 -   927 rt_sig Aug30 ?        00:00:02 
rpc.ldmd -v -q /usr/local/ldm/data/ldm.pq /usr/local/ldm/etc/ldmd.conf
1 S rtoper   23338 23317  0  75   0 -   927 rt_sig Aug30 ?        00:00:02 
rpc.ldmd -v -q /usr/local/ldm/data/ldm.pq /usr/local/ldm/etc/ldmd.conf
1 S rtoper   23339 23317  0  75   0 -   927 rt_sig Aug30 ?        00:00:02 
rpc.ldmd -v -q /usr/local/ldm/data/ldm.pq /usr/local/ldm/etc/ldmd.conf
1 S rtoper   23340 23317  0  76   0 -   927 rt_sig Aug30 ?        00:00:03 
rpc.ldmd -v -q /usr/local/ldm/data/ldm.pq /usr/local/ldm/etc/ldmd.conf
1 S rtoper   23341 23317  0  76   0 -   927 rt_sig Aug30 ?        00:00:02 
rpc.ldmd -v -q /usr/local/ldm/data/ldm.pq /usr/local/ldm/etc/ldmd.conf
1 S rtoper   23342 23317  0  76   0 -   927 rt_sig Aug30 ?        00:00:02 
rpc.ldmd -v -q /usr/local/ldm/data/ldm.pq /usr/local/ldm/etc/ldmd.conf
1 S rtoper   23343 23317  0  76   0 -   927 rt_sig Aug30 ?        00:00:02 
rpc.ldmd -v -q /usr/local/ldm/data/ldm.pq /usr/local/ldm/etc/ldmd.conf
1 S rtoper   23344 23317  0  76   0 -   927 rt_sig Aug30 ?        00:00:02 
rpc.ldmd -v -q /usr/local/ldm/data/ldm.pq /usr/local/ldm/etc/ldmd.conf
1 S rtoper   23345 23317  0  76   0 -   927 rt_sig Aug30 ?        00:00:02 
rpc.ldmd -v -q /usr/local/ldm/data/ldm.pq /usr/local/ldm/etc/ldmd.conf
1 S rtoper   23346 23317  0  76   0 -   927 rt_sig Aug30 ?        00:00:03 
rpc.ldmd -v -q /usr/local/ldm/data/ldm.pq /usr/local/ldm/etc/ldmd.conf
1 S rtoper   23347 23317  0  76   0 -   927 rt_sig Aug30 ?        00:00:03 
rpc.ldmd -v -q /usr/local/ldm/data/ldm.pq /usr/local/ldm/etc/ldmd.conf
1 Z rtoper   24609 23317  0  75   0 -     0 exit   Aug30 ?        00:00:00 
[rpc.ldmd] <defunct>


eldmf1:/usr/local/ldm/logs> ldmping -i 3 localhost
Aug 31 16:44:21      State    Elapsed Port   Remote_Host           rpc_stat
Aug 31 16:44:31   H_CLNTED   9.998874  388   localhost    select: RPC: Timed out

eldmf1:/usr/local/ldm/logs>  /usr/sbin/rpcinfo -n 388 -t eldmf1.fsl.noaa.gov 
300029 6
rpcinfo: RPC: Timed out
program 300029 version 6 is not available


>From ldmd.log on server (eldmf1):

Aug 30 17:37:09 eldmf1 yampa(feed)[23326]: up6.c:287: nullproc_6() failure to 
yampa.fsl.noaa.gov: RPC: Timed out
Aug 30 17:37:09 eldmf1 yampa(feed)[23326]: Delay: 59.9919 sec
Aug 30 17:37:09 eldmf1 yampa(feed)[23326]: pq_sequence(): 
time(insert)-time(create): 0.0149 s
Aug 30 17:37:09 eldmf1 yampa(feed)[23326]: Delay: 59.4061 sec
Aug 30 17:37:09 eldmf1 yampa(feed)[23327]: up6.c:287: nullproc_6() failure to 
yampa.fsl.noaa.gov: RPC: Timed out
Aug 30 17:37:09 eldmf1 yampa(feed)[23327]: Delay: 59.9922 sec
Aug 30 17:37:09 eldmf1 yampa(feed)[23327]: pq_sequence(): 
time(insert)-time(create): 0.0149 s
Aug 30 17:37:09 eldmf1 yampa(feed)[23327]: Delay: 59.4064 sec
Aug 30 17:37:09 eldmf1 yampa(feed)[23327]: pq_sequence(): 
time(insert)-time(create): 0.0102 s
Aug 30 17:37:09 eldmf1 yampa(feed)[23327]: Delay: 56.3724 sec
Aug 30 17:37:09 eldmf1 yampa(feed)[23327]: pq_sequence(): 
time(insert)-time(create): 0.0154 s
Aug 30 17:37:09 eldmf1 yampa(feed)[23327]: Delay: 55.3660 sec
Aug 30 17:37:09 eldmf1 yampa(feed)[23327]: pq_sequence(): 
time(insert)-time(create): 0.0111 s
Aug 30 17:37:09 eldmf1 kelpie(feed)[23340]: up6.c:287: nullproc_6() failure to 
kelpie.fsl.noaa.gov: RPC: Timed out
Aug 30 17:37:09 eldmf1 kelpie(feed)[23343]: up6.c:287: nullproc_6() failure to 
kelpie.fsl.noaa.gov: RPC: Timed out


>From ldmd.log on client (kelpie):

Aug 30 17:36:35 kelpie eldmf1[29755]: INFO: requester6.c:105: No IS_ALIVE reply 
from upstream LDM on eldmf1.fsl.noaa.gov; assuming sending LDM is alive; 
requester6.c:105: RPC: Timed out
Aug 30 17:36:35 kelpie eldmf1[29755]: Upstream LDM is alive.  Waiting...
Aug 30 17:36:37 kelpie eldmf1[29761]: INFO: requester6.c:105: No IS_ALIVE reply 
from upstream LDM on eldmf1.fsl.noaa.gov; assuming sending LDM is alive; 
requester6.c:105: RPC: Timed out
Aug 30 17:36:37 kelpie eldmf1[29761]: Upstream LDM is alive.  Waiting...
Aug 30 17:36:37 kelpie eldmf1[29760]: INFO: requester6.c:105: No IS_ALIVE reply 
from upstream LDM on eldmf1.fsl.noaa.gov; assuming sending LDM is alive; 
requester6.c:105: RPC: Timed out
Aug 30 17:36:37 kelpie eldmf1[29760]: Upstream LDM is alive.  Waiting...
Aug 30 17:36:37 kelpie eldmf1[29769]: INFO: requester6.c:105: No IS_ALIVE reply 
from upstream LDM on eldmf1.fsl.noaa.gov; assuming sending LDM is alive; 
requester6.c:105: RPC: Timed out

------End of Original Message