
[TIGGE #CUA-629523]: Re: dataportal not receiving data from tigge-ldm.ecmwf.int



Manuel,

> I have tried that from tigge-ldm and I get:
> ldm@tigge-ldm:~> /usr/sbin/rpcinfo -n 388 -t tigge-ldm.ecmwf.int 300029
> rpcinfo: RPC: Timed out
> program 300029 version 0 is not available

Well, at least this is consistent with Dataportal not being able to connect to 
Tigge-ldm.  You might run snoop(1) or tcpdump(1) in another window while you do 
this to diagnose the problem.
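
If it helps, a plain TCP connection attempt to port 388 can complement the 
rpcinfo(1) test.  The following is only a rough sketch: the function name 
probe_tcp and the 5-second timeout are my own choices, not part of the LDM 
package.  As a rule of thumb, a "refused" result suggests nothing is listening 
on the port, while a "timeout" suggests packets are being dropped or the 
listener isn't accepting connections.

```python
import socket


def probe_tcp(host, port, timeout=5.0):
    """Try one TCP connection and classify the outcome.

    Returns "open", "refused", "timeout", or "error: <reason>".
    The name and return values are illustrative, not an LDM API.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered, but nothing is listening on that port.
        return "refused"
    except socket.timeout:
        # No answer at all within the timeout.
        return "timeout"
    except OSError as exc:
        # Name-resolution failures, unreachable networks, and so on.
        return f"error: {exc}"


# Example call (not executed here):
#   probe_tcp("tigge-ldm.ecmwf.int", 388)
```

Running it both on Tigge-ldm itself and from Dataportal would show whether the 
two hosts see the same behavior.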

> > Manuel, verify that any firewall rules on Tigge-ldm will allow incoming 
> > connections to port 388 from an arbitrary, remote port.

> Last Monday, when a similar problem occurred, the only thing I did was
> to restart LDM (remember I had to kill some LDM processes that were not
> stopped gracefully by ldmadmin). This cleared the problem.
> So I'm reluctant to think it is network related, but more likely a
> process that is preventing those connections. It may have been a network
> glitch that got it into this state, though.

I think it's best if we discover the cause of the problem now to prevent it 
from recurring in the future.

> If I 'ps -fu ldm' on both tigge-ldm and tigge-portal, I get different
> results. On tigge-ldm:
> UID        PID  PPID  C STIME TTY          TIME CMD
> ldm      31408     1  0 Mar27 ?        00:00:00 vi stats.pl
> ldm      18258 18252  0 Apr05 ?        00:00:00 sshd: ldm@pts/0
> 
> ldm      18259 18258  0 Apr05 pts/0    00:00:00 -bash
> ldm      23339 23337  0 Apr10 ?        00:00:00 sshd: ldm@pts/1
> 
> ldm      23340 23339  0 Apr10 pts/1    00:00:00 -bash
> ldm      30862 30860  0 Apr11 ?        00:00:00 sshd: ldm@pts/6
> 
> ldm      30863 30862  0 Apr11 pts/6    00:00:00 -bash
> ldm      31903     1  0 Apr11 ?        00:04:13 pqact -f ANY -v -l
> log/ldmd.log -p missing etc/tigge_pqact.conf

I'm surprised that you're using pqact(1)'s "-l" option because that utility 
should log to the LDM log file by default.

> ldm      31905     1  0 Apr11 ?        00:00:06 /usr/bin/perl
> /usr/local/ldm/tigge/send
> ldm      31906     1  0 Apr11 ?        00:00:12 rpc.ldmd -P 388 -v -q
> /usr/local/ldm/data/ldm.pq /usr/local/ldm/etc/ldmd.conf
> ldm      31907     1  0 Apr11 ?        00:21:33 rpc.ldmd -P 388 -v -q
> /usr/local/ldm/data/ldm.pq /usr/local/ldm/etc/ldmd.conf
> ldm      32091     1  0 Apr11 ?        00:08:01 rpc.ldmd -P 388 -v -q
> /usr/local/ldm/data/ldm.pq /usr/local/ldm/etc/ldmd.conf
> ldm      32145     1  0 Apr11 ?        00:03:47 rpc.ldmd -P 388 -v -q
> /usr/local/ldm/data/ldm.pq /usr/local/ldm/etc/ldmd.conf
> ldm      32147     1  0 Apr11 ?        00:07:41 rpc.ldmd -P 388 -v -q
> /usr/local/ldm/data/ldm.pq /usr/local/ldm/etc/ldmd.conf

That's odd.  The above indicates that 5 top-level LDM servers are running (a 
top-level LDM server has "1" as its parent process ID, whereas all upstream and 
downstream LDM child processes have the PID of the top-level LDM server as 
their parent).  This should not occur and indicates a serious problem.
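
For what it's worth, this check can be scripted.  The sketch below assumes the 
column layout of the "ps -fu ldm" output you sent (UID PID PPID C STIME TTY 
TIME CMD) with the leading "> " quoting removed; the function name 
top_level_ldm_servers is hypothetical.  It lists the rpc.ldmd processes whose 
parent is PID 1.

```python
def top_level_ldm_servers(ps_output, command="rpc.ldmd"):
    """Return PIDs of `command` processes whose parent PID is 1.

    Expects `ps -f` style lines: UID PID PPID C STIME TTY TIME CMD.
    Header lines and wrapped continuation lines are skipped.
    """
    pids = []
    for line in ps_output.splitlines():
        fields = line.split(None, 7)
        if len(fields) < 8:
            continue  # continuation of a wrapped command line, or blank
        if not (fields[1].isdigit() and fields[2].isdigit()):
            continue  # header line
        pid, ppid = int(fields[1]), int(fields[2])
        program = fields[7].split()[0]
        if ppid == 1 and program.endswith(command):
            pids.append(pid)
    return pids


sample = """\
UID        PID  PPID  C STIME TTY          TIME CMD
ldm      31903     1  0 Apr11 ?        00:04:13 pqact -f ANY -v -l
ldm      31906     1  0 Apr11 ?        00:00:12 rpc.ldmd -P 388 -v -q
/usr/local/ldm/data/ldm.pq /usr/local/ldm/etc/ldmd.conf
ldm      29321 29317  0 Apr05 ?        00:33:23 rpc.ldmd -P 388 -v -m
"""
print(top_level_ldm_servers(sample))
```

A healthy system should yield exactly one PID; more than one matches the 
situation described above.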

> ldm      21139 21137  0 Apr11 ?        00:00:00 sshd: ldm@pts/4
> 
> ldm      21140 21139  0 Apr11 pts/4    00:00:00 -bash
> ldm      18695     1  0 Apr12 ?        00:00:59 rpc.ldmd -P 388 -v -q
> /usr/local/ldm/data/ldm.pq /usr/local/ldm/etc/ldmd.conf
> ldm      22068 22066  0 09:35 ?        00:00:00 sshd: ldm@pts/2
> 
> ldm      22069 22068  0 09:35 pts/2    00:00:00 -bash
> ldm      31904     1  0 Apr11 ?        00:03:13 rtstats -h
> rtstats.unidata.ucar.edu
> ldm       2507 22069  0 19:54 pts/2    00:00:00 ps -fu ldm
> 
> while on tigge-portal:
> UID        PID  PPID  C STIME TTY          TIME CMD
> ldm      29317     1  0 Apr05 ?        00:00:00 rpc.ldmd -P 388 -v -m
> 18000 -q /usr/local/ldm/data/ldm.pq /usr/local/ldm/e
> ldm      29318 29317  3 Apr05 ?        06:43:31 pqact -f EXP -p tigge
> etc/pqact.conf_tigge
> ldm      29321 29317  0 Apr05 ?        00:33:23 rpc.ldmd -P 388 -v -m
> 18000 -q /usr/local/ldm/data/ldm.pq /usr/local/ldm/e
> ldm      29322 29317  0 Apr05 ?        00:33:11 rpc.ldmd -P 388 -v -m
> 18000 -q /usr/local/ldm/data/ldm.pq /usr/local/ldm/e
> ldm      29323 29317  0 Apr05 ?        00:20:27 rpc.ldmd -P 388 -v -m
> 18000 -q /usr/local/ldm/data/ldm.pq /usr/local/ldm/e
> ldm      29325 29317  0 Apr05 ?        00:20:10 rpc.ldmd -P 388 -v -m
> 18000 -q /usr/local/ldm/data/ldm.pq /usr/local/ldm/e
> ldm      29326 29317  0 Apr05 ?        00:00:47 rpc.ldmd -P 388 -v -m
> 18000 -q /usr/local/ldm/data/ldm.pq /usr/local/ldm/e
> ldm      29362 29317  0 Apr05 ?        00:01:38 [rpc.ldmd] <defunct>
> ldm      31349 29317  0 Apr06 ?        00:00:00 [rpc.ldmd] <defunct>
> ldm      31801 31799  0 Apr06 ?        00:00:00 sshd: ldm@pts/0
> 
> ldm      31802 31801  0 Apr06 pts/0    00:00:00 -bash
> ldm      17254 17252  0 Apr10 ?        00:00:00 sshd: ldm@pts/2
> 
> ldm      17255 17254  0 Apr10 pts/2    00:00:00 -bash
> ldm      30953 30951  0 10:14 ?        00:00:00 sshd: ldm@pts/3
> 
> ldm      30954 30953  0 10:14 pts/3    00:00:00 -bash
> ldm      32552 30954  0 19:54 pts/3    00:00:00 ps -fu ldm
> 
> 
> So on tigge-portal we have a master process rpc.ldmd (pid 29317) which
> is the parent of all other rpc.ldmd processes. On tigge-ldm, all
> rpc.ldmd don't show a parent, but init. Is this normal ?

Definitely not!  It might be the cause of your problem -- although I don't see 
exactly how.

Can your netstat(1) show you PIDs?  If so, then use it to discover which of 
the top-level LDM processes on Tigge-ldm are not listening on port 388 and kill 
those processes.  These processes will have PID 1 as their parent PID and will 
be listening on ports other than 388.
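
As a sketch of that comparison: the code below assumes Linux net-tools style 
output from "netstat -tlnp" (run with enough privilege to see the 
"PID/Program name" column).  The function names and the sample text, including 
port 45001, are made up for illustration and are not taken from your system.

```python
def listening_ports_by_pid(netstat_output):
    """Map PID -> set of TCP ports it listens on.

    Assumes `netstat -tlnp` style lines:
    Proto Recv-Q Send-Q Local-Address Foreign-Address State PID/Program
    """
    ports = {}
    for line in netstat_output.splitlines():
        fields = line.split()
        if len(fields) < 7 or fields[5] != "LISTEN" or "/" not in fields[6]:
            continue  # header, non-listening socket, or PID hidden ("-")
        pid_text = fields[6].split("/", 1)[0]
        port_text = fields[3].rsplit(":", 1)[-1]
        if pid_text.isdigit() and port_text.isdigit():
            ports.setdefault(int(pid_text), set()).add(int(port_text))
    return ports


def pids_not_on_port(ports_by_pid, candidate_pids, port=388):
    """Return the candidate PIDs that are not listening on `port`."""
    return [p for p in candidate_pids if port not in ports_by_pid.get(p, set())]


# Hypothetical sample; the second port number is invented.
sample_netstat = """\
Proto Recv-Q Send-Q Local Address    Foreign Address  State   PID/Program name
tcp        0      0 0.0.0.0:388      0.0.0.0:*        LISTEN  31906/rpc.ldmd
tcp        0      0 0.0.0.0:45001    0.0.0.0:*        LISTEN  31907/rpc.ldmd
"""
by_pid = listening_ports_by_pid(sample_netstat)
print(pids_not_on_port(by_pid, [31906, 31907]))
```

Note that a candidate PID with no listening socket at all is also reported, so 
check the list by hand before killing anything.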


Regards,
Steve Emmerson

Ticket Details
===================
Ticket ID: CUA-629523
Department: Support IDD TIGGE
Priority: Normal
Status: On Hold