
20030519: NLDN inject machine problems after upgrading to LDM-6



Kevin,

> To: Unidata Support <address@hidden>
> cc: David Knight <address@hidden>,
> cc: address@hidden,
> cc: address@hidden,
> cc: Tom McDermott <address@hidden>
> From: "Kevin R. Tyle" <address@hidden>
> Subject: Re: 20030519: NLDN inject machine problems after upgrading to LDM-6?
> Organization: UCAR/Unidata
> Keywords: 200305191919.h4JJJgLd028120

The above message contained the following:

> This is the 3rd time we've had to restart the ldm since upgrading
> to LDM6.

I'm surprised and a bit alarmed.  We haven't seen this behavior anywhere
else -- but Striker's situation is somewhat unusual.  I'd like to get to
the bottom of the problem as quickly as possible.

...
> We're still collecting data to diagnose the problem.  The first
> two times, we saw messages saying "Too many open files" in the
> logs.

To how many downstream LDM-s does Striker send the NLDN data?  Do they
request the data or does Striker initiate the connection?

> This time, such messages did not appear.  However, we did
> see these messages in the ldm log file, beginning about 9
> hours prior to the outage:
> 
> May 17 08:21:57 striker rpc.ldmd[29795]: fork: Not enough space

The above message means that the LDM couldn't fork itself (in order to
respond to an incoming request to the LDM server or to request data from
an upstream LDM).  This was probably due to a shortage of swap space.
How much swap space does Striker have?

> The messages resumed again at 17Z
> 
> May 17 17:16:45 striker rpc.ldmd[29795]: fork: Not enough space
> 
> Shortly after that time, the # of active rpc.ldmd processes dropped
> from 56 to under 20.  These messages popped up:
> 
> May 17 19:31:52 striker updraft(feed)[10782]: h_clnt_call:
> updraft.db.erau.edu: BLKDATA: time elapsed  22.311053

The above message means that Striker took over 22 seconds to send a data
packet to the requesting LDM-5 on Updraft.  It's not a serious error, but
it does indicate a problem with the connection.

> Some more could not fork messages appeared,

Those are very bad.

> and then a bunch of gethostbyaddr failures appear.
> 
> May 17 20:10:50 striker rpc.ldmd[29795]: fork: Not enough space
> May 17 20:10:51 striker last message repeated 1 time
> May 17 20:10:52 striker rpc.ldmd[29795]: gethostbyaddr: failed for
> 128.174.80.47
> May 17 20:10:52 striker rpc.ldmd[29795]: Denying connection from
> 128.174.80.47
> May 17 20:10:53 striker rpc.ldmd[29795]: gethostbyaddr: failed for
> 128.118.28.12
> May 17 20:10:53 striker rpc.ldmd[29795]: Denying connection from
> 128.118.28.12
> May 17 20:10:54 striker rpc.ldmd[29795]: gethostbyaddr: failed for
> 130.253.215.243
> May 17 20:10:54 striker rpc.ldmd[29795]: Denying connection from
> 130.253.215.243
> May 17 20:10:54 striker rpc.ldmd[29795]: gethostbyaddr: failed for
> 155.42.21.33
> May 17 20:10:54 striker rpc.ldmd[29795]: Denying connection from
> 155.42.21.33
> May 17 20:10:56 striker rpc.ldmd[29795]: gethostbyaddr: failed for
> 128.196.30.175
> May 17 20:10:56 striker rpc.ldmd[29795]: Denying connection from
> 128.196.30.175
> May 17 20:10:56 striker rpc.ldmd[29795]: gethostbyaddr: failed for
> 129.93.52.150
> May 17 20:10:56 striker rpc.ldmd[29795]: Denying connection from
> 129.93.52.150
> May 17 20:10:57 striker rpc.ldmd[29795]: gethostbyaddr: failed for
> 166.66.44.84
> May 17 20:10:57 striker rpc.ldmd[29795]: Denying connection from
> 166.66.44.84

Striker's inability to resolve the above IP addresses means that the
hosts' names won't appear in log messages.  This is unfortunate but not
fatal.
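
To illustrate the point about names in log messages, here is a minimal
sketch of a reverse lookup that falls back to the bare IP address when
resolution fails.  It is not the LDM's code; the function name
peer_label and the injectable resolver argument are assumptions made
for this example only.

```python
import socket


def peer_label(ip, resolver=socket.gethostbyaddr):
    """Return a host name for `ip`, or the bare IP if reverse lookup fails.

    `resolver` is injectable (hypothetical parameter) so the fallback can
    be exercised without touching the real DNS.
    """
    try:
        hostname, _aliases, _addresses = resolver(ip)
        return hostname
    except OSError:
        # socket.herror and socket.gaierror are both subclasses of OSError.
        return ip
```

With a resolver that cannot map the address, the label used for logging
is just the dotted-quad string, which is what a log reader would see in
place of the host name.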

> We've seen the gethostbyaddr failures and fork errors each
> time the LDM had problems.

The fork(2) errors are very bad.
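
Since the fork and gethostbyaddr messages show up together each time,
it may help to count them per hour and line the counts up against the
outage times.  The following is a rough sketch that assumes the log
lines have the syslog layout quoted above, with each entry on a single
line (the quoted excerpts are wrapped by the mailer).  The names
LINE_RE, PATTERNS and tally are made up for this example.

```python
import re
from collections import Counter

# Matches lines such as:
#   May 17 08:21:57 striker rpc.ldmd[29795]: fork: Not enough space
LINE_RE = re.compile(
    r"^(?P<mon>\w{3})\s+(?P<day>\d+)\s+(?P<hour>\d{2}):\d{2}:\d{2}\s+"
    r"\S+\s+(?P<prog>[^\[:]+)\[\d+\]:\s+(?P<msg>.*)$"
)

# Message fragments taken from the log excerpts quoted in this thread.
PATTERNS = {
    "fork_failed": "fork: Not enough space",
    "reverse_lookup_failed": "gethostbyaddr: failed",
    "too_many_open_files": "Too many open files",
}


def tally(lines):
    """Count matching messages per (month, day, hour, label)."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            # Lines without a "program[pid]:" part, such as syslog's
            # "last message repeated N times", are skipped, so the
            # counts are a lower bound.
            continue
        for label, needle in PATTERNS.items():
            if needle in match.group("msg"):
                key = (match.group("mon"), int(match.group("day")),
                       int(match.group("hour")), label)
                counts[key] += 1
    return counts
```

Feeding it the open log file, for example tally(open(path)) with path
pointing at the LDM log, yields a Counter that can be sorted and
printed hour by hour.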

> We're now logging the CPU load, real and virtual memory use, and
> # of open files every 5 minutes.

Excellent!
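
For comparison, a periodic snapshot of this kind could look roughly
like the sketch below.  It is not the script used on Striker.  It
records the load averages and the open-file limits, and it counts open
descriptors only where the platform exposes a /proc/<pid>/fd directory.
Memory use is left out because the way to read it differs between
systems.  All function names are assumptions for this example.

```python
import os
import resource
import time


def count_open_fds(pid="self"):
    """Return the number of open descriptors for `pid`, or None if the
    platform has no readable /proc/<pid>/fd directory."""
    try:
        return len(os.listdir(f"/proc/{pid}/fd"))
    except OSError:
        return None


def snapshot():
    """Collect one sample of load averages and open-file figures."""
    try:
        load1, load5, load15 = os.getloadavg()
    except (OSError, AttributeError):
        load1 = load5 = load15 = None
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return {
        "time": int(time.time()),
        "load1": load1,
        "load5": load5,
        "load15": load15,
        "nofile_soft": soft,
        "nofile_hard": hard,
        "open_fds": count_open_fds(),
    }


def format_record(snap):
    """Render a sample as one 'key=value ...' log line."""
    return " ".join(f"{key}={value}" for key, value in snap.items())
```

Running print(format_record(snapshot())) from cron every five minutes,
or in a loop with time.sleep(300), gives one line per sample.  To watch
a specific rpc.ldmd process, pass its PID to count_open_fds().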

> Our ingest program was modified to reflect changes in pqinsert.c
> from version 5 to version 6--it is indeed possible this is the
> source of the problem.  I'm going to take a closer look at this.

You might also try rebuilding the LDM package with debugging turned on:
do a "make distclean", set the environment variable CFLAGS to "-g", and
then run the configure script.  If you then set the corefile size to
"unlimited" before running the LDM, a crash should leave a core file
that may give us some useful information.
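
The corefile-size step can also be done programmatically by a small
wrapper that raises the limit and then starts the program.  The sketch
below is an illustration only; it raises the soft limit to whatever the
hard limit allows, which equals "unlimited" only if the hard limit is
itself unlimited.  The function names and the idea of a Python wrapper
are assumptions, not part of the LDM package.

```python
import os
import resource


def raise_core_limit():
    """Raise the soft core-file size limit to the hard limit.

    Returns the (soft, hard) pair in effect afterwards.
    """
    _soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_CORE)


def run_with_core_dumps(argv):
    """Raise the core limit, then replace this process with `argv`.

    `argv` is whatever command starts the LDM at your site
    (hypothetical here); the limit is inherited across exec.
    """
    raise_core_limit()
    os.execvp(argv[0], argv)
```

The limit is inherited by child processes, so it has to be raised in
the process (or shell) that actually launches the LDM.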

> We have not made any changes to the open file limits at the system
> level.
> 
> More info to follow . . .
> 
> --Kevin

Regards,
Steve Emmerson