
20060105: LDM unexpected death and logging problems (cont.)



>From: Ben Cotton <address@hidden>
>Organization: Purdue
>Keywords: 200601051847.k05IlF7s014054 LDM core

Hi Ben,

re: what OS version are you running
>[ldm@weather ~]$ uname -a
>Linux weather.eas.purdue.edu 2.6.9-22.ELhugemem #1 SMP Mon Sep 19 18:43:10 
>EDT 2005 i686 i686 i386 GNU/Linux

OK, thanks.  Is your kernel/OS up to date with respect to patches and
upgrades?  I ask because our Fedora Core 3 Linux machines are running
the 2.6.12-xxx kernel.

>I'm unsure of the hardware specifics.  I do know it's quite a hefty 
>machine...it is the machine we got with the Unidata equipment grant in 
>'05.  I think it has 1GB of RAM...

You can get specifics on the CPU(s) and memory on a Linux machine
as follows:

cat /proc/cpuinfo
cat /proc/meminfo
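
If a compact summary is easier to paste into a reply, the two files can
be boiled down with a small shell function.  This is only a sketch: the
name hw_summary is made up for illustration, and it assumes the usual
Linux layout of one 'processor' line per logical CPU in cpuinfo and a
'MemTotal:' line reported in kB in meminfo.

```shell
# hw_summary [cpuinfo-file] [meminfo-file]
# Hypothetical helper: prints the logical CPU count and total memory.
# The file arguments default to the real /proc files.
hw_summary() {
    cpuinfo=${1:-/proc/cpuinfo}
    meminfo=${2:-/proc/meminfo}

    # One "processor : N" line per logical CPU (assumed layout).
    cpus=$(grep -c '^processor' "$cpuinfo")

    # MemTotal is reported in kB; convert to MB for readability.
    mem_kb=$(awk '/^MemTotal:/ {print $2}' "$meminfo")

    echo "logical CPUs: $cpus"
    echo "total memory: $(( ${mem_kb:-0} / 1024 )) MB"
}

# Usage (as any user):
#   hw_summary
```

Note that the count is of logical CPUs, so a hyperthreaded machine will
report more than the number of physical processors.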

re: what 'interrupt' (signal) is being seen

>I don't know, all I can tell is from the log (attached) entries like:
>Jan 05  06:18:10 pqact[32619] NOTE: Interrupt

OK.  Your log file listing makes it look like pqact is being told
to exit by the lead rpc.ldmd process.  I say this because you are
getting a core dump of rpc.ldmd, and all LDM processes started out
of ~ldm/etc/ldmd.conf belong to the same process group.  When any
one of the processes exits on an error (for example, a segmentation
violation), a signal is sent to the group, and all of the processes exit.
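
One way to see the grouping for yourself is to compare process group
IDs.  The sketch below reads the process group from /proc/PID/stat on
Linux; pgid_of and same_group are hypothetical helper names, and the
field position used is the one described in the proc(5) manual page.

```shell
# pgid_of PID
# Hypothetical helper: prints the process group ID of PID (Linux only).
pgid_of() {
    # /proc/PID/stat looks like: PID (comm) state ppid pgrp ...
    # Strip everything through the last ") " so that a command name
    # containing spaces cannot shift the fields; pgrp is then field 3.
    sed 's/^.*) //' "/proc/$1/stat" | awk '{print $3}'
}

# same_group PID1 PID2
# Succeeds only if both PIDs exist and share one process group.
same_group() {
    a=$(pgid_of "$1")
    b=$(pgid_of "$2")
    [ -n "$a" ] && [ "$a" = "$b" ]
}

# Usage (PIDs are placeholders):
#   same_group 32614 32619 && echo "same process group"
```

For a quick look at every process owned by 'ldm', the command
'ps -o pid,pgid,comm -u ldm' shows the same information in one listing.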

re: what is the result of 'file core.nnnnn'

>[ldm@weather ~]$ file core.18872
>core.18872: ELF 32-bit LSB core file Intel 80386, version 1 (SYSV), 
>SVR4-style, SVR4-style, from 'rpc.ldmd'
>
>[ldm@weather ~]$ file core.32629
>core.32629: ELF 32-bit LSB core file Intel 80386, version 1 (SYSV), 
>SVR4-style, SVR4-style, from 'rpc.ldmd'

This is interesting since dumping of core files from setuid root
programs is turned off by default in Linux.  In order to get the
core file, someone would have had to enable core dumping _if_
rpc.ldmd is, in fact, running with setuid root privilege.
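
Independent of the setuid question, a process can only write a core
file if the core file size limit it inherits is non-zero; the shell
reports that limit with 'ulimit -c'.  A rough check follows.  The helper
name core_dumps_enabled is made up, and the fs.suid_dumpable setting
mentioned in the comment exists on newer 2.6 kernels but may not be
present on 2.6.9, so treat that part as an assumption.

```shell
# core_dumps_enabled
# Hypothetical helper: succeeds if the current soft limit on core file
# size is anything other than 0 (a number of blocks or "unlimited").
core_dumps_enabled() {
    limit=$(ulimit -c)
    [ "$limit" != "0" ]
}

# Usage (run as 'ldm' in the environment that starts the LDM):
#   if core_dumps_enabled; then
#       echo "core files allowed (limit: $(ulimit -c))"
#   else
#       echo "core files disabled by ulimit"
#   fi
#
# On kernels that provide it (assumption), the setuid policy can be read:
#   [ -r /proc/sys/fs/suid_dumpable ] && cat /proc/sys/fs/suid_dumpable
```

Once a core file exists, a stack trace can usually be pulled from it
with 'gdb ~ldm/bin/rpc.ldmd core.18872' followed by 'bt' at the gdb
prompt.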

So, the question is whether the LDM was installed so that rpc.ldmd and
hupsyslog have setuid root privilege.  Check this with:

<as 'ldm'>
cd ~ldm
ls -alt bin/rpc.ldmd
ls -alt bin/hupsyslog
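
In the 'ls' output, a setuid root program shows an 's' in the owner
execute position (e.g., -rwsr-xr-x) with 'root' as the owner.  If you
would rather have the shell say so directly, something like the
following sketch should work; has_setuid is a made-up helper name, not
part of the LDM.

```shell
# has_setuid FILE
# Hypothetical helper: reports whether FILE has the setuid bit set and
# who owns it.  Succeeds only when the setuid bit is set.
has_setuid() {
    f=$1
    if [ ! -e "$f" ]; then
        echo "$f: not found"
        return 2
    fi

    # The third column of 'ls -ld' is the owner name.
    owner=$(ls -ld "$f" | awk '{print $3}')

    if [ -u "$f" ]; then
        echo "$f: setuid bit set, owner $owner"
    else
        echo "$f: no setuid bit, owner $owner"
        return 1
    fi
}

# Usage (as 'ldm'):
#   has_setuid ~ldm/bin/rpc.ldmd
#   has_setuid ~ldm/bin/hupsyslog
```

The program runs with root privilege only when the bit is set and the
reported owner is 'root'.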

re: your other machine is not showing the exit problem; is it
running the same version of the LDM

>No, wxp.eas is running an older Linux kernel...
>
>Linux wxp.eas.purdue.edu 2.4.22-1.2199.nptl #1 Wed Aug 4 12:21:48 EDT 2004 
>i686 i686 i386 GNU/Linux

OK.  Since the machines are running different OS versions, comparing
them (i.e., noting that the other machine has never shown the exiting
problem) is not very useful.

There were some important updates in the 2.6 Linux kernel after the
2.6.9 version.  It may be useful for you to investigate upgrading
your OS kernel if one is available (you are running RedHat Enterprise,
correct?).

>Thanks,

No worries.

For Steve:

Purdue did not turn off SELINUX, so they are not logging to ~ldm/etc/ldmd.log
in the usual way.  Ben's original email reported:

>My LDM 6.4.2 build on weather.eas.purdue.edu has developed the nasty habit
>of dying unexpectedly.  There's been no pattern that I've been able to
>determine, except that it generally happens overnight in order to make sure
>I don't catch it for hours.  I've asked our department computing support
>staff to check the system logs for anything that might be a trigger, since
>the ldmd.log contains very little information...

>(and in a bit of extra fun, for some reason
>after I manually rotated the logs - cron isn't working properly for some
>reason, long story - the new ldmd.log file remained empty while entries
>were being written to ldmd.log-1 ).

>A core dump appears
>in ~ldm at the same time as the LDM dies, and I assume the two are
>related, but I don't know how to do anything with core files.

>Our other
>machine, wxp.eas.purdue.edu, is running 6.4.1 (although I'm building
>6.4.4 on both this afternoon) and has never had this problem.

>I'm also noticing what seems like a lack of information in the logs.
>The only messages that are being written are the WARNs that a write to
>pipe took x number of seconds.  I've checked /etc/syslog.conf ,
>~/etc/ldmadmin-pl.conf and the pqact entries in ~/etc/ldmd.conf and
>everything points to /var/log/ldm/ldmd.log .  We put the logs there
>instead of ~/logs (which I set as a symlink to /var/log/ldm ) to skirt the
>SELINUX issue.

Here is the output from Ben's ldmd.log.1 file:
Jan 03 19:46:41 pqact[32614] NOTE: Starting Up
Jan 03 19:46:41 pqact[32615] NOTE: Starting Up
Jan 03 19:46:41 pqact[32616] NOTE: Starting Up
Jan 03 19:46:41 pqact[32617] NOTE: Starting Up
Jan 03 19:46:41 pqact[32618] NOTE: Starting Up
Jan 03 19:46:41 pqact[32619] NOTE: Starting Up
Jan 03 19:51:34 pqact[32619] WARN: write(11,,4096) to pipe took 12.455922 s
Jan 03 19:54:06 pqact[32616] WARN: write(6,,4096) to pipe took 2.113214 s
Jan 03 19:57:45 pqact[32616] WARN: write(6,,4096) to pipe took 2.708293 s
Jan 03 19:57:52 pqact[32619] WARN: write(17,,4096) to pipe took 4.661930 s
Jan 03 20:00:55 pqact[32619] WARN: write(7,,4096) to pipe took 10.178098 s
 ...
Jan 04 23:10:17 pqact[32619] WARN: write(8,,4096) to pipe took 5.363954 s
Jan 05 02:10:53 pqact[32619] WARN: write(15,,4096) to pipe took 2.056173 s
Jan 05 06:18:10 pqact[32614] NOTE: Interrupt
Jan 05 06:18:10 pqact[32616] NOTE: Interrupt
Jan 05 06:18:10 pqact[32614] NOTE: Exiting
Jan 05 06:18:10 pqact[32616] NOTE: Exiting
Jan 05 06:18:10 pqact[32615] NOTE: Interrupt
Jan 05 06:18:10 pqact[32615] NOTE: Exiting
Jan 05 06:18:10 pqact[32618] NOTE: Interrupt
Jan 05 06:18:10 pqact[32618] NOTE: Exiting
Jan 05 06:18:10 pqact[32617] NOTE: Interrupt
Jan 05 06:18:10 pqact[32617] NOTE: Exiting
Jan 05 06:18:10 pqact[32619] NOTE: Interrupt
Jan 05 06:18:10 pqact[32619] NOTE: Exiting
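
A listing like this can be condensed per pqact process with a short awk
script.  The sketch below is not an LDM tool; pqact_log_summary is a
made-up name, and the field positions assume the exact line layout
shown above (month, day, time, process[pid], level, message).

```shell
# pqact_log_summary LOGFILE
# Hypothetical helper: for each pqact process, counts the slow pipe
# write warnings, reports the longest one, and notes when the process
# logged an Interrupt.
pqact_log_summary() {
    awk '
    # e.g. "... pqact[32619] WARN: write(11,,4096) to pipe took 12.455922 s"
    $5 == "WARN:" && $9 == "took" {
        n[$4]++
        if ($10 + 0 > max[$4]) max[$4] = $10 + 0
    }
    # e.g. "... pqact[32619] NOTE: Interrupt"
    $5 == "NOTE:" && $6 == "Interrupt" { intr[$4] = $3 }
    END {
        for (p in n)
            printf "%s slow_writes=%d max=%.3f s\n", p, n[p], max[p]
        for (p in intr)
            printf "%s interrupted at %s\n", p, intr[p]
    }' "$1" | sort
}

# Usage (the path is whatever file syslog is actually writing to):
#   pqact_log_summary /var/log/ldm/ldmd.log
```

All processes being interrupted within the same second is what one
would expect if the signal was delivered to the whole process group.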

Cheers,

Tom
--
NOTE: All email exchanges with Unidata User Support are recorded in the
Unidata inquiry tracking system and then made publicly available
through the web.  If you do not want to have your interactions made
available in this way, you must let us know in each email you send to us.