
20041007: LDM situation on bigbird (cont.)



>From: Gerry Creager n5jxs <address@hidden>
>Organization: AATLT, Texas A&M University
>Keywords: 200410010304.i9134kUE020346 LDM bigbird hardware

Hi Gerry,

>Your description of the scenario is consistent in timing, but I was 
>seeing from the logs that a number of processes had exited abnormally, 
>and a quick 'top' showed nothing running.

I also noticed all of the abnormal terminations in bigbird's LDM log
file, but I focused on the SIGTERM signal reported by the lead rpc.ldmd
process.  The only way a SIGTERM can be reported is if one shuts down
the LDM.

>So, I executed a 'stop' and 
>'start' and data started flowing again.  Serendipitous perhaps... but 
>the absence of running processes in top suggested it was hosed up again.

OK.  This explains the SIGTERM entry in the log file.

>I'll continue to watch this and also see about getting one of my 
>students to research large file support in FC2.
>I'll keep you posted.

My gut feeling at the moment is that bigbird has some sort of
hardware problem.  The reason I say this is that yesterday at noon I
rebuilt the LDM with large file support on the test machine in my
office (a dual 500 MHz PIII running the most recent 32-bit FC2 kernel,
2.6.8).  I then split its feed requests to match those on bigbird and
set up 3 feeds off of the machine to another box here in the UPC.  This
machine is also processing all data except CONDUIT and CRAFT (I didn't
set up enough disk space for these) with no errors/hiccups/complaints.
I must point out that this machine differs from bigbird in several
fundamental ways:

- it is running the latest FC2 kernel without any serious errors
- it does not have a RAID (it has a single 250 GB hard disk)
- it only has 1 GB of RAM
- its processors are not hyperthreaded

Another reason I suspect that bigbird has a hardware problem
is your comment that you had show-stopping problems when trying to run
the latest FC2 kernel.  We see some APIC errors in /var/log/messages,
but not as frequently as you do.  Here is a listing of all APIC errors
seen today:

Oct  7 00:02:03 dhcp9 kernel: APIC error on CPU0: 40(40)
Oct  7 00:38:23 dhcp9 kernel: APIC error on CPU0: 40(40)
Oct  7 00:52:52 dhcp9 kernel: APIC error on CPU0: 40(40)
Oct  7 00:55:12 dhcp9 kernel: APIC error on CPU0: 40(40)
Oct  7 00:57:12 dhcp9 kernel: APIC error on CPU0: 40(40)
Oct  7 00:58:32 dhcp9 kernel: APIC error on CPU0: 40(40)
Oct  7 01:00:42 dhcp9 kernel: APIC error on CPU0: 40(40)
Oct  7 01:01:52 dhcp9 kernel: APIC error on CPU0: 40(40)
Oct  7 01:02:42 dhcp9 kernel: APIC error on CPU0: 40(40)
Oct  7 01:34:32 dhcp9 kernel: APIC error on CPU0: 40(40)
Oct  7 01:34:52 dhcp9 kernel: APIC error on CPU0: 40(40)
Oct  7 02:29:41 dhcp9 kernel: APIC error on CPU0: 40(40)
Oct  7 02:58:20 dhcp9 kernel: APIC error on CPU0: 40(40)
Oct  7 03:02:40 dhcp9 kernel: APIC error on CPU0: 40(40)
Oct  7 03:04:00 dhcp9 kernel: APIC error on CPU0: 40(40)
Oct  7 03:11:20 dhcp9 kernel: APIC error on CPU0: 40(40)
Oct  7 03:24:00 dhcp9 kernel: APIC error on CPU0: 40(40)
Oct  7 03:34:40 dhcp9 kernel: APIC error on CPU0: 40(40)
Oct  7 03:39:40 dhcp9 kernel: APIC error on CPU0: 40(40)
Oct  7 03:49:00 dhcp9 kernel: APIC error on CPU0: 40(40)
Oct  7 03:55:20 dhcp9 kernel: APIC error on CPU0: 40(40)
Oct  7 04:13:19 dhcp9 kernel: APIC error on CPU0: 40(40)
Oct  7 04:15:19 dhcp9 kernel: APIC error on CPU0: 40(40)
Oct  7 04:44:29 dhcp9 kernel: APIC error on CPU0: 40(40)
Oct  7 05:40:08 dhcp9 kernel: APIC error on CPU0: 40(40)
Oct  7 06:14:48 dhcp9 kernel: APIC error on CPU0: 40(40)
Oct  7 06:26:27 dhcp9 kernel: APIC error on CPU0: 40(40)
Oct  7 06:30:47 dhcp9 kernel: APIC error on CPU0: 40(40)
Oct  7 06:49:57 dhcp9 kernel: APIC error on CPU0: 40(40)
Oct  7 07:13:37 dhcp9 kernel: APIC error on CPU0: 40(40)
Oct  7 07:37:16 dhcp9 kernel: APIC error on CPU0: 40(40)
Oct  7 07:51:16 dhcp9 kernel: APIC error on CPU0: 40(40)
Oct  7 08:34:25 dhcp9 kernel: APIC error on CPU0: 40(40)
Oct  7 08:46:25 dhcp9 kernel: APIC error on CPU0: 40(40)
Oct  7 08:51:05 dhcp9 kernel: APIC error on CPU0: 40(40)
Oct  7 08:52:55 dhcp9 kernel: APIC error on CPU0: 40(40)
Oct  7 08:58:55 dhcp9 kernel: APIC error on CPU0: 40(40)
Oct  7 09:00:05 dhcp9 kernel: APIC error on CPU0: 40(40)

None of these has caused any problems on the machine.
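
If you want to compare how often these errors show up on bigbird
versus our machine, a tally per hour is easier to read than the raw
listing.  The following is only a rough sketch: the function names
count_apic_errors_by_hour and count_apic_errors_in_file are made up
for illustration, and the pattern assumes the log lines look exactly
like the ones listed above (standard syslog timestamp, hostname, then
"kernel: APIC error on CPU<n>:").

```python
import re
from collections import Counter

# Assumed line format, taken from the listing above:
#   Oct  7 00:02:03 dhcp9 kernel: APIC error on CPU0: 40(40)
APIC_LINE = re.compile(
    r"^(?P<month>[A-Z][a-z]{2})\s+(?P<day>\d{1,2})\s+"
    r"(?P<hour>\d{2}):\d{2}:\d{2}\s+\S+\s+kernel:\s+APIC error on CPU\d+:"
)


def count_apic_errors_by_hour(lines):
    """Return a Counter keyed by (month, day, hour) for APIC error lines.

    Lines that do not match the assumed format are ignored.
    """
    counts = Counter()
    for line in lines:
        match = APIC_LINE.match(line)
        if match:
            key = (match.group("month"),
                   int(match.group("day")),
                   int(match.group("hour")))
            counts[key] += 1
    return counts


def count_apic_errors_in_file(path="/var/log/messages"):
    """Hypothetical helper: tally APIC errors in a syslog file."""
    # errors="replace" keeps stray non-UTF-8 bytes from aborting the scan.
    with open(path, errors="replace") as log:
        return count_apic_errors_by_hour(log)
```

Calling count_apic_errors_in_file() as a user who can read
/var/log/messages and printing the items of the result gives one count
per hour, in the order the hours first appear in the log.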

So, where to now?  I hate to say it, but it looks like bigbird may need
some hardware doctoring.

Cheers,

Tom
--
NOTE: All email exchanges with Unidata User Support are recorded in the
Unidata inquiry tracking system and then made publicly available
through the web.  If you do not want to have your interactions made
available in this way, you must let us know in each email you send to us.