
Re: 20041007: LDM situation on bigbird (cont.)



Ah... Hardware problems... Or, maybe, software problems.

I've been thinking about this, and I'm going to try something on another box later today. I am beginning to believe that FC2, out of the box and perhaps in all its incarnations, has an SMP bug. I can't put my finger on it, but I've seen APIC problems reported on Bigbird and Page4 (another SMP box) with FC2. I'm going to load FC1 on Page4 today. (It's been down hard for a long time with a combination of hardware problems, starting when the onboard SCSI controller dumped and corrupted my boot sector. The first 2 Dell hardware visits brought partial success each; the 3rd resulted in effectively a whole new machine.) Still seeing interrupt controller problems, though... Hmmm.

If that's the case, we can salvage the LDM home directory and, of course, the RAID, drop back to FC1 (or, if needed, RH9), and go ahead. Option B is to place SuSE 9.1 Pro on it; I've got the CDs on my desk for an AMD64 install...

Tom, there's another issue on Bigbird that's user-engendered which I'd like to talk to you about later today, if possible. Cellphone or office...

Thanks, Gerry

Unidata Support wrote:
From: Gerry Creager n5jxs <address@hidden>
Organization: AATLT, Texas A&M University
Keywords: 200410010304.i9134kUE020346 LDM bigbird hardware


Hi Gerry,


Your description of the scenario is consistent in timing, but I was seeing from the logs that a number of processes had exited abnormally, and a quick 'top' showed nothing running.


I was noticing all of the abnormal terminations in bigbird's LDM log file
also, but I focused on the SIGTERM signal reported by the lead rpc.ldmd
process.  The only way a SIGTERM can be reported is if one shuts down
the LDM.


So, I executed a 'stop' and 'start' and data started flowing again. Serendipitous perhaps... but the absence of running processes in top suggested it was hosed up again.


OK.  This explains the SIGTERM entry in the log file.
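
The "nothing running in top" check discussed above could be scripted roughly as follows. This is only a sketch under assumptions: it counts processes whose command name is exactly rpc.ldmd (the name taken from this thread) in POSIX-style `ps` output; the helper names count_processes and running_ldm_servers are hypothetical and not part of the LDM.

```python
import subprocess

def count_processes(ps_output, name):
    """Count lines of 'PID COMMAND' text whose command equals `name`."""
    count = 0
    for line in ps_output.splitlines():
        fields = line.split(None, 1)          # -> [pid, command]
        if len(fields) == 2 and fields[1].strip() == name:
            count += 1
    return count

def running_ldm_servers(name="rpc.ldmd"):
    """Ask ps for every process and count those named `name`.

    Each column gets its own -o option so that both headers are suppressed.
    """
    out = subprocess.run(
        ["ps", "-e", "-o", "pid=", "-o", "comm="],
        capture_output=True, text=True, check=True,
    ).stdout
    return count_processes(out, name)
```

A result of 0 from running_ldm_servers() would match the situation described above, where the LDM had to be stopped and started again.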


I'll continue to watch this and also see about getting one of my students to research large file support in FC2.
I'll keep you posted.


My gut feeling at the moment is that bigbird has some sort of a
hardware problem.  The reason I say this is that yesterday at noon I
rebuilt the LDM with large file support on the test machine in my
office (a dual 500 MHz PIII running the most recent 32-bit FC2 kernel,
2.6.8).  I then split its feed requests to match those on bigbird and
set up 3 feeds off of the machine to another box here in the UPC.  This
machine is also processing all data except CONDUIT and CRAFT (I didn't
set up enough disk space for these) with no errors/hiccups/complaints.
I must point out that this machine differs from bigbird in several
fundamental ways:

- it is running the latest FC2 kernel without any serious errors
- it does not have a RAID (it has a single 250 GB hard disk)
- it only has 1 GB of RAM
- its processors are not hyperthreaded

Another reason that I suspect that bigbird has a hardware problem
is your comment that you had show-stopping problems when trying to run
the latest FC2 kernel.  We see some APIC errors in /var/log/messages
on our test machine, but not as frequently as you.  Here is a listing
of all APIC errors seen for today:

Oct  7 00:02:03 dhcp9 kernel: APIC error on CPU0: 40(40)
Oct  7 00:38:23 dhcp9 kernel: APIC error on CPU0: 40(40)
Oct  7 00:52:52 dhcp9 kernel: APIC error on CPU0: 40(40)
Oct  7 00:55:12 dhcp9 kernel: APIC error on CPU0: 40(40)
Oct  7 00:57:12 dhcp9 kernel: APIC error on CPU0: 40(40)
Oct  7 00:58:32 dhcp9 kernel: APIC error on CPU0: 40(40)
Oct  7 01:00:42 dhcp9 kernel: APIC error on CPU0: 40(40)
Oct  7 01:01:52 dhcp9 kernel: APIC error on CPU0: 40(40)
Oct  7 01:02:42 dhcp9 kernel: APIC error on CPU0: 40(40)
Oct  7 01:34:32 dhcp9 kernel: APIC error on CPU0: 40(40)
Oct  7 01:34:52 dhcp9 kernel: APIC error on CPU0: 40(40)
Oct  7 02:29:41 dhcp9 kernel: APIC error on CPU0: 40(40)
Oct  7 02:58:20 dhcp9 kernel: APIC error on CPU0: 40(40)
Oct  7 03:02:40 dhcp9 kernel: APIC error on CPU0: 40(40)
Oct  7 03:04:00 dhcp9 kernel: APIC error on CPU0: 40(40)
Oct  7 03:11:20 dhcp9 kernel: APIC error on CPU0: 40(40)
Oct  7 03:24:00 dhcp9 kernel: APIC error on CPU0: 40(40)
Oct  7 03:34:40 dhcp9 kernel: APIC error on CPU0: 40(40)
Oct  7 03:39:40 dhcp9 kernel: APIC error on CPU0: 40(40)
Oct  7 03:49:00 dhcp9 kernel: APIC error on CPU0: 40(40)
Oct  7 03:55:20 dhcp9 kernel: APIC error on CPU0: 40(40)
Oct  7 04:13:19 dhcp9 kernel: APIC error on CPU0: 40(40)
Oct  7 04:15:19 dhcp9 kernel: APIC error on CPU0: 40(40)
Oct  7 04:44:29 dhcp9 kernel: APIC error on CPU0: 40(40)
Oct  7 05:40:08 dhcp9 kernel: APIC error on CPU0: 40(40)
Oct  7 06:14:48 dhcp9 kernel: APIC error on CPU0: 40(40)
Oct  7 06:26:27 dhcp9 kernel: APIC error on CPU0: 40(40)
Oct  7 06:30:47 dhcp9 kernel: APIC error on CPU0: 40(40)
Oct  7 06:49:57 dhcp9 kernel: APIC error on CPU0: 40(40)
Oct  7 07:13:37 dhcp9 kernel: APIC error on CPU0: 40(40)
Oct  7 07:37:16 dhcp9 kernel: APIC error on CPU0: 40(40)
Oct  7 07:51:16 dhcp9 kernel: APIC error on CPU0: 40(40)
Oct  7 08:34:25 dhcp9 kernel: APIC error on CPU0: 40(40)
Oct  7 08:46:25 dhcp9 kernel: APIC error on CPU0: 40(40)
Oct  7 08:51:05 dhcp9 kernel: APIC error on CPU0: 40(40)
Oct  7 08:52:55 dhcp9 kernel: APIC error on CPU0: 40(40)
Oct  7 08:58:55 dhcp9 kernel: APIC error on CPU0: 40(40)
Oct  7 09:00:05 dhcp9 kernel: APIC error on CPU0: 40(40)

None of these has caused any problems on the machine.
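
To compare error frequency between two machines, a listing like the one above can be tallied per hour. The following is a minimal sketch, assuming syslog lines in exactly the format shown; the function name apic_errors_per_hour is hypothetical, and the last sample line is an invented non-APIC entry included only to show that unrelated lines are skipped.

```python
import re
from collections import Counter

# Matches syslog lines such as:
#   Oct  7 00:02:03 dhcp9 kernel: APIC error on CPU0: 40(40)
APIC_RE = re.compile(
    r"^(?P<mon>\w{3})\s+(?P<day>\d{1,2})\s+(?P<hour>\d{2}):\d{2}:\d{2}\s+"
    r"\S+\s+kernel:\s+APIC error on CPU\d+:"
)

def apic_errors_per_hour(lines):
    """Return a Counter mapping (month, day, hour) to the APIC error count."""
    counts = Counter()
    for line in lines:
        m = APIC_RE.match(line)
        if m:
            counts[(m["mon"], int(m["day"]), int(m["hour"]))] += 1
    return counts

sample = [
    "Oct  7 00:02:03 dhcp9 kernel: APIC error on CPU0: 40(40)",
    "Oct  7 00:38:23 dhcp9 kernel: APIC error on CPU0: 40(40)",
    "Oct  7 01:00:42 dhcp9 kernel: APIC error on CPU0: 40(40)",
    "Oct  7 01:00:43 dhcp9 kernel: some unrelated message",  # invented line
]

for (mon, day, hour), n in sorted(apic_errors_per_hour(sample).items()):
    print(f"{mon} {day:2d} {hour:02d}:00  {n}")
```

Feeding it the lines of /var/log/messages from each machine would give per-hour counts that can be compared side by side.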

So, where to now?  I hate to say it, but it looks like bigbird may need
some hardware doctoring.

Cheers,

Tom
--
****************************************************************************
Unidata User Support                                    UCAR Unidata Program
(303)497-8643                                                  P.O. Box 3000
address@hidden                                   Boulder, CO 80307
----------------------------------------------------------------------------
Unidata WWW Service              http://my.unidata.ucar.edu/content/support
----------------------------------------------------------------------------
NOTE: All email exchanges with Unidata User Support are recorded in the
Unidata inquiry tracking system and then made publicly available
through the web.  If you do not want to have your interactions made
available in this way, you must let us know in each email you send to us.

--
Gerry Creager -- address@hidden
Network Engineering -- AATLT, Texas A&M University  
Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.847.8578
Page: 979.228.0173
Office: 903A Eller Bldg, TAMU, College Station, TX 77843

From address@hidden  Fri Oct  8 08:07:43 2004
Return-Path: <address@hidden>
Received: from smtp-relay.tamu.edu (smtp-relay.tamu.edu [165.91.143.199])
        by unidata.ucar.edu (UCAR/Unidata) with ESMTP id i98E7gUE026326;
        Fri, 8 Oct 2004 08:07:42 -0600 (MDT)
Keywords: 200410081407.i98E7gUE026326
Received: from [192.168.1.50] (n5jxs.dsl.tamu.edu [165.91.15.31])
        (authenticated bits=0)
        by smtp-relay.tamu.edu (8.12.10/8.12.10) with ESMTP id i98E7bvT025618
        (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO);
        Fri, 8 Oct 2004 09:07:38 -0500 (CDT)
Message-ID: <address@hidden>
Date: Fri, 08 Oct 2004 09:07:36 -0500
From: Gerry Creager N5JXS <address@hidden>
Organization: Texas A&M University -- AATLT
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.2) Gecko/20040803
X-Accept-Language: en-us, en
MIME-Version: 1.0
To: Steve Emmerson <address@hidden>
CC: Unidata Support <address@hidden>, address@hidden
Subject: Re: 20041007: LDM situation on bigbird (cont.)
References: <address@hidden>
In-Reply-To: <address@hidden>
Content-Type: text/plain; charset=us-ascii; format=flowed
Content-Transfer-Encoding: 7bit
X-Spam-Checker-Version: SpamAssassin 2.63 (2004-01-11) on laraine.unidata.ucar.edu
X-Spam-Level:
X-Spam-Status: No, hits=-4.9 required=5.0 tests=AWL,BAYES_00 autolearn=ham version=2.63

And therein's a real interesting story. I get a kernel panic on bigbird, based on an APIC violation, when I try to boot the SMP kernel for 2.6.8. Not with the UP (uniprocessor) kernel, which makes sense: the APIC code is unique to the SMP code...
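
When comparing boots, it helps to confirm which flavor of kernel actually came up. The sketch below assumes the kernel build string (what `uname -v` prints) carries an "SMP" token for SMP builds, as Linux kernels built with SMP support do; the helper names is_smp_build and describe_running_kernel are hypothetical.

```python
import platform

def is_smp_build(version_string):
    """Return True if a `uname -v` style string advertises an SMP build."""
    return "SMP" in version_string.split()

def describe_running_kernel():
    """Return (release, smp_flag) for the kernel this script is running on."""
    return platform.release(), is_smp_build(platform.version())
```

Calling describe_running_kernel() after each reboot gives the release string and whether the running build reports itself as SMP.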

I'll try FC1 and see if there's something going on there. I'm also looking at the kernel mailing lists for clues.

gerry

Steve Emmerson wrote:
Tom & Gerry,


Date: Thu, 07 Oct 2004 10:07:30 -0600
From: Unidata Support <address@hidden>
Organization: UCAR/Unidata
To: Gerry Creager n5jxs <address@hidden>
Subject: 20041007: LDM situation on bigbird (cont.)


The above message contained the following:


My gut feeling at the moment is that bigbird has some sort of a
hardware problem.  ...


Another reason that I suspect that bigbird has a hardware problem
is your comment that you had show stopping problems when trying to run
the latest FC2 kernel.  ...


I wouldn't rule out the possibility of a software bug in the
operating system.  We're running version 2.6.8 of the kernel while
Bigbird's running version 2.6.5.  There might have been a problem with
(for example) file-locking or multiprocessor interrupts in the earlier
version that's been fixed in the later one.

Regards,
Steve Emmerson


From address@hidden  Fri Oct  8 08:14:48 2004
Return-Path: <address@hidden>
Received: from smtp-relay.tamu.edu (smtp-relay.tamu.edu [165.91.143.199])
        by unidata.ucar.edu (UCAR/Unidata) with ESMTP id i98EEgUE026934;
        Fri, 8 Oct 2004 08:14:43 -0600 (MDT)
Keywords: 200410081414.i98EEgUE026934
Received: from [192.168.1.50] (n5jxs.dsl.tamu.edu [165.91.15.31])
        (authenticated bits=0)
        by smtp-relay.tamu.edu (8.12.10/8.12.10) with ESMTP id i98EEaZJ089584
        (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO);
        Fri, 8 Oct 2004 09:14:40 -0500 (CDT)
Message-ID: <address@hidden>
Date: Fri, 08 Oct 2004 09:14:36 -0500
From: Gerry Creager N5JXS <address@hidden>
Organization: Texas A&M University -- AATLT
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.2) Gecko/20040803
X-Accept-Language: en-us, en
MIME-Version: 1.0
To: Unidata Support <address@hidden>,
       Steve Emmerson <address@hidden>
Subject: Lemme try this in a few minutes.
Content-Type: text/plain; charset=us-ascii; format=flowed
Content-Transfer-Encoding: 7bit
X-Spam-Checker-Version: SpamAssassin 2.63 (2004-01-11) on laraine.unidata.ucar.edu
X-Spam-Level:
X-Spam-Status: No, hits=-4.9 required=5.0 tests=AWL,BAYES_00 autolearn=no version=2.63

Bigbird will go down for a reboot shortly and we'll see how badly the kernel locks when I do that! I'm at the house at this instant but I'm going to reboot and head into the lab for the check-out.


<http://www.fedoraforum.org/forum/archive/index.php/t-21500.html>
