From: Kevin
To: gentoo-dev@lists.gentoo.org
Date: Wed, 19 May 2004 13:48:12 -0400
Subject: Re: [gentoo-dev] Major MCE problem with SMP on Gentoo kernels
Message-Id: <200405191348.12015.gentoo-dev@gnosys.biz>
In-Reply-To: <20040518120228.GE17964%jmglov@jmglov.net>
References: <793F9D20-A427-11D8-AC04-0003939E069A@mac.com>
 <200405171951.32745.gentoo-dev@gnosys.biz>
 <20040518120228.GE17964%jmglov@jmglov.net>

Thanks again for the replies, folks.

Well, I've now replaced the system motherboard, the CPU (first tried
removing one CPU and memtest86 behaved the exact same way, then replaced
the CPU with the new one), and the RAM. Results: memtest86 and friends
all behave the exact same way.

Could this still be a hardware problem? I'm hard-pressed to believe that
I have two different motherboards that just happen to suffer from the
same flaw (they are not even the same exact version: one is version B2
and the other is version C4).

The only things that are common between the system now and the system
before are:

(1) the SCSI controller card (RAID card) (another SCSI controller was
    replaced with the m/b),
(2) 2 SCSI hard drives connected to the RAID card,
(3) a PCI hardware controller based modem, and
(4) the SCSI hot-plug backplane.

Could one of these be causing the problem?

I haven't tried reproducing my MCE 0004 error again, but memtest86 shows
no difference. Can anyone buy into the notion now that memtest86 is
doing something that it shouldn't be doing when testing this system?
Again, the Dell Utilities are all turning up flawless.

I've set the configuration in memtest86 to limit the address range it
tests to those addresses below 1022MB of RAM (this is what the Dell
utilities test with 1024MB RAM installed), but it ignores those limits
and tests up to 1024MB anyway, and that's where it's still finding its
errors (1023.8MB). I've configured memtest86 to turn on ECC testing and
it refuses to do so (when I touch (8) for restart tests, the setting
returns to off). What's going on here?

Any thoughts are most welcome. I'll be trying to reproduce my MCE error
with this new hardware, and I'll post results when I have them.

Thanks again for all the replies.

On Tuesday 18 May 2004 08:02, Josh Glover wrote:
> Quoth Kevin (Tue 2004-05-18 04:29:58AM -0400):
[...]
> > True. Although it is locking up after only 1-2 minutes of operation.
> > What conclusion should I draw from that?
>
> Bad system board. :(

I just replaced it. Still does the same thing.

>
> > Although I'm sure there are others here with more experience
> > troubleshooting such problems, I'm thinking that the above is enough
> > to base a pretty sound conclusion upon, and the conclusion I would
> > draw is that hardware and memory are not the cause of these MCE
> > problems.
>
> Wrong. memtest86 giving you errors almost always indicates a hardware
> problem. You have changed the memory, but what remained consistent? The
> memory bus!
> Try a new system board.

New system board includes a new memory bus. Still get the same results.

>
> > I also tried something else that had an enormous positive effect on
> > the situation---I changed -march=pentium4 to -march=pentium3 in my
> > CFLAGS
>
> All you have done is turn off SSE2 instructions and possibly a few
> others that the P4s have and the P3s do not. If something is wrong with
> your system board or CPU, less stress on the CPU is likely not to show
> problems as often.

That's a good point. I'll try reproducing the MCE now with the new
hardware.

>
> You have bad hardware, Kevin. Try the compile test with one CPU at a
> time (i.e. take one out), and if that is not illuminating, replace the
> system board.

Thanks again gents!

-Kevin

--
gentoo-dev@gentoo.org mailing list