From: Kevin
To: gentoo-dev@lists.gentoo.org
Date: Thu, 20 May 2004 08:19:09 -0400
Subject: Re: [gentoo-dev] SOLVED: Major MCE problem with SMP on Gentoo kernels
Message-Id: <200405200819.09310.gentoo-dev@gnosys.biz>
In-Reply-To: <200405191348.12015.gentoo-dev@gnosys.biz>

Hi All,

A final note to this thread. Since replacing my stepping level 7 Xeon with a stepping level 9 Xeon (so that I now have two identical CPUs, even in stepping level, which was not true before), I have been unable to reproduce my MCE 0004 error, despite many hours of high-intensity CPU activity (such as emerging many packages, which is what used to cause the MCE).
I even did this with the kernel compiled with -march=pentium4 CFLAGS, which previously caused an MCE after 5 or 10 minutes of emerging mysql when the stepping level 7 and stepping level 9 CPUs were installed together.

Naturally, I'm delighted by this. However, the whole experience has been somewhat confusing (although enlightening in many respects). Memtest86 still behaves exactly as it did before the hardware replacement. Any thoughts on why it behaves this way with this hardware (unable to set address range limits, unable to force ECC testing on, program locks up after 2 minutes of operation)? I suppose that's a question for another thread in another forum.

I seem to have suffered no hardware failures on the M/B, the CPUs (one of the old CPUs is still present; I replaced the other), or the RAM, although I suppose the stepping level 7 Xeon might have had some incredibly subtle flaw that only showed up with another CPU present. The replacement hardware seems to suffer no problems at all, in spite of what Memtest86 does (fails at 1023.8MB 30 or 40 times and then freezes).

I really appreciate all of the suggestions here. You guys convinced me that it was hardware, which is why I replaced everything, and that ultimately solved the problem, although it's not clear that there was really a hardware problem. The lesson I've learned (though I'm not sure this is really the root issue) is that when doing multi-processor computing, make sure that both processors are identical in every way, down to the stepping level. Any thoughts on the accuracy of this rule?

But the bizarre thing is that I couldn't reproduce this MCE at all using another distribution on the same (pre-replacement) hardware. Does Gentoo push the hardware much harder than other distros? Perhaps because I'm compiling the code for my particular hardware instead of running code that was built to run on many different sets of hardware (less aggressive CFLAGS, etc.)? I'm at a loss to explain this.

Again, many thanks for all the help here.

-Kevin

--
gentoo-dev@gentoo.org mailing list