From: Kevin
To: gentoo-dev@lists.gentoo.org
Date: Tue, 18 May 2004 04:29:58 -0400
Subject: Re: [gentoo-dev] Major MCE problem with SMP on Gentoo kernels
Message-Id: <200405171951.32745.gentoo-dev@gnosys.biz>
In-Reply-To: <20040513155409.GA15192@kroah.com>
References: <793F9D20-A427-11D8-AC04-0003939E069A@mac.com> <200405130706.12534.gentoo-dev@gnosys.biz> <20040513155409.GA15192@kroah.com>

Again, thanks to all who have commented on this thread. I've now done some more testing and have some other interesting (though also confusing) results to report.

On Thursday 13 May 2004 11:54, Greg KH wrote:
> On Thu, May 13, 2004 at 07:06:12AM -0400, Kevin wrote:
> > Greg KH thinks it's bad memory,
>
> It's not only me, it's memtest86 saying it :)

True. Although it is locking up after only 1-2 minutes of operation. What conclusion should I draw from that?

> > but I'm skeptical of that because the main address that fails (some
> > 30 times in a row) is at 1023.8MB and the Dell Utilities only test up
> > to 1022MB, and because I haven't seen the problem with the liveCD
> > kernel.
>
> Maybe that's the fault of the Dell utilities. Seriously, I trust
> memtest86 over any other vendor specific test. If you don't want to
> believe it, that's fine, but I would really consider fixing that issue
> before trying to point the finger at the kernel or the Gentoo install.

You're right, Greg. I finally took your advice and did some serious testing with the DIMM sticks. This box has 4 slots, DIMMA-DIMMD, and here's what I've done:

1) swapped one 512MB stick for the other in DIMMA/DIMMB (reversed their positions)
2) removed one 512MB stick from DIMMB (configs require filling from DIMMA up)
3) removed the other 512MB stick (so that now I've tried each stick in DIMMA all by itself, and no sticks in any of the other slots)
4) completely replaced each 512MB stick with new ones from Dell and did all of 1-3 above with the new sticks.

In every case, memtest86 v3.0, memtest86 v3.1a, memtest86+ v1.0 all behave very similarly. That is, they show 1023.8MB (or 511.8MB if only one stick installed) as repeatedly failing (some 30 or 40 times), then they do either

(a) show 304.5MB failing and three more failed tests of 1023.8MB (or 511.8MB) and then the program locks up; or
(b) show three failed tests at 64.0MB, then three more at 1023.8MB (or 511.8MB), then one more failed test at 64.0MB, then one more at 0.6MB, then one more at 1023.8MB (or 511.8MB), and then the program locks up.
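
To make the pattern in those numbers explicit, here is a minimal sketch of the arithmetic, using only the figures reported above. The function and variable names are made up for illustration, and the addresses are the rounded values memtest86 displays (one decimal place), so the real offset is only known to that precision.

```python
from decimal import Decimal

# (installed memory in MB, repeatedly failing address in MB), as reported above.
# Decimal is used so the one-decimal figures subtract exactly.
observations = [
    (Decimal("512"), Decimal("511.8")),    # one 512MB stick
    (Decimal("1024"), Decimal("1023.8")),  # two 512MB sticks
]

def offset_below_top(installed_mb, failing_mb):
    """Distance in MB between the top of installed memory and the failing address."""
    return installed_mb - failing_mb

offsets = [offset_below_top(installed, failing) for installed, failing in observations]

for (installed, failing), offset in zip(observations, offsets):
    print(f"{installed} MB installed: fails at {failing} MB, {offset} MB below the top")
```

In both configurations the failing address sits the same distance below the top of installed memory, so it moves with the total amount of RAM and does not stay at one fixed address.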

Since I had the extra sticks, I also tried testing with all 4 slots filled and got very similar results to those described above, except the repeatedly failing address was 2047.8MB (in all cases, 512MB, 1024MB, and 2048MB, the repeatedly failing address is 0.2MB below the max). There are no intermittent failing addresses---there are two very specific patterns to the failures, and the program always locks up after following one pattern or the other.

In all of the memory configurations I tried, the Dell utilities reported no memory errors (or any other hardware errors).

Although I'm sure there are others here with more experience troubleshooting such problems, I'm thinking that the above is enough to base a pretty sound conclusion upon, and the conclusion I would draw is that hardware and memory are not the cause of these MCE problems. I welcome anyone contradicting that conclusion because I've never seen anything like this before and I'm at a loss on how to resolve it. I'm tempted to try replacing one of the CPUs to see if identical stepping levels (my CPU0 is stepping level 7 and CPU1 is level 9, but they are otherwise identical) will resolve the problem.

I also tried getting memtest86 (and variants) to let me turn on the ECC portion of the tests, to no avail. When I tried sizing the memory, the "probe" setting returned 1024MB, the "use bios std" setting returned 1024MB, and the "use bios all" setting locked the program up.

I also tried something else that had an enormous positive effect on the situation---I changed -march=pentium4 to -march=pentium3 in my CFLAGS and built another kernel with identical .config settings. With that kernel running, I did some 2-4 hours of solid compiling work, emerging and re-emerging packages like mysql, cyrus-sasl, cyrus-imapd, mit-krb5, openafs, etc. But unfortunately, this kernel also ended up freezing after doing more of the same, and it did so with the same error message: MCE 0000000000000004.
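
For anyone wanting to read that number by hand, here is a minimal sketch of splitting it into flag bits. It assumes (and this is only an assumption, since the message itself does not name the register) that the 16-hex-digit value is the contents of IA32_MCG_STATUS, whose three low bits Intel documents as RIPV, EIPV and MCIP. The function name is hypothetical.

```python
# Low flag bits of IA32_MCG_STATUS as documented by Intel.
# ASSUMPTION: the value printed with the MCE message is this register.
MCG_STATUS_FLAGS = {
    0: "RIPV",  # restart IP valid
    1: "EIPV",  # error IP valid
    2: "MCIP",  # machine check in progress
}

def decode_mcg_status(hex_value):
    """Return the names of the known flag bits set in a hex status string."""
    value = int(hex_value, 16)
    return [name for bit, name in sorted(MCG_STATUS_FLAGS.items())
            if value & (1 << bit)]

print(decode_mcg_status("0000000000000004"))  # prints ['MCIP']
```

Under that assumption, the value only says that a machine check was in progress and that neither instruction-pointer flag was set; it does not by itself identify which bank or component reported the error.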

I tried using parsemce.c from http://www.codemonkey.org.uk/cruft/parsemce.c/. I built it and ran it, but it wasn't very helpful and I'm not quite sure what I'm supposed to do with it.

Chris, I'm going to try your kernel. Thanks for offering that. I'll relate whatever I learn from that test.

Again, I really appreciate all the thoughtful replies on what to try next to resolve this problem. If there are any others, or if anyone has suggestions on what to try next, I'd love to hear them. Perhaps I could send my .config file to someone and they could try cross-compiling a kernel for me to try running?

Thanks again.

-- 
-Kevin

--
gentoo-dev@gentoo.org mailing list