* [gentoo-dev] Major MCE problem with SMP on Gentoo kernels @ 2004-05-11 18:07 Kevin 2004-05-11 18:46 ` Greg KH 2004-05-12 2:42 ` [gentoo-dev] Major MCE problem with SMP on Gentoo kernels Josh Glover 0 siblings, 2 replies; 41+ messages in thread From: Kevin @ 2004-05-11 18:07 UTC (permalink / raw To: Gentoo Dev Hi All- I'm writing here first before reporting a bug because perhaps I'm missing something important here (and because I'm not sure what details to supply if I do report a bug because I'm not sure if the problem lies with the gentoo kernels or with gcc or something else). If I am missing something, however, I'm not the only Gentoo user who's missing it, so I think that's unlikely. I saw a thread on lkml in March from somebody else with extremely similar circumstances---though not identical---and running Gentoo---he thought it was a kernel bug but I don't think so: see http://groups.google.com/groups?hl=en&lr=&ie=UTF-8&oe=ISO-8859-1&threadm=1yyJD-8mD-11% 40gated-at.bofh.it&rnum=6&prev=/groups%3Fq%3Dgroup:linux.kernel%2Bsmp% 2Bgentoo%26hl%3Den%26lr%3D%26ie%3DUTF-8%26oe%3DISO-8859-1%26sa%3DG% 26scoring%3Dd or search for "group:linux.kernel smp gentoo" on google groups, or see lkml thread: SMP + Hyperthreading / Asus PCDL Deluxe / Kernel 2.4.x 2.6.x / Crash/Freeze). Instead, I think the most likely explanation for my problem is a bug in some Gentoo code somewhere, perhaps related to building kernels, but maybe not... Maybe related to building gcc itself? Not sure. In summary, my problem is this: of those that I've tried, I can't get any Gentoo kernel to handle SMP operation during major CPU activity (like emerging packages) for more than about 5 or 10 minutes. Invariably, during such activity, I get a kernel panic---most often with words on the console about Machine Check Exception 000000...004 (this number from memory so it may be off). 
The only way that I can get reliable, stable operation with a Gentoo kernel and distribution is if I build a kernel without support for SMP. This is stable with or without hyperthreading enabled in CMOS. Over the last week or so, I've tried running kernels with the CMOS setting for hyperthreading disabled and enabled, with support for SMP enabled and disabled, in all combinations and for the latest stable ebuilds of the following Gentoo kernels: vanilla-sources, gentoo-sources, gentoo-dev-sources, gs-sources (actually, I couldn't even get this to build---see bug #48973 and thread here: gs-sources problems: device mapper: dm.o has undeclared identifiers). Going by memory, I tried kernel versions 2.4.25 (vanilla?), 2.4.26 (gentoo?), and 2.6.5 (gentoo-dev?). In all of the above circumstances, when running a kernel with support for SMP and when emerging packages (some pretty small ones, so only about 3 or 5 minutes of compiling), the machine would lock with a kernel panic and need a hard reset. The machine is a Dell PowerEdge1600SC with a PERC-3/SC SCSI RAID controller (using AMI megaraid2 driver) and a LSI Logic Corp controller (using Fusion MPT base driver) for the SCSI DAT and with dual 2.4GHz Xeon processors, each having a 512KB L2 Cache. 
Output from /proc/cpuinfo is: ======= processor : 0 vendor_id : GenuineIntel cpu family : 15 model : 2 model name : Intel(R) Xeon(TM) CPU 2.40GHz stepping : 7 cpu MHz : 2392.127 cache size : 512 KB fdiv_bug : no hlt_bug : no f00f_bug : no coma_bug : no fpu : yes fpu_exception : yes cpuid level : 2 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe cid bogomips : 4771.02 processor : 1 vendor_id : GenuineIntel cpu family : 15 model : 2 model name : Intel(R) Xeon(TM) CPU 2.40GHz stepping : 7 cpu MHz : 2392.127 cache size : 512 KB fdiv_bug : no hlt_bug : no f00f_bug : no coma_bug : no fpu : yes fpu_exception : yes cpuid level : 2 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe cid bogomips : 4771.02 processor : 2 vendor_id : GenuineIntel cpu family : 15 model : 2 model name : Intel(R) Xeon(TM) CPU 2.40GHz stepping : 9 cpu MHz : 2392.127 cache size : 512 KB fdiv_bug : no hlt_bug : no f00f_bug : no coma_bug : no fpu : yes fpu_exception : yes cpuid level : 2 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe cid bogomips : 4771.02 processor : 3 vendor_id : GenuineIntel cpu family : 15 model : 2 model name : Intel(R) Xeon(TM) CPU 2.40GHz stepping : 9 cpu MHz : 2392.127 cache size : 512 KB fdiv_bug : no hlt_bug : no f00f_bug : no coma_bug : no fpu : yes fpu_exception : yes cpuid level : 2 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe cid bogomips : 4771.02 ======= I wrote some details about this problem in gentoo-user under the thread: 2004.1 and SMP Problems, but since then, have done lots more testing. 
The reason that I think this is a Gentoo thing and not a kernel thing is that today I just finished installing SuSE9 on this same machine with CMOS hyperthreading setting enabled and the CPUs have been wailing away for hours doing simultaneous builds of several different source tarballs (bind9, kde3.2.2, mysql 4.0.18), and I haven't seen even a single problem. During these tests, I was running the SuSE kernel 2.4.21-215-smp4G. In SuSE, the output of /proc/cpuinfo is close, but not exactly the same as above. There are some differences in the flags and a couple other things (use diff for specifics). SuSE /proc/cpuinfo: ======= processor : 0 vendor_id : GenuineIntel cpu family : 15 model : 2 model name : Intel(R) Xeon(TM) CPU 2.40GHz stepping : 7 cpu MHz : 2392.795 cache size : 512 KB physical id : 0 siblings : 2 fdiv_bug : no hlt_bug : no f00f_bug : no coma_bug : no fpu : yes fpu_exception : yes cpuid level : 2 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm bogomips : 4718.59 processor : 1 vendor_id : GenuineIntel cpu family : 15 model : 2 model name : Intel(R) Xeon(TM) CPU 2.40GHz stepping : 7 cpu MHz : 2392.795 cache size : 512 KB physical id : 0 siblings : 2 fdiv_bug : no hlt_bug : no f00f_bug : no coma_bug : no fpu : yes fpu_exception : yes cpuid level : 2 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm bogomips : 4767.74 processor : 2 vendor_id : GenuineIntel cpu family : 15 model : 2 model name : Intel(R) Xeon(TM) CPU 2.40GHz stepping : 9 cpu MHz : 2392.795 cache size : 512 KB physical id : 2 siblings : 2 fdiv_bug : no hlt_bug : no f00f_bug : no coma_bug : no fpu : yes fpu_exception : yes cpuid level : 2 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm bogomips : 4767.74 processor : 3 vendor_id : GenuineIntel 
cpu family : 15 model : 2 model name : Intel(R) Xeon(TM) CPU 2.40GHz stepping : 9 cpu MHz : 2392.795 cache size : 512 KB physical id : 2 siblings : 2 fdiv_bug : no hlt_bug : no f00f_bug : no coma_bug : no fpu : yes fpu_exception : yes cpuid level : 2 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm bogomips : 4767.74 ======= Gentoo emerge info output: ======= System uname: 2.4.25-gentoo-r2 i686 Intel(R) Xeon(TM) CPU 2.40GHz Gentoo Base System version 1.4.9 distcc 2.13 i686-pc-linux-gnu (protocols 1 and 2) (default port 3632) [enabled] ccache version 2.3 [enabled] Autoconf: sys-devel/autoconf-2.58-r1 Automake: sys-devel/automake-1.8.3 ACCEPT_KEYWORDS="x86" AUTOCLEAN="yes" CFLAGS="-O3 -march=pentium4 -pipe -fomit-frame-pointer" CHOST="i686-pc-linux-gnu" COMPILER="gcc3" CONFIG_PROTECT="/etc /usr/X11R6/lib/X11/xkb /usr/kde/2/share/config /usr/kde/3.2/share/config /usr/kde/3/share/config /usr/lib/mozilla/defaults/pref /usr/share/config /usr/share/texmf/dvipdfm/config/ /usr/share/texmf/dvips/config/ /usr/share/texmf/tex/generic/config/ /usr/share/texmf/tex/platex/config/ /usr/share/texmf/xdvi/ /var/qmail/control" CONFIG_PROTECT_MASK="/etc/afs/C /etc/afs/afsws /etc/gconf /etc/terminfo /etc/env.d" CXXFLAGS="-O3 -march=pentium4 -pipe -fomit-frame-pointer" DISTDIR="/usr/portage/distfiles" FEATURES="autoaddcvs ccache distcc sandbox" GENTOO_MIRRORS="http://128.213.5.34/gentoo/ http://mirror.datapipe.net/gentoo ftp://mirrors.sec.informatik.tu-darmstadt.de/gentoo/ http://gentoo.eliteitminds.com" MAKEOPTS="-j3" PKGDIR="/usr/portage/packages" PORTAGE_TMPDIR="/var/tmp" PORTDIR="/usr/portage" PORTDIR_OVERLAY="" SYNC="rsync://rsync.namerica.gentoo.org/gentoo-portage" USE="X Xaw3d acl acpi afs alsa apache2 apm arts avi berkdb bonobo caps crypt cups doc emacs emacs-w3 encode esd ethereal evo firebird flac foomaticdb gdbm gif gnome gpm gstreamer gtk gtk2 gtkhtml guile hardened icq imagemagick imap imlib 
innodb ipv6 jabber jack java jikes jpeg kde kerberos krb4 ldap libg++ libwww mad mcal mikmod motif mozilla mpeg mysql ncurses nls odbc oggvorbis opengl oss pam pda pdflib perl plotutils png ppds prelude python qt quicktime readline ruby samba sasl sdl slang slp spell sse ssl svga tcltk tcpd tetex tiff truetype unicode usb vhosts x86 xinerama xml2 xmms xv zeo zlib" ======= I'm a recent Gentoo convert. I think it's an excellent improvement on the traditional Linux distros, and I'd really like to use it on my server, but as long as this problem with SMP is present, I just can't. If anyone has any suggestions on what I might be doing wrong and how I can get a stable gentoo system with full support for SMP (ideally in a 2.4.x kernel since I need OpenAFS and OpenAFS doesn't work with 2.6.x right now; nor for the near future say the developers of OAFS), I would really appreciate getting your thoughts. Thanks in advance. -- -Kevin PS. FWIW, I'll add that I have a very vague memory while watching text fly up the screen during bootstrap.sh or emerge system (this was a stage 1 install before I installed SuSE over it) of seeing some warning about something being unsafe with SMP. Do I need to have some setting or other turned off for some parts of a stage 1 install with a dual CPU system? -- gentoo-dev@gentoo.org mailing list ^ permalink raw reply [flat|nested] 41+ messages in thread
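The "use diff for specifics" comparison of the two /proc/cpuinfo dumps can be sketched in a few lines of Python. This is only an illustration: the two flag strings are copied from the Gentoo and SuSE output quoted in this message, and the helper name `parse_flags` is made up for the example.

```python
# Hypothetical helper: compare the "flags" lines of two /proc/cpuinfo dumps.
# The two strings are copied from the Gentoo and SuSE output in this message.

GENTOO_FLAGS = (
    "flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov "
    "pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe cid"
)
SUSE_FLAGS = (
    "flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov "
    "pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm"
)


def parse_flags(line):
    """Return the set of CPU flags from one 'flags : ...' line."""
    _, _, value = line.partition(":")
    return set(value.split())


gentoo = parse_flags(GENTOO_FLAGS)
suse = parse_flags(SUSE_FLAGS)

# Flags reported by one kernel but not by the other.
only_gentoo = sorted(gentoo - suse)
only_suse = sorted(suse - gentoo)

print("only in Gentoo output:", only_gentoo)
print("only in SuSE output:  ", only_suse)
```

The flag lists come from two different kernel versions, so a difference in which names get reported may simply reflect that rather than anything about the hardware.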
* Re: [gentoo-dev] Major MCE problem with SMP on Gentoo kernels
From: Greg KH @ 2004-05-11 18:46 UTC
To: Kevin; +Cc: Gentoo Dev

On Tue, May 11, 2004 at 02:07:58PM -0400, Kevin wrote:
> In summary, my problem is this: of those that I've tried, I can't get any
> Gentoo kernel to handle SMP operation during major CPU activity (like
> emerging packages) for more than about 5 or 10 minutes. Invariably,
> during such activity, I get a kernel panic---most often with words on the
> console about Machine Check Exception 000000...004 (this number from
> memory so it may be off).

This means you have bad hardware (memory, cpu, overheating, etc.)
It's the hardware saying that something bad just happened, nothing the OS
or distro did wrong here.

I suggest you track down that exact error message and determine what it
means (there are tables and a tool that does that, sorry I can't remember
what it is...)

Good luck,

greg k-h
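For tracking down what the number on the panic line means, a rough decoder may help. The sketch below assumes the value printed after "Machine Check Exception:" is the IA32_MCG_STATUS register, which is what the 2.4/2.6 i386 machine-check handlers appear to print; that is an assumption, not something confirmed in this thread. The bit names (RIPV, EIPV, MCIP) follow Intel's architecture manuals. The function name and the sample value (the from-memory number quoted above) are for illustration only.

```python
# Sketch: decode the low bits of IA32_MCG_STATUS (global machine-check status).
# Assumption: the hex number on the panic line is this register's value.

MCG_STATUS_BITS = {
    0: ("RIPV", "restart IP valid: execution can resume at the saved IP"),
    1: ("EIPV", "error IP valid: saved IP is associated with the error"),
    2: ("MCIP", "machine check in progress"),
}


def decode_mcg_status(value):
    """Return the names of the flags set in an MCG_STATUS value."""
    return [
        name
        for bit, (name, _description) in sorted(MCG_STATUS_BITS.items())
        if value & (1 << bit)
    ]


# The number quoted from memory in the thread, padded to 64 bits.
status = int("0000000000000004", 16)
print(decode_mcg_status(status))
```

If the value really ended in ...004, only MCIP would be set, which by itself says little about the cause. Any per-bank status lines printed after it would carry the detail, so the full console text is worth writing down.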
* Re: [gentoo-dev] Major MCE problem with SMP on Gentoo kernels 2004-05-11 18:46 ` Greg KH @ 2004-05-11 18:55 ` Kevin 2004-05-11 19:04 ` Greg KH 2004-05-11 19:38 ` Paul de Vrieze 0 siblings, 2 replies; 41+ messages in thread From: Kevin @ 2004-05-11 18:55 UTC (permalink / raw To: gentoo-dev On Tuesday 11 May 2004 14:46, Greg KH wrote: > On Tue, May 11, 2004 at 02:07:58PM -0400, Kevin wrote: > > In summary, my problem is this: of those that I've tried, I can't > > get any Gentoo kernel to handle SMP operation during major CPU > > activity (like emerging packages) for more than about 5 or 10 > > minutes. Invariably, during such activity, I get a kernel > > panic---most often with words on the console about Machine Check > > Exception 000000...004 (this number from memory so it may be off). > > This means you have bad hardware (memory, cpu, overheating, etc.) > It's the hardware saying that something bad just happened, nothing > the OS or distro did wrong here. Thanks for your reply, Greg. Although what you say here may be true in some circumstances, I think you're wrong in this case. You may have stopped reading after the above paragraph, but in the rest of my post, I describe how a SuSE9 distro installed on this same hardware has no problems doing all of the things that failed in Gentoo. That's a pretty strong indication that there are no hardware problems, isn't it? -Kevin -- gentoo-dev@gentoo.org mailing list ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [gentoo-dev] Major MCE problem with SMP on Gentoo kernels
From: Greg KH @ 2004-05-11 19:04 UTC
To: Kevin; +Cc: gentoo-dev

On Tue, May 11, 2004 at 02:55:47PM -0400, Kevin wrote:
> On Tuesday 11 May 2004 14:46, Greg KH wrote:
> > On Tue, May 11, 2004 at 02:07:58PM -0400, Kevin wrote:
> > > In summary, my problem is this: of those that I've tried, I can't
> > > get any Gentoo kernel to handle SMP operation during major CPU
> > > activity (like emerging packages) for more than about 5 or 10
> > > minutes. Invariably, during such activity, I get a kernel
> > > panic---most often with words on the console about Machine Check
> > > Exception 000000...004 (this number from memory so it may be off).
> >
> > This means you have bad hardware (memory, cpu, overheating, etc.)
> > It's the hardware saying that something bad just happened, nothing
> > the OS or distro did wrong here.
>
> Thanks for your reply, Greg. Although what you say here may be true in
> some circumstances, I think you're wrong in this case. You may have
> stopped reading after the above paragraph, but in the rest of my post,
> I describe how a SuSE9 distro installed on this same hardware has no
> problems doing all of the things that failed in Gentoo. That's a
> pretty strong indication that there are no hardware problems, isn't it?

Not at all. Different compilers/kernels/programs exercise hardware in
very different ways. It could be that your compiler settings for Gentoo
cause different instructions to be used for the same program than on SuSE.

Try running memtest86 overnight as a good start to rule out your memory.

Good luck,

greg k-h
* Re: [gentoo-dev] Major MCE problem with SMP on Gentoo kernels 2004-05-11 19:04 ` Greg KH @ 2004-05-11 19:38 ` Kevin 2004-05-11 20:54 ` Chris Gianelloni 0 siblings, 1 reply; 41+ messages in thread From: Kevin @ 2004-05-11 19:38 UTC (permalink / raw To: gentoo-dev On Tuesday 11 May 2004 15:04, Greg KH wrote: > On Tue, May 11, 2004 at 02:55:47PM -0400, Kevin wrote: > > Thanks for your reply, Greg. Although what you say here may be > > true in some circumstances, I think you're wrong in this case. You > > may have stopped reading after the above paragraph, but in the rest > > of my post, I describe how a SuSE9 distro installed on this same > > hardware has no problems doing all of the things that failed in > > Gentoo. That's a pretty strong indication that there are no > > hardware problems, isn't it? > > Not at all. Different compilers/kernels/programs exercise hardware > in very different ways. It could be that your compiler settings for > Gentoo causes different instructions to be used for the same program > on SuSE. > > Try running memtest86 overnight as a good start to rule out your > memory. Ok. Thanks for the suggestion. But what about this: Dell has a utility partition and some programs for doing exhaustive testing of all the hardware in the server. If I run the most thorough set of tests available in this utility partition and I get a clean bill of health, is that a reliable indication that there are no hardware problems? Or does memtest86 do testing that's more exhaustive than most such utility suites? If the utility partition testing says all is well (I've done it several times in the last month or so, though maybe not the most extensive tests), what's the next place to look for an explanation of why this MCE is happening in Gentoo but not in SuSE? -Kevin -- gentoo-dev@gentoo.org mailing list ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [gentoo-dev] Major MCE problem with SMP on Gentoo kernels
From: Chris Gianelloni @ 2004-05-11 20:54 UTC
To: Kevin; +Cc: gentoo-dev

On Tue, 2004-05-11 at 15:38, Kevin wrote:
> Ok. Thanks for the suggestion. But what about this: Dell has a utility
> partition and some programs for doing exhaustive testing of all the
> hardware in the server. If I run the most thorough set of tests
> available in this utility partition and I get a clean bill of health,
> is that a reliable indication that there are no hardware problems? Or
> does memtest86 do testing that's more exhaustive than most such utility
> suites?

I think the Dell suite would be more extensive.

> If the utility partition testing says all is well (I've done it several
> times in the last month or so, though maybe not the most extensive
> tests), what's the next place to look for an explanation of why this
> MCE is happening in Gentoo but not in SuSE?

Are you sure that it isn't MCE *causing* these problems? Have you tried
turning it off and seeing if you still have the same kinds of problems?

--
Chris Gianelloni
Developer, Gentoo Linux
Games Team

Is your power animal a penguin?
* Re: [gentoo-dev] Major MCE problem with SMP on Gentoo kernels 2004-05-11 20:54 ` Chris Gianelloni @ 2004-05-11 21:31 ` Kevin 0 siblings, 0 replies; 41+ messages in thread From: Kevin @ 2004-05-11 21:31 UTC (permalink / raw To: gentoo-dev On Tuesday 11 May 2004 16:54, Chris Gianelloni wrote: > On Tue, 2004-05-11 at 15:38, Kevin wrote: > > Ok. Thanks for the suggestion. But what about this: Dell has a > > utility partition and some programs for doing exhaustive testing of > > all the hardware in the server. If I run the most thorough set of > > tests available in this utility partition and I get a clean bill of > > health, is that a reliable indication that there are no hardware > > problems? Or does memtest86 do testing that's more exhaustive than > > most such utility suites? > > I think the Dell suite would be more extensive. Thanks for saying so, Chris. > > > If the utility partition testing says all is well (I've done it > > several times in the last month or so, though maybe not the most > > extensive tests), what's the next place to look for an explanation > > of why this MCE is happening in Gentoo but not in SuSE? > > Are you sure that it isn't MCE *causing* these problems? Have you > tried turning it off and seeing if you still have the same kinds of > problems? I'm not sure I understand what you mean by that. The first time I got a kernel panic and MCE, I believe that the kernel I was running had no configured capability to deal with MCE errors (though I'm not sure of that). I had never seen an MCE before, but after this first time, with any other kernels I built, I searched through the .config file options for handlers of MCE errors and built them into the kernel where they were available. IIRC, then when I got a kernel panic with those kernels, I had some more information (apparently generated by the kernel) on the console than I did with the first MCE. 
I add this information in case it relates to your question or point here, but I'm really not sure what you mean by, "Have you tried turning it off..." Where do I turn it off? Do you mean the .config file parameter in the kernel configuration process that builds (or not) a handler for the MCE errors? Or do you mean something else? Honestly, I'm thinking that I may have somehow built some software (during the stage 1 installation process) that is causing these problems, but I followed the Gentoo Handbook for doing a stage 1 installation pretty rigidly, so I'm not sure what I might have done to cause that. When I did the bootstrap.sh and emerge system, I was running the kernel that I booted from the boot CD (2004.0 I think, and probably even the smp kernel that was on that CD---IIRC, the 2004.1 boot CD has some problems that prevent the use of the smp kernel on that CD). In fact, now that I think of it, I'm pretty sure I didn't get any MCE kernel panics until after I finished emerge system and other tasks and then rebooted my new Gentoo system. Perhaps this helps isolate the cause of the problems. While I was doing the bootstrap.sh and emerge system, it's definitely true that I was stressing the system out with lots of compile jobs (which is what has been triggering my MCEs), but I'm pretty sure I did not get any MCE failures during those steps. Does this help someone figure out what's going on in my case? Are there some compiler flags or other configurable settings that, if set to certain values during the bootstrap.sh or emerge system steps, could end up generating software (perhaps when I built my own gcc?) that would cause these MCEs to be thrown? Like I said in my PS in my first post, I have this vague memory of seeing something that said, such-and-such is not smp safe. Have no clue what that might have been now, though, or even if it's an accurate memory. Some of this work was done in the wee hours... Thanks for the replies and any other suggestions. 
-Kevin
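On the "where do I turn it off" question, one place to look is the kernel .config. The sketch below only lists the machine-check options found in a config file. The option names in the sample (CONFIG_X86_MCE and related) appear to be the ones used by 2.6-era i386 kernels, but check your own tree. The sample text and the function name `mce_options` are illustrative and are not Kevin's actual configuration.

```python
import re

# Illustrative .config excerpt, not taken from the thread.
SAMPLE_CONFIG = """\
CONFIG_SMP=y
CONFIG_X86_MCE=y
CONFIG_X86_MCE_NONFATAL=y
# CONFIG_X86_MCE_P4THERMAL is not set
"""


def mce_options(config_text):
    """Map each CONFIG_*MCE* option to its value ('n' if 'is not set')."""
    options = {}
    for line in config_text.splitlines():
        # Enabled options look like NAME=value.
        match = re.match(r"^(CONFIG_\w*MCE\w*)=(.*)$", line)
        if match:
            options[match.group(1)] = match.group(2)
            continue
        # Disabled options are recorded as a comment line.
        match = re.match(r"^# (CONFIG_\w*MCE\w*) is not set$", line)
        if match:
            options[match.group(1)] = "n"
    return options


for name, value in mce_options(SAMPLE_CONFIG).items():
    print(name, "=", value)
```

32-bit x86 kernels of this era also document a `nomce` boot parameter, which may be a quicker way to test without rebuilding. Either way, turning off the handler only changes what the kernel reports; it does not address whatever raised the exception.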
* Re: [gentoo-dev] Major MCE problem with SMP on Gentoo kernels 2004-05-11 18:55 ` Kevin 2004-05-11 19:04 ` Greg KH @ 2004-05-11 19:38 ` Paul de Vrieze 2004-05-11 21:37 ` Kevin 1 sibling, 1 reply; 41+ messages in thread From: Paul de Vrieze @ 2004-05-11 19:38 UTC (permalink / raw To: gentoo-dev [-- Attachment #1: Type: text/plain, Size: 1437 bytes --] On Tuesday 11 May 2004 20:55, Kevin wrote: > On Tuesday 11 May 2004 14:46, Greg KH wrote: > > On Tue, May 11, 2004 at 02:07:58PM -0400, Kevin wrote: > > > In summary, my problem is this: of those that I've tried, I can't > > > get any Gentoo kernel to handle SMP operation during major CPU > > > activity (like emerging packages) for more than about 5 or 10 > > > minutes. Invariably, during such activity, I get a kernel > > > panic---most often with words on the console about Machine Check > > > Exception 000000...004 (this number from memory so it may be off). > > > > This means you have bad hardware (memory, cpu, overheating, etc.) > > It's the hardware saying that something bad just happened, nothing > > the OS or distro did wrong here. > > Thanks for your reply, Greg. Although what you say here may be true in > some circumstances, I think you're wrong in this case. You may have > stopped reading after the above paragraph, but in the rest of my post, > I describe how a SuSE9 distro installed on this same hardware has no > problems doing all of the things that failed in Gentoo. That's a > pretty strong indication that there are no hardware problems, isn't it? Do you also have errors when you run a vanilla kernel? What if you take the kernel from SUSE, which compiler do you use? Paul -- Paul de Vrieze Gentoo Developer Mail: pauldv@gentoo.org Homepage: http://www.devrieze.net [-- Attachment #2: signature --] [-- Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [gentoo-dev] Major MCE problem with SMP on Gentoo kernels 2004-05-11 19:38 ` Paul de Vrieze @ 2004-05-11 21:37 ` Kevin 2004-05-12 1:02 ` Georgi Georgiev 0 siblings, 1 reply; 41+ messages in thread From: Kevin @ 2004-05-11 21:37 UTC (permalink / raw To: gentoo-dev On Tuesday 11 May 2004 15:38, Paul de Vrieze wrote: > On Tuesday 11 May 2004 20:55, Kevin wrote: > > > > Thanks for your reply, Greg. Although what you say here may be > > true in some circumstances, I think you're wrong in this case. You > > may have stopped reading after the above paragraph, but in the rest > > of my post, I describe how a SuSE9 distro installed on this same > > hardware has no problems doing all of the things that failed in > > Gentoo. That's a pretty strong indication that there are no > > hardware problems, isn't it? > > Do you also have errors when you run a vanilla kernel? Yes, I mentioned that in my first post. Or by vanilla do you mean a kernel from kernel.org (as opposed to the Gentoo vanilla-sources kernel---that's the one I tried; isn't that identical to the kernel from kernel.org?) > What if you > take the kernel from SUSE, I haven't tried installing Gentoo with my SuSE kernel running. Huh... what a concept. With all the modularity of those default distro kernels, would that even work? Maybe I'd need the kernel, the System.map, and the /lib/modules/`uname -r` directory? > which compiler do you use? I built the standard compiler that you get with ACCEPT_KEYWORDS="x86" (stable). gcc and friends. Whatever is the standard stable ebuild is the one I built. Thanks. -Kevin -- gentoo-dev@gentoo.org mailing list ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [gentoo-dev] Major MCE problem with SMP on Gentoo kernels 2004-05-11 21:37 ` Kevin @ 2004-05-12 1:02 ` Georgi Georgiev 2004-05-12 10:23 ` [gentoo-dev] [OT] SuSE kernel on gentoo system (Was: Re: Major MCE problem with SMP on Gentoo kernels) sf 0 siblings, 1 reply; 41+ messages in thread From: Georgi Georgiev @ 2004-05-12 1:02 UTC (permalink / raw To: gentoo-dev maillog: 11/05/2004-17:37:39(-0400): Kevin types > On Tuesday 11 May 2004 15:38, Paul de Vrieze wrote: > > On Tuesday 11 May 2004 20:55, Kevin wrote: > > > > > > Thanks for your reply, Greg. Although what you say here may be > > > true in some circumstances, I think you're wrong in this case. You > > > may have stopped reading after the above paragraph, but in the rest > > > of my post, I describe how a SuSE9 distro installed on this same > > > hardware has no problems doing all of the things that failed in > > > Gentoo. That's a pretty strong indication that there are no > > > hardware problems, isn't it? > > > > Do you also have errors when you run a vanilla kernel? > > Yes, I mentioned that in my first post. Or by vanilla do you mean a > kernel from kernel.org (as opposed to the Gentoo vanilla-sources > kernel---that's the one I tried; isn't that identical to the kernel > from kernel.org?) You didn't try development-sources, according to your original post. You only mention 2.6.5 gentoo-sources. What about a vanilla 2.6.5 (2.6.6 already)? > > What if you > > take the kernel from SUSE, > > I haven't tried installing Gentoo with my SuSE kernel running. Huh... > what a concept. With all the modularity of those default distro > kernels, would that even work? Maybe I'd need the kernel, the > System.map, and the /lib/modules/`uname -r` directory? You don't even need the System.map. -- *- Georgi Georgiev *- By golly, I'm beginning to think Linux *- -* chutz@gg3.net -* really *is* the best thing since sliced -* *- +81(90)6266-1163 *- bread. 
*-
* [gentoo-dev] [OT] SuSE kernel on gentoo system (Was: Re: Major MCE problem with SMP on Gentoo kernels)
From: sf @ 2004-05-12 10:23 UTC
To: gentoo-dev

Georgi Georgiev wrote:
> maillog: 11/05/2004-17:37:39(-0400): Kevin types
...
>>I haven't tried installing Gentoo with my SuSE kernel running. Huh...
>>what a concept. With all the modularity of those default distro
>>kernels, would that even work? Maybe I'd need the kernel, the
>>System.map, and the /lib/modules/`uname -r` directory?
>
> You don't even need the System.map.

But you most likely will need an initrd. I am using the custom 2.6.4
kernel from SuSE 9.1 (it has working ISDN support for AVM Fritz cards with
AVM's binary modules) on one of my Gentoo systems. This kernel does not
have reiserfs compiled in. If you want to use SuSE's initrd, you have to
set up udev as well. So far everything works perfectly.

Regards
Stephan
* Re: [gentoo-dev] Major MCE problem with SMP on Gentoo kernels 2004-05-11 18:07 [gentoo-dev] Major MCE problem with SMP on Gentoo kernels Kevin 2004-05-11 18:46 ` Greg KH @ 2004-05-12 2:42 ` Josh Glover 2004-05-12 9:31 ` Dan Podeanu 2004-05-12 11:24 ` Kevin 1 sibling, 2 replies; 41+ messages in thread From: Josh Glover @ 2004-05-12 2:42 UTC (permalink / raw To: Gentoo Dev [-- Attachment #1: Type: text/plain, Size: 6654 bytes --] Quoth Kevin (Tue 2004-05-11 02:07:58PM -0400): > In summary, my problem is this: of those that I've tried, I can't get any > Gentoo kernel to handle SMP operation during major CPU activity (like > emerging packages) for more than about 5 or 10 minutes. Invariably, > during such activity, I get a kernel panic---most often with words on the > console about Machine Check Exception 000000...004 (this number from > memory so it may be off). [...] > The only way that I can get reliable, stable operation with a Gentoo > kernel and distribution is if I build a kernel without support for SMP. [...] > The machine is a Dell PowerEdge1600SC with a PERC-3/SC SCSI RAID > controller (using AMI megaraid2 driver) and a LSI Logic Corp controller > (using Fusion MPT base driver) for the SCSI DAT and with dual 2.4GHz Xeon > processors, each having a 512KB L2 Cache. 
Running Gentoo with a 2.6.5 SMP kernel on a Dell PowerEdge 400SC: : jmglov@jglover; uname -a Linux jglover 2.6.5-gentoo-r1 #1 SMP Fri Apr 30 17:37:18 EDT 2004 i686 Intel(R) Pentium(R) 4 CPU 2.40GHz GenuineIntel GNU/Linux > Output from /proc/cpuinfo is: <snip> : jmglov@jglover; cat /proc/cpuinfo processor : 0 vendor_id : GenuineIntel cpu family : 15 model : 2 model name : Intel(R) Pentium(R) 4 CPU 2.40GHz stepping : 9 cpu MHz : 2395.027 cache size : 512 KB physical id : 0 siblings : 2 fdiv_bug : no hlt_bug : no f00f_bug : no coma_bug : no fpu : yes fpu_exception : yes cpuid level : 2 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe cid bogomips : 4718.59 processor : 1 vendor_id : GenuineIntel cpu family : 15 model : 2 model name : Intel(R) Pentium(R) 4 CPU 2.40GHz stepping : 9 cpu MHz : 2395.027 cache size : 512 KB physical id : 0 siblings : 2 fdiv_bug : no hlt_bug : no f00f_bug : no coma_bug : no fpu : yes fpu_exception : yes cpuid level : 2 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe cid bogomips : 4767.74 > I'm a recent Gentoo convert. I think it's an excellent improvement on the > traditional Linux distros, and I'd really like to use it on my server, > but as long as this problem with SMP is present, I just can't. I really do not think it is a Gentoo issue. I have run Gentoo on quite a few SMP boxen over the past several years, and never had problems like you describe. Sounds like a hardware issue to me, unless you are using (or were using) some really bogus CFLAGS. > PS. FWIW, I'll add that I have a very vague memory while watching text fly > up the screen during bootstrap.sh or emerge system (this was a stage 1 > install before I installed SuSE over it) of seeing some warning about > something being unsafe with SMP. 
Do I need to have some setting or other > turned off for some parts of a stage 1 install with a dual CPU system? Nope. Quoth Kevin (Tue, 11 May 2004 15:38:35 -0400): > Ok. Thanks for the suggestion. But what about this: Dell has a utility > partition and some programs for doing exhaustive testing of all the > hardware in the server. If I run the most thorough set of tests > available in this utility partition and I get a clean bill of health, > is that a reliable indication that there are no hardware problems? Nope. Tragically, it usually works the other way around: hardware test suites are unlikely to report a fault that is not there (a false positive), but if your hardware passes, that does not mean you are safe. Your issue might be heat-related, in which case your CPUs would have to heat up for quite some time before they choke. Combine this with some other issue (maybe you optimised a bit aggressively when building your kernel?), and you have a tricky issue for a hardware tester to catch. Quoth Kevin (Tue, 11 May 2004 17:31:32 -0400): > Honestly, I'm thinking that I may have somehow built some software > (during the stage 1 installation process) that is causing these > problems, but I followed the Gentoo Handbook for doing a stage 1 > installation pretty rigidly, so I'm not sure what I might have done to > cause that. Why did you do a Stage 1, just out of curiosity? I recommend doing at least one Stage 1 install for newcomers to Gentoo, just for educational purposes, but after that, go Stage 3 and use as many binary packages as you can! There exists a Stage 3 tarball for your architecture--the Pentium4 one, so why not use that, just to make sure your base system is solid? > When I did the bootstrap.sh and emerge system, I was > running the kernel that I booted from the boot CD (2004.0 I think, and > probably even the smp kernel that was on that CD---IIRC, the 2004.1 > boot CD has some problems that prevent the use of the smp kernel on > that CD).
I don't remember that, but I cannot say for certain that I have tried the 2004.1 universal x86 CD with the SMP kernel. > Are there some compiler flags or other configurable settings that, if > set to certain values during the bootstrap.sh or emerge system steps, > could end up generating software (perhaps when I built my own gcc?) > that would cause these MCEs to be thrown? I dunno, why don't you post your CFLAGS and MAKEOPTS from your make.conf here? Quoth Kevin (Tue, 11 May 2004 17:37:39 -0400): > On Tuesday 11 May 2004 15:38, Paul de Vrieze wrote: > >> What if you take the kernel from SUSE, > > I haven't tried installing Gentoo with my SuSE kernel running. Huh... > what a concept. With all the modularity of those default distro > kernels, would that even work? Maybe I'd need the kernel, the > System.map, and the /lib/modules/`uname -r` directory? Yes, you can install Gentoo while running *any* kernel. As long as you can chroot, you can install Gentoo. See my Faketoo for an example: http://forums.gentoo.org/viewtopic.php?p=1082580 Note that I do not actually build a kernel and setup the bootloader and so forth, since I do not need to boot my jailed Gentoo installation--it is just for ebuild development. However, nothing is stopping *you* from doing it. :) -- Josh Glover GPG keyID 0xDE8A3103 (C3E4 FA9E 1E07 BBDB 6D8B 07AB 2BF1 67A1 DE8A 3103) gpg --keyserver pgp.mit.edu --recv-keys DE8A3103 [-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [gentoo-dev] Major MCE problem with SMP on Gentoo kernels 2004-05-12 2:42 ` [gentoo-dev] Major MCE problem with SMP on Gentoo kernels Josh Glover @ 2004-05-12 9:31 ` Dan Podeanu 2004-05-12 11:26 ` Kevin 2004-05-12 11:24 ` Kevin 1 sibling, 1 reply; 41+ messages in thread From: Dan Podeanu @ 2004-05-12 9:31 UTC (permalink / raw To: Gentoo Dev Some (Greg KH and others) will say I'm crazy, but... have you tried compiling the kernel with gcc 2.95.3 instead of the latest Gentoo stable gcc, 3.3.2? In my case it -has- helped on more than a couple of occasions. Cheers, Dan.
* Re: [gentoo-dev] Major MCE problem with SMP on Gentoo kernels 2004-05-12 9:31 ` Dan Podeanu @ 2004-05-12 11:26 ` Kevin 0 siblings, 0 replies; 41+ messages in thread From: Kevin @ 2004-05-12 11:26 UTC (permalink / raw To: Gentoo Dev On Wednesday 12 May 2004 05:31, Dan Podeanu wrote: > Some (Greg KH and others) will say I'm crazy, but.. have you tried to > compile the kernel with 2.95.3 instead of the latest gentoo stable, > 3.3.2 ? In my case it -has- helped on more than a couple of occasions. Haven't tried that, but thanks for the suggestion and the hints of your experience, Dan. -- -Kevin -- gentoo-dev@gentoo.org mailing list ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [gentoo-dev] Major MCE problem with SMP on Gentoo kernels 2004-05-12 2:42 ` [gentoo-dev] Major MCE problem with SMP on Gentoo kernels Josh Glover 2004-05-12 9:31 ` Dan Podeanu @ 2004-05-12 11:24 ` Kevin 2004-05-12 11:48 ` Josh Glover 1 sibling, 1 reply; 41+ messages in thread From: Kevin @ 2004-05-12 11:24 UTC (permalink / raw To: Gentoo Dev Thanks for your reply, Josh. On Tuesday 11 May 2004 22:42, Josh Glover wrote: > Quoth Kevin (Tue 2004-05-11 02:07:58PM -0400): > > The machine is a Dell PowerEdge1600SC with a PERC-3/SC SCSI RAID > > controller (using AMI megaraid2 driver) and a LSI Logic Corp > > controller (using Fusion MPT base driver) for the SCSI DAT and with > > dual 2.4GHz Xeon processors, each having a 512KB L2 Cache. > > Running Gentoo with a 2.6.5 SMP kernel on a Dell PowerEdge 400SC: > : jmglov@jglover; uname -a > > Linux jglover 2.6.5-gentoo-r1 #1 SMP Fri Apr 30 17:37:18 EDT 2004 i686 > Intel(R) Pentium(R) 4 CPU 2.40GHz GenuineIntel GNU/Linux > > : jmglov@jglover; cat /proc/cpuinfo > > processor : 0 > vendor_id : GenuineIntel > cpu family : 15 > model : 2 > model name : Intel(R) Pentium(R) 4 CPU 2.40GHz > stepping : 9 > cpu MHz : 2395.027 > cache size : 512 KB > physical id : 0 > siblings : 2 > fdiv_bug : no > hlt_bug : no > f00f_bug : no > coma_bug : no > fpu : yes > fpu_exception : yes > cpuid level : 2 > wp : yes > flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge > mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe cid > bogomips : 4718.59 > > processor : 1 > vendor_id : GenuineIntel > cpu family : 15 > model : 2 > model name : Intel(R) Pentium(R) 4 CPU 2.40GHz > stepping : 9 > cpu MHz : 2395.027 > cache size : 512 KB > physical id : 0 > siblings : 2 > fdiv_bug : no > hlt_bug : no > f00f_bug : no > coma_bug : no > fpu : yes > fpu_exception : yes > cpuid level : 2 > wp : yes > flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge > mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe cid > 
bogomips : 4767.74 Have you turned off hyperthreading? Why is it that only two CPUs show up? It looks like (flags include ht) the CPUs support hyperthreading... or am I way off base in drawing that conclusion here? > > > I'm a recent Gentoo convert. I think it's an excellent improvement > > on the traditional Linux distros, and I'd really like to use it on my > > server, but as long as this problem with SMP is present, I just > > can't. > > I really do not think it is a Gentoo issue. I have run Gentoo on quite > a few SMP boxen over the past several years, and never had problems > like you describe. Sounds like a hardware issue to me, unless you are > using (or were using) some really bogus CFLAGS. Well, last night I did the most exhaustive set of tests available on the Dell Utility Partition and found zero errors. I did 16 loops of the MATS test and 15 of TestY, TestA, and ECC coupling. The memory tests ran for about 8 hours. Again, zero errors. > Quoth Kevin (Tue, 11 May 2004 17:31:32 -0400): > > Honestly, I'm thinking that I may have somehow built some software > > (during the stage 1 installation process) that is causing these > > problems, but I followed the Gentoo Handbook for doing a stage 1 > > installation pretty rigidly, so I'm not sure what I might have done > > to cause that. > > Why did you do a Stage 1, just out of curiousity. Some silly sense of pride or accomplishment that every bit of code running on the box was actually compiled on it. I really like that Gentoo makes that option a practical one. Silly, I know... > I recommend doing at > least one Stage 1 install for newcomers to Gentoo, just for educational > purposes, but after that, go Stage 3 and use as many binary packages as > you can! There exists a Stage 3 tarball for your architecture--the > Pentium4 one, so why not use that, just to make sure your base system > is solid? 
That probably is a good thing for me to try next, and I really would like to be able to use all the advantages of Gentoo on this server. I'm fed up with rpm hell and the traditional distros drawbacks. > > > When I did the bootstrap.sh and emerge system, I was > > running the kernel that I booted from the boot CD (2004.0 I think, > > and probably even the smp kernel that was on that CD---IIRC, the > > 2004.1 boot CD has some problems that prevent the use of the smp > > kernel on that CD). > > I don't remember that, but I cannot say for certain that I have tried > the 2004.1 universal x86 CD with the SMP kernel. It may have changed (I did this just a few days after the 2004.1 CD came out, and I think it was the minimal one). > > > Are there some compiler flags or other configurable settings that, if > > set to certain values during the bootstrap.sh or emerge system steps, > > could end up generating software (perhaps when I built my own gcc?) > > that would cause these MCEs to be thrown? > > I dunno, why don't you post your CFLAGS and MAKEOPTS from your > make.conf here? 
Did already :) Here they are again: On Tuesday 11 May 2004 14:07, Kevin wrote: > Gentoo emerge info output: > ======= > System uname: 2.4.25-gentoo-r2 i686 Intel(R) Xeon(TM) CPU 2.40GHz > Gentoo Base System version 1.4.9 > distcc 2.13 i686-pc-linux-gnu (protocols 1 and 2) (default port 3632) > [enabled] > ccache version 2.3 [enabled] > Autoconf: sys-devel/autoconf-2.58-r1 > Automake: sys-devel/automake-1.8.3 > ACCEPT_KEYWORDS="x86" > AUTOCLEAN="yes" > CFLAGS="-O3 -march=pentium4 -pipe -fomit-frame-pointer" > CHOST="i686-pc-linux-gnu" > COMPILER="gcc3" > CONFIG_PROTECT="/etc /usr/X11R6/lib/X11/xkb /usr/kde/2/share/config > /usr/kde/3.2/share/config /usr/kde/3/share/config > /usr/lib/mozilla/defaults/pref /usr/share/config > /usr/share/texmf/dvipdfm/config/ /usr/share/texmf/dvips/config/ > /usr/share/texmf/tex/generic/config/ > /usr/share/texmf/tex/platex/config/ /usr/share/texmf/xdvi/ > /var/qmail/control" CONFIG_PROTECT_MASK="/etc/afs/C /etc/afs/afsws > /etc/gconf /etc/terminfo /etc/env.d" CXXFLAGS="-O3 -march=pentium4 > -pipe -fomit-frame-pointer" > DISTDIR="/usr/portage/distfiles" > FEATURES="autoaddcvs ccache distcc sandbox" > GENTOO_MIRRORS="http://128.213.5.34/gentoo/ > http://mirror.datapipe.net/gentoo > ftp://mirrors.sec.informatik.tu-darmstadt.de/gentoo/ > http://gentoo.eliteitminds.com" > MAKEOPTS="-j3" > PKGDIR="/usr/portage/packages" > PORTAGE_TMPDIR="/var/tmp" > PORTDIR="/usr/portage" > PORTDIR_OVERLAY="" > SYNC="rsync://rsync.namerica.gentoo.org/gentoo-portage" > USE="X Xaw3d acl acpi afs alsa apache2 apm arts avi berkdb bonobo caps > crypt > cups doc emacs emacs-w3 encode esd ethereal evo firebird flac > foomaticdb gdbm > gif gnome gpm gstreamer gtk gtk2 gtkhtml guile hardened icq imagemagick > imap > imlib innodb ipv6 jabber jack java jikes jpeg kde kerberos krb4 ldap > libg++ > libwww mad mcal mikmod motif mozilla mpeg mysql ncurses nls odbc > oggvorbis opengl oss pam pda pdflib perl plotutils png ppds prelude > python qt quicktime > readline ruby 
samba sasl sdl slang slp spell sse ssl svga tcltk tcpd > tetex tiff > truetype unicode usb vhosts x86 xinerama xml2 xmms xv zeo zlib" > ======= > Thanks again for the thoughtful reply, Josh. -- -Kevin -- gentoo-dev@gentoo.org mailing list ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [gentoo-dev] Major MCE problem with SMP on Gentoo kernels 2004-05-12 11:24 ` Kevin @ 2004-05-12 11:48 ` Josh Glover 2004-05-12 12:14 ` Ciaran McCreesh 2004-05-12 13:58 ` Kevin 0 siblings, 2 replies; 41+ messages in thread From: Josh Glover @ 2004-05-12 11:48 UTC (permalink / raw To: Gentoo Dev [-- Attachment #1: Type: text/plain, Size: 3654 bytes --] Quoth Kevin (Wed 2004-05-12 07:24:07AM -0400): > On Tuesday 11 May 2004 22:42, Josh Glover wrote: > > > Quoth Kevin (Tue 2004-05-11 02:07:58PM -0400): > > > > > The machine is a Dell PowerEdge1600SC with a PERC-3/SC SCSI RAID > > > controller (using AMI megaraid2 driver) and a LSI Logic Corp > > > controller (using Fusion MPT base driver) for the SCSI DAT and with > > > dual 2.4GHz Xeon processors, each having a 512KB L2 Cache. > > > > Running Gentoo with a 2.6.5 SMP kernel on a Dell PowerEdge 400SC: > > : jmglov@jglover; uname -a > > > > Linux jglover 2.6.5-gentoo-r1 #1 SMP Fri Apr 30 17:37:18 EDT 2004 i686 > > Intel(R) Pentium(R) 4 CPU 2.40GHz GenuineIntel GNU/Linux > > > > : jmglov@jglover; cat /proc/cpuinfo > > > > processor : 0 [...] > > processor : 1 > > Have you turned off hyperthreading? Why is it that only two CPUs show up? > It looks like (flags include ht) the CPUs support hyperthreading... or am > I way off base in drawing that conclusion here? No, the only part where you went off base was in assuming that I have more than one *physical* CPU in the box. I have one, and with hyperthreading turned on, it looks like two to the kernel. > Well, last night I did the most exhaustive set of tests available on the > Dell Utility Partition and found zero errors. I did 16 loops of the MATS > test and 15 of TestY, TestA, and ECC coupling. The memory tests ran for > about 8 hours. Again, zero errors. [...] > > I recommend doing at > > least one Stage 1 install for newcomers to Gentoo, just for educational > > purposes, but after that, go Stage 3 and use as many binary packages as > > you can! 
There exists a Stage 3 tarball for your architecture--the > > Pentium4 one, so why not use that, just to make sure your base system > > is solid? > > That probably is a good thing for me to try next, and I really would like > to be able to use all the advantages of Gentoo on this server. I'm fed > up with rpm hell and the traditional distros drawbacks. Yes, I really think that if you use a Stage 3, you can feel pretty confident about your C compiler and libraries being stable. Just be careful when you compile your kernel! > > > Are there some compiler flags or other configurable settings that, if > > > set to certain values during the bootstrap.sh or emerge system steps, > > > could end up generating software (perhaps when I built my own gcc?) > > > that would cause these MCEs to be thrown? > > > > I dunno, why don't you post your CFLAGS and MAKEOPTS from your > > make.conf here? > > Did already :) Lost them in the spew, sorry. > Here they are again: > > ACCEPT_KEYWORDS="x86" This is unnecessary. You only need to use ACCEPT_KEYWORDS with the unstable keywords, ~x86 in your case. > CFLAGS="-O3 -march=pentium4 -pipe -fomit-frame-pointer" Might want to back off to -O2, at least when you compile the kernel if nothing else. I believe the handbook recommends -O2, and optimising too highly can lead to some pretty bizarre problems. > CXXFLAGS="-O3 -march=pentium4 I am not sure if you are setting this, or if this is just emerge, but you usually want to set CXXFLAGS="${CFLAGS}". > FEATURES="autoaddcvs ccache distcc sandbox" autoaddcvs is a feature only for developers, you should not turn it on. > MAKEOPTS="-j3" You have four CPUs, so set this to -j5. > Thanks again for the thoughtful reply, Josh. Hey, I live to serve. 
:) -- Josh Glover GPG keyID 0xDE8A3103 (C3E4 FA9E 1E07 BBDB 6D8B 07AB 2BF1 67A1 DE8A 3103) gpg --keyserver pgp.mit.edu --recv-keys DE8A3103 [-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 41+ messages in thread
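The physical-versus-logical CPU distinction in Josh's reply can be checked against /proc/cpuinfo directly. What follows is only a sketch: it assumes the x86 field names visible in the output quoted earlier in the thread (`processor`, `physical id`), and the function name `count_cpus` is made up for illustration.

```shell
#!/bin/sh
# Count logical processors and distinct physical packages in cpuinfo-style text.
# Reads the file given as $1, or /proc/cpuinfo when no argument is passed.
# Assumption: the kernel exposes "processor" and "physical id" lines, as in the
# x86 output quoted in this thread; other architectures may not.
count_cpus() {
    awk -F: '
        /^processor[ \t]*:/   { logical++ }                     # one line per logical CPU
        /^physical id[ \t]*:/ { gsub(/[ \t]/, "", $2); seen[$2] = 1 }  # remember each package id
        END {
            physical = 0
            for (id in seen) physical++
            printf "logical=%d physical=%d\n", logical, physical
        }
    ' "${1:-/proc/cpuinfo}"
}
```

Run against output like Josh's (two `processor` entries sharing `physical id : 0`), this reports two logical CPUs on one physical package. A dual Xeon box with hyperthreading enabled would be expected to show four logical CPUs on two packages.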
* Re: [gentoo-dev] Major MCE problem with SMP on Gentoo kernels 2004-05-12 11:48 ` Josh Glover @ 2004-05-12 12:14 ` Ciaran McCreesh 2004-05-12 13:58 ` Kevin 1 sibling, 0 replies; 41+ messages in thread From: Ciaran McCreesh @ 2004-05-12 12:14 UTC (permalink / raw To: gentoo-dev [-- Attachment #1: Type: text/plain, Size: 737 bytes --] On Wed, 12 May 2004 07:48:22 -0400 Josh Glover <jmglov@gentoo.org> wrote: | > CFLAGS="-O3 -march=pentium4 -pipe -fomit-frame-pointer" | | Might want to back off to -O2, at least when you compile the kernel if | nothing else. I believe the handbook recommends -O2, and optimising | too highly can lead to some pretty bizarre problems. Kernel compiles don't use CFLAGS from make.conf. Unless you edit the kernel makefiles yourself, a sane set will be used. If you *do* edit the makefiles, don't expect anyone to help when things go horribly wrong. -- Ciaran McCreesh, Gentoo XMLcracy Member G03X276 (Sparc, MIPS, Vim, si hoc legere scis nimium eruditionis habes) Mail: ciaranm at gentoo.org Web: http://dev.gentoo.org/~ciaranm [-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [gentoo-dev] Major MCE problem with SMP on Gentoo kernels 2004-05-12 11:48 ` Josh Glover 2004-05-12 12:14 ` Ciaran McCreesh @ 2004-05-12 13:58 ` Kevin 2004-05-12 14:44 ` Chris Gianelloni ` (2 more replies) 1 sibling, 3 replies; 41+ messages in thread From: Kevin @ 2004-05-12 13:58 UTC (permalink / raw To: Gentoo Dev On Wednesday 12 May 2004 07:48, Josh Glover wrote: > Quoth Kevin (Wed 2004-05-12 07:24:07AM -0400): > > > > Have you turned off hyperthreading? Why is it that only two CPUs > > show up? It looks like (flags include ht) the CPUs support > > hyperthreading... or am I way off base in drawing that conclusion > > here? > > No, the only part where you went off base was in assuming that I have > more than one *physical* CPU in the box. I have one, and with > hyperthreading turned on, it looks like two to the kernel. Oh.... Well, that seems like an important difference between yours and my arrangements. Are you running Gentoo on a box with more than one physical CPU? I had no problems running Gentoo on this box until after installing a second CPU. That's when the weirdness started. Is anyone here running Gentoo on a dual-physical CPU machine? What compiler flags are you using? > > > I dunno, why don't you post your CFLAGS and MAKEOPTS from your > > > make.conf here? > > > > Did already :) > > Lost them in the spew, sorry. > > > Here they are again: > > > > ACCEPT_KEYWORDS="x86" > > This is unnecessary. You only need to use ACCEPT_KEYWORDS with the > unstable keywords, ~x86 in your case. Ok. Thanks. > > > CFLAGS="-O3 -march=pentium4 -pipe -fomit-frame-pointer" > > Might want to back off to -O2, at least when you compile the kernel if > nothing else. I believe the handbook recommends -O2, and optimising too > highly can lead to some pretty bizarre problems. K. What CFLAGS are you using? 
(on both your single physical CPU box you described above and with any multiple physical CPU boxes) > > > CXXFLAGS="-O3 -march=pentium4 > > I am not sure if you are setting this, or if this is just emerge, but > you usually want to set CXXFLAGS="${CFLAGS}". I'll hafta look at that (well, I can't anymore). I thought I did have CXXFLAGS="${CFLAGS}". > > > FEATURES="autoaddcvs ccache distcc sandbox" Huh. That's odd. When I installed distcc, I changed the FEATURES as indicated in the docs to list distcc, but I didn't add any of those others. Wonder where they came from... > > autoaddcvs is a feature only for developers, you should not turn it on. K. > > > MAKEOPTS="-j3" > > You have four CPUs, so set this to -j5. I was thinking "number of physical CPUs + 1" here, but ok. > > > Thanks again for the thoughtful reply, Josh. > > Hey, I live to serve. :) :) Thanks. -- -Kevin
* Re: [gentoo-dev] Major MCE problem with SMP on Gentoo kernels 2004-05-12 13:58 ` Kevin @ 2004-05-12 14:44 ` Chris Gianelloni 2004-05-12 15:17 ` tom_gall [not found] ` <40A23987.9080104@gentoo.org> 2 siblings, 0 replies; 41+ messages in thread From: Chris Gianelloni @ 2004-05-12 14:44 UTC (permalink / raw To: Kevin; +Cc: Gentoo Dev On Wed, 2004-05-12 at 09:58, Kevin wrote: > Oh.... Well, that seems like an important difference between yours and my > arrangements. Are you running Gentoo on a box with more than one > physical CPU? I had no problems running Gentoo on this box until after > installing a second CPU. That's when the weirdness started. > > Is anyone here running Gentoo on a dual-physical CPU machine? What > compiler flags are you using? I am running Gentoo on several dual-CPU machines, and even one quad-CPU machine, with no troubles at all. CFLAGS="-march=pentium4 -O2 -pipe -fomit-frame-pointer" > > > FEATURES="autoaddcvs ccache distcc sandbox" > > Huh. That's odd. When I installed distcc, I changed the FEATURES as > indicated in the docs to list distcc, but I didn't add any of those > others. Wonder where they came from... They came from the defaults, and they stay enabled unless you set -autoaddcvs, etc. in your FEATURES. There's no need to try to remove autoaddcvs, since it just won't work. > I was thinking "number of physical CPUs + 1" here, but ok. You would probably get a speed boost going with -j5 rather than -j3, but I would benchmark it myself before trusting anything from an external source. -- Chris Gianelloni Developer, Gentoo Linux Games Team Is your power animal a penguin?
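Pulling together the make.conf suggestions made so far in the thread, the relevant lines might look roughly like the fragment below. This is a sketch rather than a recommended configuration: the CFLAGS are the ones Chris quotes, the CXXFLAGS line follows Josh's advice, MAKEOPTS assumes Kevin's four logical CPUs, and FEATURES lists only the one entry Kevin says he added himself.

```shell
# /etc/make.conf fragment (a sketch assembled from the advice in this thread)

# Chris's flags: -O2 rather than -O3, still tuned for the Pentium 4 / Xeon.
CFLAGS="-march=pentium4 -O2 -pipe -fomit-frame-pointer"

# Keep the C++ flags identical to the C flags, as Josh suggests.
CXXFLAGS="${CFLAGS}"

# Four logical CPUs (two Xeons with hyperthreading) plus one.
# Assumption: benchmark -j3 against -j5 before settling, as Chris advises.
MAKEOPTS="-j5"

# Only the feature Kevin enabled by hand; the rest come from Portage defaults.
FEATURES="distcc"
```

As Ciaran points out, none of this affects how the kernel itself is compiled, so a change here only rules out problems in userland packages.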
* Re: [gentoo-dev] Major MCE problem with SMP on Gentoo kernels 2004-05-12 13:58 ` Kevin 2004-05-12 14:44 ` Chris Gianelloni @ 2004-05-12 15:17 ` tom_gall 2004-05-13 11:06 ` Kevin [not found] ` <40A23987.9080104@gentoo.org> 2 siblings, 1 reply; 41+ messages in thread From: tom_gall @ 2004-05-12 15:17 UTC (permalink / raw To: Kevin; +Cc: Gentoo Dev -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Greetings, Just to give this another perspective. On Wednesday, May 12, 2004, at 08:58 AM, Kevin wrote: > On Wednesday 12 May 2004 07:48, Josh Glover wrote: >> Quoth Kevin (Wed 2004-05-12 07:24:07AM -0400): >>> >>> Have you turned off hyperthreading? Why is it that only two CPUs >>> show up? It looks like (flags include ht) the CPUs support >>> hyperthreading... or am I way off base in drawing that conclusion >>> here? >> >> No, the only part where you went off base was in assuming that I have >> more than one *physical* CPU in the box. I have one, and with >> hyperthreading turned on, it looks like two to the kernel. > > Oh.... Well, that seems like an important difference between yours > and my > arrangements. Are you running Gentoo on a box with more than one > physical CPU? I had no problems running Gentoo on this box until after > installing a second CPU. That's when the weirdness started. > > Is anyone here running Gentoo on a dual-physical CPU machine? What > compiler flags are you using? All my ppc64 hardware is SMP, and runs gentoo just fine. I suspect you have some specific intel-ish problem. Could be BIOS or any other variety of problems. >>> CFLAGS="-O3 -march=pentium4 -pipe -fomit-frame-pointer" >> >> Might want to back off to -O2, at least when you compile the kernel if >> nothing else. I believe the handbook recommends -O2, and optimising >> too >> highly can lead to some pretty bizarre problems. > > K. What CFLAGS are you using? 
(on both your single physical CPU box > you > described above and with any multiple physical CPU boxes) CFLAGS="-O3 -mtune=power3" or CFLAGS="-O3 -mtune=g5", or just -mcpu=powerpc64 >>> MAKEOPTS="-j3" >> >> You have four CPUs, so set this to -j5. > > I was thinking "number of physical CPUs + 1" here, but ok. I generally set it to number of physical CPUs x 2. For HMT systems (which include certain ppc64 boxes as well) number of physical CPUs x 4 is fine. Regards, Tom Tom Gall gentoo-ppc64 lead -- God started with stage 1, shouldn't you? tgall aatt gentoo.org tgall aatt uberh4x0r.org tom_gall aatt mac.com -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.3 (Darwin) iD4DBQFAokAGNM6ZoaBWhQkRAhkFAJdpTF240+JymZBXCkgBuXmvQntBAJ9ovulG LLYj/Xm4J2rSNQqQ/h1VdQ== =JRtW -----END PGP SIGNATURE-----
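Three different rules of thumb for MAKEOPTS have now come up: Kevin's "physical CPUs + 1", Josh's logical CPUs + 1, and Tom's "physical CPUs x 2". A small sketch makes the difference concrete; the function name `jobs_for` and its argument order are invented for illustration and are not part of Portage.

```shell
#!/bin/sh
# Print the -j value each rule of thumb from the thread would suggest.
# $1 = number of physical CPUs, $2 = number of logical CPUs.
# (Hypothetical helper; none of these rules is authoritative.)
jobs_for() {
    physical=$1
    logical=$2
    echo "physical+1=$((physical + 1)) logical+1=$((logical + 1)) physical*2=$((physical * 2))"
}

# Kevin's box: two physical Xeons, four logical CPUs with hyperthreading on.
jobs_for 2 4    # prints: physical+1=3 logical+1=5 physical*2=4
```

For Kevin's machine the three rules give -j3, -j5 and -j4 respectively, which is why benchmarking is the only way to pick between them.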
* Re: [gentoo-dev] Major MCE problem with SMP on Gentoo kernels 2004-05-12 15:17 ` tom_gall @ 2004-05-13 11:06 ` Kevin 2004-05-13 11:12 ` Senor Rodgman ` (3 more replies) 0 siblings, 4 replies; 41+ messages in thread From: Kevin @ 2004-05-13 11:06 UTC (permalink / raw To: Gentoo Dev On Wednesday 12 May 2004 11:17, tom_gall@mac.com wrote: > Greetings, > > Just to give this another perspective. > [...] Thanks for your reply, Tom. At least I know it should be doable. To all who've commented on this thread, thanks again. I've now tried a stage 3 installation booting the 2.6.1 SMP kernel from a 2004.0 LiveCD (the SMP configs on 2004.1 LiveCDs are all broken---see bug #49382). I had no lockup problems while running that kernel, but after rebooting with my kernel (gentoo-sources, built with Chris's CFLAGS="-march=pentium4 -O2 -pipe -fomit-frame-pointer") I've already suffered two lockups. I set MAKEOPTS="-j1" for safety. Something very weird here. Next, I'm going to try booting from the CD and chrooting into my system and then doing more extensive testing of the kernel on the CD, but I'm really running out of options here. I'll probably also try building another kernel with CFLAGS="-march=pentium3 -O2 -pipe". Any other suggestions? Does anyone think that my two CPUs having different stepping levels could have anything to do with this problem? One is stepping 7 and the other 9. Greg KH thinks it's bad memory, but I'm skeptical of that because the main address that fails (some 30 times in a row) is at 1023.8MB and the Dell Utilities only test up to 1022MB, and because I haven't seen the problem with the LiveCD kernel. -- -Kevin
* Re: [gentoo-dev] Major MCE problem with SMP on Gentoo kernels 2004-05-13 11:06 ` Kevin @ 2004-05-13 11:12 ` Senor Rodgman 2004-05-13 13:04 ` Chris Gianelloni ` (2 subsequent siblings) 3 siblings, 0 replies; 41+ messages in thread From: Senor Rodgman @ 2004-05-13 11:12 UTC (permalink / raw To: Kevin; +Cc: Gentoo Dev On Thu, 13 May 2004, Kevin wrote: > I had no lockup problems while running that kernel, but after rebooting > with my kernel (gentoo-sources, built with Chris's > CFLAGS="-march=pentium4 -O2 -pipe -fomit-frame-pointer") I've already > suffered two lockups. I set MAKEOPTS="-j1" for safety. > Greg KH thinks it's bad memory, but I'm skeptical of that because the main > address that fails (some 30 times in a row) is at 1023.8MB and the Dell > Utilities only test up to 1022MB, and because I haven't seen the problem > with the liveCD kernel. I had some similar problems recently (on a dual Athlon), where it was running oldish kernels (2.5.4 I think) OK, but 2.6.5 and later wouldn't boot. Memtest reported bad memory; booting the new kernels with a suitable mem= parameter confirmed this (they then booted fine, and continue to be fine with replacement memory). So I recommend checking the memory. dave
* Re: [gentoo-dev] Major MCE problem with SMP on Gentoo kernels 2004-05-13 11:06 ` Kevin 2004-05-13 11:12 ` Senor Rodgman @ 2004-05-13 13:04 ` Chris Gianelloni 2004-05-13 15:04 ` Daniel Drake 2004-05-13 15:54 ` Greg KH 3 siblings, 0 replies; 41+ messages in thread From: Chris Gianelloni @ 2004-05-13 13:04 UTC (permalink / raw To: Kevin; +Cc: Gentoo Dev [-- Attachment #1: Type: text/plain, Size: 2336 bytes --] On Thu, 2004-05-13 at 07:06, Kevin wrote: > I've now tried a stage 3 installation booting the 2.6.1 SMP kernel from a > 2004.0 LiveCD (the SMP configs on 2004.1 LiveCDs are all broken---see bug > #49382). I know this isn't exactly what you're looking for, but I have a CD (actually, a GameCD beta) available at http://dev.gentoo.org/~wolf31o2/x86-ut2004demo-20040420.iso that you could grab. It has only one kernel, and it is SMP. It has booted and worked successfully on every machine I have tried it on, and even has X+fluxbox on it. > I had no lockup problems while running that kernel, but after rebooting > with my kernel (gentoo-sources, built with Chris's > CFLAGS="-march=pentium4 -O2 -pipe -fomit-frame-pointer") I've already > suffered two lockups. I set MAKEOPS="-j1" for safety. Copy the kernel from my CD and the /lib/modules/2.6.5-gentoo-r1 and see if that kernel works fine on your machine in your build environment. > Something very weird here. Next, I'm going to try booting from the cd and > chrooting into my system and then doing more extensive testing of the > kernel on the cd, but I'm really running out of options here. I'll > probably also try building another kernel with CFLAGS="-march=pentium3 > -O2 -pipe". Any other suggestions? Try my CD... it works in SMP. That will help test some of the problems, especially since the kernel you are "testing" with is not SMP, so you're not really testing anything. > Does anyone think that my two CPUs having different stepping levels could > have anything to do with this problem? One is level 7 and the other 9. 
It is possible that is causing the problem. You never really know. I *doubt* it should be a problem, unless one CPU is running out of spec. > Greg KH thinks it's bad memory, but I'm skeptical of that because the main > address that fails (some 30 times in a row) is at 1023.8MB and the Dell > Utilities only test up to 1022MB, and because I haven't seen the problem > with the liveCD kernel. It still could be bad memory. I think I would trust memtest86 before the Dell utilities. You could also try finding another bootable system checker. I'm sure there are plenty available. -- Chris Gianelloni Developer, Gentoo Linux Games Team Is your power animal a penguin? [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [gentoo-dev] Major MCE problem with SMP on Gentoo kernels 2004-05-13 11:06 ` Kevin 2004-05-13 11:12 ` Senor Rodgman 2004-05-13 13:04 ` Chris Gianelloni @ 2004-05-13 15:04 ` Daniel Drake 2004-05-13 15:54 ` Greg KH 3 siblings, 0 replies; 41+ messages in thread From: Daniel Drake @ 2004-05-13 15:04 UTC (permalink / raw To: Kevin; +Cc: Gentoo Dev Hi Kevin, Kevin wrote: > Greg KH thinks it's bad memory, but I'm skeptical of that because the main > address that fails (some 30 times in a row) is at 1023.8MB and the Dell > Utilities only test up to 1022MB, and because I haven't seen the problem > with the liveCD kernel. Although I've very rarely dealt with SMP systems, I've seen many unstable systems being diagnosed by various memory testing utilities as OK. As soon as you run memtest86, errors come up, and replacing the faulty memory amazingly brings system stability again. If your RAM is always producing errors in the same place (and only in one place) then you might want to google for BadMem/BadRAM. These are two flavours of kernel patches which allow you to ask the kernel to ignore specific blocks of memory. You can even get memtest86 to output the exact parameters you need based on the memory faults it finds. This should allow you to ignore the faulty part of the memory and continue on with the remaining ~1020MB or so. Daniel
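Short of patching the kernel with BadRAM, the mem= boot parameter mentioned earlier in the thread gives a cruder way to test the bad-memory theory when the failing address sits near the top of RAM, as Kevin's 1023.8MB address does on a 1GB box. The helper below is a hedged sketch: `mem_param` is a made-up name, and the 3MB safety margin is an arbitrary assumption chosen to match Daniel's "~1020MB" figure.

```shell
#!/bin/sh
# Build a mem= boot parameter that caps usable RAM below a failing address.
# $1 = failing address in whole MB, rounded down (1023 for 1023.8MB)
# $2 = safety margin in MB to stay clear of the bad region (assumed value)
# Note: mem= only limits the top of memory, so this does not help when the
# fault lies in the middle of RAM; that case is what BadRAM is for.
mem_param() {
    echo "mem=$(( $1 - $2 ))M"
}

mem_param 1023 3    # prints: mem=1020M
```

The resulting string would be appended to the kernel line in the bootloader configuration. If the lockups stop with the cap in place, that points at the top of memory; if they continue, the fault is more likely elsewhere.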
* Re: [gentoo-dev] Major MCE problem with SMP on Gentoo kernels 2004-05-13 11:06 ` Kevin ` (2 preceding siblings ...) 2004-05-13 15:04 ` Daniel Drake @ 2004-05-13 15:54 ` Greg KH 2004-05-18 8:29 ` Kevin 3 siblings, 1 reply; 41+ messages in thread From: Greg KH @ 2004-05-13 15:54 UTC (permalink / raw To: Kevin; +Cc: Gentoo Dev On Thu, May 13, 2004 at 07:06:12AM -0400, Kevin wrote: > > Greg KH thinks it's bad memory, It's not only me, it's memtest86 saying it :) > but I'm skeptical of that because the main address that fails (some 30 > times in a row) is at 1023.8MB and the Dell Utilities only test up to > 1022MB, and because I haven't seen the problem with the liveCD kernel. Maybe that's the fault of the Dell utilities. Seriously, I trust memtest86 over any other vendor specific test. If you don't want to believe it, that's fine, but I would really consider fixing that issue before trying to point the finger at the kernel or the Gentoo install. thanks, greg k-h -- gentoo-dev@gentoo.org mailing list ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [gentoo-dev] Major MCE problem with SMP on Gentoo kernels 2004-05-13 15:54 ` Greg KH @ 2004-05-18 8:29 ` Kevin 2004-05-18 10:59 ` Alexander Futasz ` (2 more replies) 0 siblings, 3 replies; 41+ messages in thread From: Kevin @ 2004-05-18 8:29 UTC (permalink / raw To: gentoo-dev Again, thanks to all who have commented on this thread. I've now done some more testing and have some other interesting (though also confusing) results to report. On Thursday 13 May 2004 11:54, Greg KH wrote: > On Thu, May 13, 2004 at 07:06:12AM -0400, Kevin wrote: > > Greg KH thinks it's bad memory, > > It's not only me, it's memtest86 saying it :) True. Although it is locking up after only 1-2 minutes of operation. What conclusion should I draw from that? > > > but I'm skeptical of that because the main address that fails (some > > 30 times in a row) is at 1023.8MB and the Dell Utilities only test up > > to 1022MB, and because I haven't seen the problem with the liveCD > > kernel. > > Maybe that's the fault of the Dell utilities. Seriously, I trust > memtest86 over any other vendor specific test. If you don't want to > believe it, that's fine, but I would really consider fixing that issue > before trying to point the finger at the kernel or the Gentoo install. You're right, Greg. I finally took your advice and did some serious testing with the DIMM sticks. This box has 4 slots, DIMMA-DIMMD, and here's what I've done: 1) swapped one 512MB stick for the other in DIMMA/DIMMB (reversed their positions) 2) removed one 512MB stick from DIMMB (configs require filling from DIMMA up) 3) removed the other 512MB stick (so that now I've tried each stick in DIMMA all by itself, and no sticks in any of the other slots) 4) completely replaced each 512MB stick with new ones from Dell and did all of 1-3 above with the new sticks. In every case, memtest86 v3.0, memtest86 v3.1a, memtest86+ v1.0 all behave very similarly. 
That is, they show 1023.8MB (or 511.8MB if only one stick is installed) as repeatedly failing (some 30 or 40 times), and then they either:

(a) show 304.5MB failing and three more failed tests of 1023.8MB (or 511.8MB), and then the program locks up; or
(b) show three failed tests at 64.0MB, then three more at 1023.8MB (or 511.8MB), then one more failed test at 64.0MB, then one more at 0.6MB, then one more at 1023.8MB (or 511.8MB), and then the program locks up.

Since I had the extra sticks, I also tried testing with all 4 slots filled and got very similar results to those described above, except the repeatedly failing address was 2047.8MB (in all cases, 512MB, 1024MB, and 2048MB, the repeatedly failing address is 0.2MB below the max). There are no intermittent failing addresses---there are two very specific patterns to the failures, and the program always locks up after following one pattern or the other.

In all of the memory configurations I tried, the Dell utilities reported no memory errors (or any other hardware errors).

Although I'm sure there are others here with more experience troubleshooting such problems, I'm thinking that the above is enough to base a pretty sound conclusion upon, and the conclusion I would draw is that hardware and memory are not the cause of these MCE problems. I welcome anyone contradicting that conclusion because I've never seen anything like this before and I'm at a loss on how to resolve it. I'm tempted to try replacing one of the CPUs to see if identical stepping levels (my CPU0 is stepping level 7 and CPU1 is level 9, but they are otherwise identical) will resolve the problem.

I also tried getting memtest86 (and variants) to let me turn on the ECC portion of the tests, to no avail. When I tried the memory sizing options, the "probe" setting returned 1024MB, the "use BIOS std" setting returned 1024MB, and the "use BIOS all" setting locked the program up.
I also tried something else that had an enormous positive effect on the situation---I changed -march=pentium4 to -march=pentium3 in my CFLAGS and built another kernel with identical .config settings. With that kernel running, I did some 2-4 hours of solid compiling work, emerging and re-emerging packages like mysql, cyrus-sasl, cyrus-imapd, mit-krb5, openafs, etc. But unfortunately, this kernel also ended up freezing after doing more of the same, and it did so with the same error message MCE 0000000000000004. I tried using parsemce.c from http://www.codemonkey.org.uk/cruft/parsemce.c/. I built it and ran it, but it wasn't very helpful and I'm not quite sure what I'm supposed to do with it. Chris, I'm going to try your kernel. Thanks for offering that. I'll relate whatever I learn from that test. Again, I really appreciate all the thoughtful replies on what to try next to resolve this problem. If there are any others, or if anyone has suggestions on what to try next, I'd love to hear them. Perhaps I could send my .config file to someone and they could try cross-compiling a kernel for me to try running? Thanks again. -- -Kevin -- gentoo-dev@gentoo.org mailing list ^ permalink raw reply [flat|nested] 41+ messages in thread
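Kevin's observation earlier in this message, that the repeatedly failing address sits 0.2MB below the installed total in every configuration, can be checked with a little arithmetic. The following is only a minimal Python sketch using the figures reported in the thread; the names `OBSERVATIONS` and `offsets_from_top` are made up for illustration.

```python
# (installed memory in MB, repeatedly failing address in MB), as reported in the thread.
OBSERVATIONS = [(512, 511.8), (1024, 1023.8), (2048, 2047.8)]


def offsets_from_top(observations):
    """Return how far below the installed total each failing address lies, in MB."""
    return [round(installed - failing, 1) for installed, failing in observations]


offsets = offsets_from_top(OBSERVATIONS)

# A single distinct offset means the failure follows the top of installed memory
# rather than staying at one fixed address as sticks are added or removed.
tracks_top_of_memory = len(set(offsets)) == 1

print(offsets, tracks_top_of_memory)
```

A failure that follows the top of installed memory, rather than a particular stick, is at least consistent with the tester mis-sizing memory, a possibility Kevin raises elsewhere in the thread. It does not by itself rule out a fault on the board.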
* Re: [gentoo-dev] Major MCE problem with SMP on Gentoo kernels 2004-05-18 8:29 ` Kevin @ 2004-05-18 10:59 ` Alexander Futasz 2004-05-18 12:02 ` Josh Glover 2004-05-18 12:46 ` [gentoo-dev] " Daniel Drake 2 siblings, 0 replies; 41+ messages in thread From: Alexander Futasz @ 2004-05-18 10:59 UTC (permalink / raw To: gentoo-dev On Tue, 18 May 2004 04:29:58 -0400, Kevin wrote: > Again, thanks to all who have commented on this thread. I've now done > some more testing and have some other interesting (though also > confusing) results to report. > > On Thursday 13 May 2004 11:54, Greg KH wrote: > > On Thu, May 13, 2004 at 07:06:12AM -0400, Kevin wrote: > > > Greg KH thinks it's bad memory, > > > > It's not only me, it's memtest86 saying it :) [...] > You're right, Greg. I finally took your advice and did some serious > testing with the DIMM sticks. [...] > In every case, memtest86 v3.0, memtest86 v3.1a, memtest86+ v1.0 all > behave very similarly. That is [...] the program locks up. > > In all of the memory configurations I tried, the Dell utilities > reported no memory errors (or any other hardware errors). I think you missed this one reply to your posts: http://article.gmane.org/gmane.linux.gentoo.devel/17942 -- gentoo-dev@gentoo.org mailing list ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [gentoo-dev] Major MCE problem with SMP on Gentoo kernels 2004-05-18 8:29 ` Kevin 2004-05-18 10:59 ` Alexander Futasz @ 2004-05-18 12:02 ` Josh Glover 2004-05-19 17:48 ` Kevin 2004-05-18 12:46 ` [gentoo-dev] " Daniel Drake 2 siblings, 1 reply; 41+ messages in thread From: Josh Glover @ 2004-05-18 12:02 UTC (permalink / raw To: gentoo-dev [-- Attachment #1: Type: text/plain, Size: 1683 bytes --] Quoth Kevin (Tue 2004-05-18 04:29:58AM -0400): > On Thursday 13 May 2004 11:54, Greg KH wrote: > > > On Thu, May 13, 2004 at 07:06:12AM -0400, Kevin wrote: > > > > > Greg KH thinks it's bad memory, > > > > It's not only me, it's memtest86 saying it :) > > True. Although it is locking up after only 1-2 minutes of operation. > What conclusion should I draw from that? Bad system board. :( > Although I'm sure there are others here with more experience > troubleshooting such problems, I'm thinking that the above is enough to > base a pretty sound conclusion upon, and the conclusion I would draw is > that hardware and memory are not the cause of these MCE problems. Wrong. memtest86 giving you errors almost always indicates a hardware problem. You have changed the memory, but what remained consistent? The memory bus! Try a new system board. > I also tried something else that had an enormous positive effect on the > situation---I changed -march=pentium4 to -march=pentium3 in my CFLAGS All you have done is turn off SSE2 instructions and possibly a few others that the P4s have and the P3s do not. If something is wrong with your system board or CPU, less stress on the CPU is likely not to show problems as often. You have bad hardware, Kevin. Try the compile test with one CPU at a time (i.e. take one out), and if that is not illuminating, replace the system board. 
-- Josh Glover Gentoo Developer (http://dev.gentoo.org/~jmglov/) Tokyo Linux Users Group Listmaster (http://www.tlug.jp/) GPG keyID 0xDE8A3103 (C3E4 FA9E 1E07 BBDB 6D8B 07AB 2BF1 67A1 DE8A 3103) gpg --keyserver pgp.mit.edu --recv-keys DE8A3103 [-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [gentoo-dev] Major MCE problem with SMP on Gentoo kernels 2004-05-18 12:02 ` Josh Glover @ 2004-05-19 17:48 ` Kevin 2004-05-20 12:19 ` [gentoo-dev] SOLVED: " Kevin 2004-05-20 21:16 ` Kevin 0 siblings, 2 replies; 41+ messages in thread
From: Kevin @ 2004-05-19 17:48 UTC
To: gentoo-dev

Thanks again for the replies, folks.

Well, I've now replaced the system motherboard, the CPU (first tried removing one CPU and memtest86 behaved the exact same way, then replaced the CPU with the new one), and the RAM. Results: memtest86 and friends all behave the exact same way.

Could this still be a hardware problem? I'm hard-pressed to believe that I have two different motherboards that just happen to suffer from the same flaw (they are not even the same exact version: one is version B2 and the other is version C4). The only things that are common between the system now and the system before are:

(1) the SCSI controller card (RAID card) (another SCSI controller was replaced with the m/b),
(2) 2 SCSI hard drives connected to the RAID card,
(3) a PCI hardware controller based modem, and
(4) the SCSI hot-plug backplane.

Could one of these be causing the problem? I haven't tried reproducing my MCE 0004 error again, but memtest86 shows no difference. Can anyone buy into the notion now that memtest86 is doing something that it shouldn't be doing when testing this system? Again, the Dell Utilities are all turning up flawless.

I've set the configuration in memtest86 to limit the address range it tests to those addresses below 1022MB of RAM (this is what the Dell utilities test with 1024MB RAM installed), but it ignores those limits and tests up to 1024MB anyway, and that's where it's still finding its errors (1023.8MB). I've configured memtest86 to turn on ECC testing and it refuses to do so (when I press (8) to restart the tests, the setting returns to off). What's going on here? Any thoughts are most welcome.
I'll be trying to reproduce my MCE error with this new hardware, and I'll post results when I have them. Thanks again for all the replies. On Tuesday 18 May 2004 08:02, Josh Glover wrote: > Quoth Kevin (Tue 2004-05-18 04:29:58AM -0400): [...] > > True. Although it is locking up after only 1-2 minutes of operation. > > What conclusion should I draw from that? > > Bad system board. :( I just replaced it. Still does the same thing. > > > Although I'm sure there are others here with more experience > > troubleshooting such problems, I'm thinking that the above is enough > > to base a pretty sound conclusion upon, and the conclusion I would > > draw is that hardware and memory are not the cause of these MCE > > problems. > > Wrong. memtest86 giving you errors almost always indicates a hardware > problem. You have changed the memory, but what remained consistent? The > memory bus! Try a new system board. New system board includes a new memory bus. Still get the same results. > > > I also tried something else that had an enormous positive effect on > > the situation---I changed -march=pentium4 to -march=pentium3 in my > > CFLAGS > > All you have done is turn off SSE2 instructions and possibly a few > others that the P4s have and the P3s do not. If something is wrong with > your system board or CPU, less stress on the CPU is likely not to show > problems as often. That's a good point. I'll try reproducing the MCE now with the new hardware. > > You have bad hardware, Kevin. Try the compile test with one CPU at a > time (i.e. take one out), and if that is not illuminating, replace the > system board. Thanks again gents! -Kevin -- gentoo-dev@gentoo.org mailing list ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [gentoo-dev] SOLVED: Major MCE problem with SMP on Gentoo kernels 2004-05-19 17:48 ` Kevin @ 2004-05-20 12:19 ` Kevin 2004-05-20 21:16 ` Kevin 1 sibling, 0 replies; 41+ messages in thread
From: Kevin @ 2004-05-20 12:19 UTC
To: gentoo-dev

Hi All-

A final note to this thread.

After many hours of high-intensity CPU activity (like emerging many packages---which is what used to cause the MCE), since replacing my stepping level 7 Xeon with a stepping level 9 Xeon (so that I now have two identical CPUs, even in stepping levels, whereas this was not true before), I have been unable to reproduce my MCE 0004 error. I even did this with the kernel compiled with -march=pentium4 CFLAGS (which caused an MCE after 5 or 10 minutes of emerging mysql with stepping level 7 and stepping level 9 CPUs installed).

Naturally, I'm delighted by this; however, the whole experience has been somewhat confusing (although enlightening in many respects). Memtest86 still behaves exactly as it did before the hardware replacement. Any thoughts on why it behaves this way with this hardware (unable to set address range limits, unable to force ECC testing on, program locks after 2 minutes of operation)? I suppose that's a question for another thread in another forum.

I seem to have suffered from no hardware failures on the M/B, the CPUs (one of the old CPUs is still present---I replaced the other), or the RAM (although I suppose the stepping level 7 Xeon might have had some incredibly subtle flaw that only showed up with another CPU present). The replacement hardware seems to suffer no problems at all, in spite of what Memtest86 does (fails at 1023.8MB 30 or 40 times and then freezes).

I really appreciate all of the suggestions here. You guys convinced me that it was hardware, which is why I replaced everything, and that ultimately solved the problem, although it's not clear that there was really a hardware problem.
The lesson I've learned (though I'm not sure this is really the root issue) is that when doing multi-processor computing, make sure that both processors are identical in every way. Any thoughts on the accuracy of this rule?

But the bizarre thing is that I couldn't reproduce this MCE at all using another distribution on the same (pre-replacement) hardware. Does Gentoo push the hardware much harder than other distros? Perhaps because I'm compiling the code for my particular hardware rather than running code that was built to run on many different sets of hardware (less aggressive CFLAGS et al.)? I'm at a loss to explain this.

Again, many thanks for all the help here.

-Kevin
* Re: [gentoo-dev] SOLVED: Major MCE problem with SMP on Gentoo kernels 2004-05-19 17:48 ` Kevin 2004-05-20 12:19 ` [gentoo-dev] SOLVED: " Kevin @ 2004-05-20 21:16 ` Kevin 2004-05-20 21:32 ` Greg KH ` (2 more replies) 1 sibling, 3 replies; 41+ messages in thread
From: Kevin @ 2004-05-20 21:16 UTC
To: gentoo-dev

Hi All-

A final note to this thread.

After many hours of high-intensity CPU activity (like emerging many packages---which is what used to cause the MCE), since replacing my stepping level 7 Xeon with a stepping level 9 Xeon (so that I now have two identical CPUs, even in stepping levels, whereas this was not true before), I have been unable to reproduce my MCE 0004 error. I even did this with the kernel compiled with -march=pentium4 CFLAGS (which caused an MCE after 5 or 10 minutes of emerging mysql with stepping level 7 and stepping level 9 CPUs installed).

Naturally, I'm delighted by this; however, the whole experience has been somewhat confusing (although enlightening in many respects). Memtest86 still behaves exactly as it did before the hardware replacement. Any thoughts on why it behaves this way with this hardware (unable to set address range limits, unable to force ECC testing on, program locks after 2 minutes of operation)? I suppose that's a question for another thread in another forum.

I seem to have suffered from no hardware failures on the M/B, the CPUs (one of the old CPUs is still present---I replaced the other), or the RAM (although I suppose the stepping level 7 Xeon might have had some incredibly subtle flaw that only showed up with another CPU present). The replacement hardware seems to suffer no problems at all, in spite of what Memtest86 does (fails at 1023.8MB 30 or 40 times and then freezes).

I really appreciate all of the suggestions here.
You guys convinced me that it was hardware, which is why I replaced everything, and that ultimately solved the problem, although it's not clear that there was really a hardware problem. The lesson I've learned (though I'm not sure this is really the root issue) is that when doing multi-processor computing, make sure that both processors are identical in every way. Any thoughts on the accuracy of this rule?

But the bizarre thing is that I couldn't reproduce this MCE at all using another distribution on the same (pre-replacement) hardware. Does Gentoo push the hardware much harder than other distros? Perhaps because I'm compiling the code for my particular hardware rather than running code that was built to run on many different sets of hardware (less aggressive CFLAGS et al.)? I'm at a loss to explain this.

Again, many thanks for all the help here.

-Kevin
* Re: [gentoo-dev] SOLVED: Major MCE problem with SMP on Gentoo kernels 2004-05-20 21:16 ` Kevin @ 2004-05-20 21:32 ` Greg KH 2004-05-20 23:08 ` Robin H. Johnson 2004-05-21 13:05 ` Chris Gianelloni 2 siblings, 0 replies; 41+ messages in thread From: Greg KH @ 2004-05-20 21:32 UTC (permalink / raw To: Kevin; +Cc: gentoo-dev On Thu, May 20, 2004 at 05:16:25PM -0400, Kevin wrote: > The lesson I've learned (though I'm not sure > this is really the root issue) is that when doing multi-processor > computing, make sure that both processors are identical in every way. > Any thoughts on the accuracy of this rule? That is a _very_ good rule to stick with, I know many problems go away if you follow it. You never mentioned that this was the case with your hardware, or I would have mentioned it earlier, sorry. Glad it's all working for you, and you can stick with Gentoo :) thanks, greg k-h -- gentoo-dev@gentoo.org mailing list ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [gentoo-dev] SOLVED: Major MCE problem with SMP on Gentoo kernels 2004-05-20 21:16 ` Kevin 2004-05-20 21:32 ` Greg KH @ 2004-05-20 23:08 ` Robin H. Johnson 2004-05-20 23:16 ` Hasse Hagen Johansen 2004-05-21 13:05 ` Chris Gianelloni 2 siblings, 1 reply; 41+ messages in thread
From: Robin H. Johnson @ 2004-05-20 23:08 UTC
To: Gentoo Developers

On Thu, May 20, 2004 at 05:16:25PM -0400, Kevin wrote:
> Hi All-
>
> A final note to this thread.
>
> After trying for many hours of high-intensity cpu activity (like emerging
> many packages---which is what used to cause the MCE), since replacing my
> stepping level 7 Xeon with a stepping level 9 Xeon (so that I now have
> two identical cpus, even in stepping levels (whereas this was not true
> before), I have been unable to reproduce my MCE 0004 error. I even did
> this with the kernel compiled with -march=pentium4 CFLAGS (that caused an
> MCE after 5 or 10 minutes of emerging mysql with stepping level 7 and
> stepping level 9 cpus installed).

If you'd pointed out your CPUs were different in the first place, that would have been the first thing to change. The term is SMP - _Symmetrical_ Multi-Processing - on purpose: the CPUs need to be identical. I'm surprised it even worked to the degree it did.

> Does Gentoo push the hardware much harder than other distros?

Yup, running with Gentoo optimizations is a LOT harder on machines.

> Perhaps because I'm compiling the code for my particular hardware vice
> running code that was built to run on many different sets of hardware
> (less aggressive CFLAGS et. al.)? I'm at a loss to explain this.

As an example of this, with GCC itself built with CFLAGS maxed out at '-O3 -march=pentium4 -fomit-frame-pointer', and using the same CFLAGS to compile MySQL, I can crash GCC with some frequency on certain hardware. Yet 100% identical hardware in an adjacent server compiles the same code fine.
Both machines are 1U intel servers (single 2.66ghz p4 xeon cpu, dual cpu board, 1gb ram, 3ware raid1 - 40gb), from the same batch (sequential serial numbers). It boils down to the fact that the hardware shipped out is good enough to withstand the burn-in tests, but the acceptance point of the burn-in tests is lower than the stress placed on the machine by Gentoo. -- Robin Hugh Johnson E-Mail : robbat2@orbis-terrarum.net Home Page : http://www.orbis-terrarum.net/?l=people.robbat2 ICQ# : 30269588 or 41961639 GnuPG FP : 11AC BA4F 4778 E3F6 E4ED F38E B27B 944E 3488 4E85 [-- Attachment #2: Type: application/pgp-signature, Size: 232 bytes --] ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [gentoo-dev] SOLVED: Major MCE problem with SMP on Gentoo kernels 2004-05-20 23:08 ` Robin H. Johnson @ 2004-05-20 23:16 ` Hasse Hagen Johansen 2004-05-21 2:46 ` Kevin 0 siblings, 1 reply; 41+ messages in thread From: Hasse Hagen Johansen @ 2004-05-20 23:16 UTC (permalink / raw To: gentoo-dev >>>>> "Robin" == Robin H Johnson <robbat2@gentoo.org> writes: Robin> If you'd pointed out your cpus were different in the first Robin> place, that would have been the first thing to change. the Robin> term is SMP - _Symmetrical_ Multi-Processing on purpose, Robin> the CPUs need to be identical. I'm surprised it even worked Robin> to the degree it did. He did point it out early on :-) /Hasse -- gentoo-dev@gentoo.org mailing list ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [gentoo-dev] SOLVED: Major MCE problem with SMP on Gentoo kernels 2004-05-20 23:16 ` Hasse Hagen Johansen @ 2004-05-21 2:46 ` Kevin 0 siblings, 0 replies; 41+ messages in thread From: Kevin @ 2004-05-21 2:46 UTC (permalink / raw To: gentoo-dev On Thursday 20 May 2004 19:16, Hasse Hagen Johansen wrote: > >>>>> "Robin" == Robin H Johnson <robbat2@gentoo.org> writes: > > Robin> If you'd pointed out your cpus were different in the first > Robin> place, that would have been the first thing to change. the > Robin> term is SMP - _Symmetrical_ Multi-Processing on purpose, > Robin> the CPUs need to be identical. I'm surprised it even worked > Robin> to the degree it did. > > He did point it out early on :-) Thanks, Hasse. Glad somebody noticed. :-) Greg/Robin, in my defense, I feel I must point out: It was in my first post (/proc/cpuinfo output), again here: On Thursday 13 May 2004 07:06, Kevin wrote: > Does anyone think that my two CPUs having different stepping levels > could have anything to do with this problem? One is level 7 and the > other 9. and again here: On Tuesday 18 May 2004 04:29, Kevin wrote: > anything like this before and I'm at a loss on how to resolve it. I'm > tempted to try replacing one of the CPUs to see if identical stepping > levels (my CPU0 is stepping level 7 and CPU1 is level 9, but they are > otherwise identical) will resolve the problem. Thanks again for all the help, folks, and I too am extremely delighted that I can stay with Gentoo. Over the past 24+ hours as I've been pushing this box to the limit with emerge this and emerge that, upgrading major packages at the snap of a finger, having two different versions of some packages installed in different slots, and building everything from source, it's easy to remember why I struggled so hard to stay with Gentoo. It really does represent a terrific improvement on the standard distros. -- -Kevin -- gentoo-dev@gentoo.org mailing list ^ permalink raw reply [flat|nested] 41+ messages in thread
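Since the stepping mismatch was visible in the /proc/cpuinfo output all along, a small check like the one below could flag it early. This is only a sketch, assuming the usual x86 layout of `processor` and `stepping` lines; the function names and the `SAMPLE` text are hypothetical and are not Kevin's actual output.

```python
def steppings(cpuinfo_text):
    """Map each logical processor number to its stepping, given /proc/cpuinfo text."""
    result = {}
    current = None
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator lines between processor blocks
        key, value = key.strip(), value.strip()
        if key == "processor":
            current = int(value)
        elif key == "stepping" and current is not None:
            result[current] = int(value)
    return result


def mismatched(cpuinfo_text):
    """Return True if the processors do not all report the same stepping."""
    return len(set(steppings(cpuinfo_text).values())) > 1


# Hypothetical sample mirroring the thread (stepping 7 and stepping 9).
SAMPLE = """processor : 0
model name : Intel(R) Xeon(TM) CPU 2.40GHz
stepping : 7

processor : 1
model name : Intel(R) Xeon(TM) CPU 2.40GHz
stepping : 9
"""

print(steppings(SAMPLE), mismatched(SAMPLE))
```

On a live system the same functions could be fed `open("/proc/cpuinfo").read()`. With hyperthreading enabled, logical processors on the same physical CPU would be expected to report the same stepping.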
* Re: [gentoo-dev] SOLVED: Major MCE problem with SMP on Gentoo kernels 2004-05-20 21:16 ` Kevin 2004-05-20 21:32 ` Greg KH 2004-05-20 23:08 ` Robin H. Johnson @ 2004-05-21 13:05 ` Chris Gianelloni 2 siblings, 0 replies; 41+ messages in thread From: Chris Gianelloni @ 2004-05-21 13:05 UTC (permalink / raw To: Kevin; +Cc: gentoo-dev [-- Attachment #1: Type: text/plain, Size: 1008 bytes --] On Thu, 2004-05-20 at 17:16, Kevin wrote: > But the bizarre thing is that I couldn't reproduce this MCE at all using > another distribution on the same (pre-replacement) hardware. Does Gentoo > push the hardware much harder than other distros? Perhaps because I'm > compiling the code for my particular hardware vice running code that was > built to run on many different sets of hardware (less aggressive CFLAGS > et. al.)? I'm at a loss to explain this. Simply... Yes. You are using, even at -march=pentium3, the MMX and SSE portions of the chip, which may not be used at all on another distribution (compiled -march=i586) at all. CFLAGS, as both the Gnome and KDE teams can attest, can make a world of difference on how things come out in the end. As for the memtest86 problem, who knows... ask the memtest86 guys. They'd probably be really interested in your findings. -- Chris Gianelloni Developer Games/LiveCD Teams Gentoo Linux Is your power animal a penguin? [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 41+ messages in thread
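To make the CFLAGS point concrete, here is a small Python sketch of which instruction-set extensions each `-march` value discussed in the thread permits the compiler to use. The table is deliberately partial (only MMX, SSE and SSE2 are listed, following the GCC manual's descriptions of these targets), and the names `MARCH_EXTENSIONS` and `extra_extensions` are made up for illustration.

```python
# Extensions that GCC may emit for a few x86 -march values.
# Partial on purpose: only the targets and extensions mentioned in this thread.
MARCH_EXTENSIONS = {
    "i586": set(),
    "pentium3": {"mmx", "sse"},
    "pentium4": {"mmx", "sse", "sse2"},
}


def extra_extensions(target, baseline):
    """Return the extensions enabled by `target` that `baseline` does not enable."""
    return sorted(MARCH_EXTENSIONS[target] - MARCH_EXTENSIONS[baseline])


# Dropping from pentium4 to pentium3 takes SSE2 out of play.
print(extra_extensions("pentium4", "pentium3"))
# A generic i586 build uses neither MMX nor SSE.
print(extra_extensions("pentium3", "i586"))
```

Enabling an extension only allows the compiler to use it; how much of a given package actually ends up using those instructions depends on the code and the optimization level.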
* Re: [gentoo-dev] Major MCE problem with SMP on Gentoo kernels 2004-05-18 8:29 ` Kevin 2004-05-18 10:59 ` Alexander Futasz 2004-05-18 12:02 ` Josh Glover @ 2004-05-18 12:46 ` Daniel Drake 2 siblings, 0 replies; 41+ messages in thread From: Daniel Drake @ 2004-05-18 12:46 UTC (permalink / raw To: Kevin; +Cc: gentoo-dev Hi, Kevin wrote: > Although I'm sure there are others here with more experience > troubleshooting such problems, I'm thinking that the above is enough to > base a pretty sound conclusion upon, and the conclusion I would draw is > that hardware and memory are not the cause of these MCE problems. I > welcome anyone contradicting that conclusion because I've never seen > anything like this before and I'm at a loss on how to resolve it. I'm > tempted to try replacing one of the CPUs to see if identical stepping > levels (my CPU0 is stepping level 7 and CPU1 is level 9, but they are > otherwise identical) will resolve the problem. I have seen similar behaviour in previous experience (with uniprocessor boards), i.e. all memory sticks (even confirmed good ones) bring up errors in the same place when plugged into the board in question. I've never attempted to look in detail for the cause of the problem, you'd suspect a faulty memory controller of some sort. In my experience, I've just replaced the board, and that has solved the problem. Daniel -- gentoo-dev@gentoo.org mailing list ^ permalink raw reply [flat|nested] 41+ messages in thread
[parent not found: <40A23987.9080104@gentoo.org>]
* [gentoo-dev] memtest86 fails? (was Re: [gentoo-dev] Major MCE problem with SMP on Gentoo kernels) [not found] ` <40A23987.9080104@gentoo.org> @ 2004-05-12 16:22 ` Kevin 2004-05-12 16:59 ` Greg KH 2004-05-12 17:15 ` Sven Vermeulen 0 siblings, 2 replies; 41+ messages in thread From: Kevin @ 2004-05-12 16:22 UTC (permalink / raw To: Gentoo Dev Thank you Chris, Bret, and Heiko for your replies, both on- and off-list. Your replies all look like good suggestions and I'm going to try them, but first, I have to ask one other thing. As I said earlier in this thread, I found zero errors in an 8 hour exhaustive memory test (using the Dell-provided Utility Partition tests) running 15-16 loops, 4 different tests, and found zero errors from a complete hardware test (also from the Utility Partition). But just on a lark, I decided to try memtest86 3.0 and 3.1a as well, and both are turning up errors all over the place. I'm skeptical that memtest86 is giving me accurate information because (a) it's finding so many, (b) I can't seem to get it to turn on ECC mode (yet the Dell utilities did test this, and the CMOS reports that it is ECC memory), (c) it runs for only 10 seconds or so and begins finding errors, and (d) it locks up after about 2-5 minutes. Last time I ran it, the error count was up to 109 after 5 minutes and then it locked up. Any thoughts on this? Is this bad memory (in spite of the Dell tests all turning up flawless) or is memtest86 getting something wrong (like amount of installed memory to test)? Any way to tell for sure? I've looked at the docs for memtest86 and they talk about the possibility of memtest86 incorrectly determining the amount of memory to test and that seems likely in my case. The first 10 or so errors are all at the same address (0003fffdc80) 1023.8MB, then there are 4 or so at 64.0MB, then 1 or 2 at 0.6MB, then it locks up. The Dell utilities report testing only up to 1022MB of memory. 
Is this a case of memtest86 getting the installed memory count wrong? When I look at the CMOS/BIOS settings, the System memory is 1024MB, and that's what the Dell utilities also report initially, but the tests themselves are only being run on the first 1022MB, according to the test reports. Thanks. -Kevin -- gentoo-dev@gentoo.org mailing list ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [gentoo-dev] memtest86 fails? (was Re: [gentoo-dev] Major MCE problem with SMP on Gentoo kernels) 2004-05-12 16:22 ` [gentoo-dev] memtest86 fails? (was Re: [gentoo-dev] Major MCE problem with SMP on Gentoo kernels) Kevin @ 2004-05-12 16:59 ` Greg KH 2004-05-12 17:18 ` Scott Myron 2004-05-12 17:15 ` Sven Vermeulen 1 sibling, 1 reply; 41+ messages in thread From: Greg KH @ 2004-05-12 16:59 UTC (permalink / raw To: Kevin; +Cc: Gentoo Dev On Wed, May 12, 2004 at 12:22:15PM -0400, Kevin wrote: > > Any thoughts on this? Trust memtest86, it is known to exercise memory quite well, and finds real errors. I wouldn't trust the dell "tests" at all, as who knows what they are really testing... Sounds like you have hardware problems. greg k-h -- gentoo-dev@gentoo.org mailing list ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [gentoo-dev] memtest86 fails? (was Re: [gentoo-dev] Major MCE problem with SMP on Gentoo kernels) 2004-05-12 16:59 ` Greg KH @ 2004-05-12 17:18 ` Scott Myron 0 siblings, 0 replies; 41+ messages in thread From: Scott Myron @ 2004-05-12 17:18 UTC (permalink / raw To: gentoo-dev; +Cc: Kevin Greg KH wrote: > On Wed, May 12, 2004 at 12:22:15PM -0400, Kevin wrote: > >>Any thoughts on this? > You may also want to try removing all but one stick of memory, and rerun memtest. If that fails, replace that with another stick of memory, and rerun the test again. This will help you find out which stick of memory is bad. If you only have one stick of memory, borrow some from a friend, if possible... You might also want to try testing your memory in a friend's machine, to verify the results. It's possible that there is a problem with the traces on the motherboard from the northbridge to the memory slots(unlikely, but possible). Scott -- gentoo-dev@gentoo.org mailing list ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [gentoo-dev] memtest86 fails? (was Re: [gentoo-dev] Major MCE problem with SMP on Gentoo kernels)
From: Sven Vermeulen @ 2004-05-12 17:15 UTC
To: Gentoo Dev

On Wed, May 12, 2004 at 12:22:15PM -0400, Kevin wrote:
> As I said earlier in this thread, I found zero errors in an 8 hour
> exhaustive memory test (using the Dell-provided Utility Partition tests)
> running 15-16 loops, 4 different tests, and found zero errors from a
> complete hardware test (also from the Utility Partition).
>
> But just on a lark, I decided to try memtest86 3.0 and 3.1a as well, and
> both are turning up errors all over the place.

We had exactly the same error on a PowerEdge 1600SC. I thought it was because of Dell's memory chips, but when I put others in (dunno what vendor though) I got the same errors from memtest86 again. A different motherboard resolved the issue.

I never had the MCE errors though (but then again, it didn't - and still doesn't - run Gentoo).

Just my 2 ¢,

Sven Vermeulen

--
Bent Hindrup Andersen, Danish MEP, about the Software Patent Directive: The approach of the Commission and Council in this directive is shocking. They are making full use of all the possibilities of evading democracy that the current Community Law provides. <http://lwn.net/Articles/84009/>