* [gentoo-dev] Major MCE problem with SMP on Gentoo kernels
@ 2004-05-11 18:07 Kevin
2004-05-11 18:46 ` Greg KH
2004-05-12 2:42 ` [gentoo-dev] Major MCE problem with SMP on Gentoo kernels Josh Glover
0 siblings, 2 replies; 41+ messages in thread
From: Kevin @ 2004-05-11 18:07 UTC (permalink / raw
To: Gentoo Dev
Hi All-
I'm writing here before filing a bug because I may be missing something
important, and because I'm not sure which details a bug report would need:
I can't tell whether the problem lies with the Gentoo kernels, with gcc, or
with something else. If I am missing something, however, I'm not the only
Gentoo user who's missing it, so I think that's unlikely. In March I saw a
thread on lkml from somebody running Gentoo in very similar (though not
identical) circumstances. He thought it was a kernel bug, but I don't
think so; see
http://groups.google.com/groups?hl=en&lr=&ie=UTF-8&oe=ISO-8859-1&threadm=1yyJD-8mD-11%40gated-at.bofh.it&rnum=6&prev=/groups%3Fq%3Dgroup:linux.kernel%2Bsmp%2Bgentoo%26hl%3Den%26lr%3D%26ie%3DUTF-8%26oe%3DISO-8859-1%26sa%3DG%26scoring%3Dd
(or search for "group:linux.kernel smp gentoo" on Google Groups, or see the
lkml thread "SMP + Hyperthreading / Asus PCDL Deluxe / Kernel 2.4.x 2.6.x /
Crash/Freeze").
Instead, I think the most likely explanation for my problem is a bug in
some Gentoo code somewhere, perhaps related to building kernels, but
maybe not... Maybe related to building gcc itself? Not sure.
In summary, my problem is this: of those that I've tried, I can't get any
Gentoo kernel to handle SMP operation during major CPU activity (like
emerging packages) for more than about 5 or 10 minutes. Invariably,
during such activity, I get a kernel panic---most often with words on the
console about Machine Check Exception 000000...004 (this number from
memory so it may be off).
The only way that I can get reliable, stable operation with a Gentoo
kernel and distribution is if I build a kernel without support for SMP.
This is stable with or without hyperthreading enabled in CMOS. Over the
last week or so, I've tried running kernels with the CMOS setting for
hyperthreading disabled and enabled, with support for SMP enabled and
disabled, in all combinations and for the latest stable ebuilds of the
following Gentoo kernels: vanilla-sources, gentoo-sources,
gentoo-dev-sources, gs-sources (actually, I couldn't even get this to
build---see bug #48973 and thread here: gs-sources problems: device
mapper: dm.o has undeclared identifiers). Going by memory, I tried
kernel versions 2.4.25 (vanilla?), 2.4.26 (gentoo?), and 2.6.5
(gentoo-dev?).
In all of the above circumstances, when running a kernel with support for
SMP and when emerging packages (some pretty small ones, so only about 3
or 5 minutes of compiling), the machine would lock with a kernel panic
and need a hard reset.
The machine is a Dell PowerEdge1600SC with a PERC-3/SC SCSI RAID
controller (using AMI megaraid2 driver) and a LSI Logic Corp controller
(using Fusion MPT base driver) for the SCSI DAT and with dual 2.4GHz Xeon
processors, each having a 512KB L2 Cache.
Output from /proc/cpuinfo is:
=======
processor : 0
vendor_id : GenuineIntel
cpu family : 15
model : 2
model name : Intel(R) Xeon(TM) CPU 2.40GHz
stepping : 7
cpu MHz : 2392.127
cache size : 512 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe cid
bogomips : 4771.02
processor : 1
vendor_id : GenuineIntel
cpu family : 15
model : 2
model name : Intel(R) Xeon(TM) CPU 2.40GHz
stepping : 7
cpu MHz : 2392.127
cache size : 512 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe cid
bogomips : 4771.02
processor : 2
vendor_id : GenuineIntel
cpu family : 15
model : 2
model name : Intel(R) Xeon(TM) CPU 2.40GHz
stepping : 9
cpu MHz : 2392.127
cache size : 512 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe cid
bogomips : 4771.02
processor : 3
vendor_id : GenuineIntel
cpu family : 15
model : 2
model name : Intel(R) Xeon(TM) CPU 2.40GHz
stepping : 9
cpu MHz : 2392.127
cache size : 512 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe cid
bogomips : 4771.02
=======
I wrote some details about this problem in gentoo-user under the thread:
2004.1 and SMP Problems, but since then, have done lots more testing.
The reason that I think this is a Gentoo thing and not a kernel thing is
that today I just finished installing SuSE9 on this same machine with
CMOS hyperthreading setting enabled and the CPUs have been wailing away
for hours doing simultaneous builds of several different source tarballs
(bind9, kde3.2.2, mysql 4.0.18), and I haven't seen even a single
problem. During these tests, I was running the SuSE kernel
2.4.21-215-smp4G.
In SuSE, the output of /proc/cpuinfo is close, but not exactly the same as
above. There are some differences in the flags and a couple other things
(use diff for specifics).
SuSE /proc/cpuinfo:
=======
processor : 0
vendor_id : GenuineIntel
cpu family : 15
model : 2
model name : Intel(R) Xeon(TM) CPU 2.40GHz
stepping : 7
cpu MHz : 2392.795
cache size : 512 KB
physical id : 0
siblings : 2
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm
bogomips : 4718.59
processor : 1
vendor_id : GenuineIntel
cpu family : 15
model : 2
model name : Intel(R) Xeon(TM) CPU 2.40GHz
stepping : 7
cpu MHz : 2392.795
cache size : 512 KB
physical id : 0
siblings : 2
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm
bogomips : 4767.74
processor : 2
vendor_id : GenuineIntel
cpu family : 15
model : 2
model name : Intel(R) Xeon(TM) CPU 2.40GHz
stepping : 9
cpu MHz : 2392.795
cache size : 512 KB
physical id : 2
siblings : 2
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm
bogomips : 4767.74
processor : 3
vendor_id : GenuineIntel
cpu family : 15
model : 2
model name : Intel(R) Xeon(TM) CPU 2.40GHz
stepping : 9
cpu MHz : 2392.795
cache size : 512 KB
physical id : 2
siblings : 2
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm
bogomips : 4767.74
=======
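As an aside, the flag differences between the two listings can be found mechanically rather than by eye. The following is only a sketch: `cpu_flags` is a hypothetical helper name, and the two strings are the processor 0 flags lines copied from the Gentoo and SuSE outputs above, including the mail client's line wrap.

```python
def cpu_flags(cpuinfo_text):
    """Return the set of flags from the first 'flags' entry in a cpuinfo dump."""
    lines = cpuinfo_text.splitlines()
    for i, line in enumerate(lines):
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            flags = value.split()
            # A wrapped continuation line has no "key : value" shape,
            # so keep collecting until the next keyed or blank line.
            for extra in lines[i + 1:]:
                if ":" in extra or not extra.strip():
                    break
                flags.extend(extra.split())
            return set(flags)
    return set()


# Flags lines for processor 0, copied from the two listings above.
gentoo = """flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe cid
bogomips : 4771.02"""

suse = """flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm
bogomips : 4718.59"""

only_gentoo = sorted(cpu_flags(gentoo) - cpu_flags(suse))
only_suse = sorted(cpu_flags(suse) - cpu_flags(gentoo))
print("only in Gentoo output:", only_gentoo)
print("only in SuSE output:", only_suse)
```

On this data the only flags that differ are `pbe` and `cid`, which the SuSE kernel does not list. That may simply reflect which flag names each kernel version knows how to print rather than any difference in how the hardware is driven.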
Gentoo emerge info output:
=======
System uname: 2.4.25-gentoo-r2 i686 Intel(R) Xeon(TM) CPU 2.40GHz
Gentoo Base System version 1.4.9
distcc 2.13 i686-pc-linux-gnu (protocols 1 and 2) (default port 3632) [enabled]
ccache version 2.3 [enabled]
Autoconf: sys-devel/autoconf-2.58-r1
Automake: sys-devel/automake-1.8.3
ACCEPT_KEYWORDS="x86"
AUTOCLEAN="yes"
CFLAGS="-O3 -march=pentium4 -pipe -fomit-frame-pointer"
CHOST="i686-pc-linux-gnu"
COMPILER="gcc3"
CONFIG_PROTECT="/etc /usr/X11R6/lib/X11/xkb /usr/kde/2/share/config /usr/kde/3.2/share/config /usr/kde/3/share/config /usr/lib/mozilla/defaults/pref /usr/share/config /usr/share/texmf/dvipdfm/config/ /usr/share/texmf/dvips/config/ /usr/share/texmf/tex/generic/config/ /usr/share/texmf/tex/platex/config/ /usr/share/texmf/xdvi/ /var/qmail/control"
CONFIG_PROTECT_MASK="/etc/afs/C /etc/afs/afsws /etc/gconf /etc/terminfo /etc/env.d"
CXXFLAGS="-O3 -march=pentium4 -pipe -fomit-frame-pointer"
DISTDIR="/usr/portage/distfiles"
FEATURES="autoaddcvs ccache distcc sandbox"
GENTOO_MIRRORS="http://128.213.5.34/gentoo/ http://mirror.datapipe.net/gentoo ftp://mirrors.sec.informatik.tu-darmstadt.de/gentoo/ http://gentoo.eliteitminds.com"
MAKEOPTS="-j3"
PKGDIR="/usr/portage/packages"
PORTAGE_TMPDIR="/var/tmp"
PORTDIR="/usr/portage"
PORTDIR_OVERLAY=""
SYNC="rsync://rsync.namerica.gentoo.org/gentoo-portage"
USE="X Xaw3d acl acpi afs alsa apache2 apm arts avi berkdb bonobo caps crypt cups doc emacs emacs-w3 encode esd ethereal evo firebird flac foomaticdb gdbm gif gnome gpm gstreamer gtk gtk2 gtkhtml guile hardened icq imagemagick imap imlib innodb ipv6 jabber jack java jikes jpeg kde kerberos krb4 ldap libg++ libwww mad mcal mikmod motif mozilla mpeg mysql ncurses nls odbc oggvorbis opengl oss pam pda pdflib perl plotutils png ppds prelude python qt quicktime readline ruby samba sasl sdl slang slp spell sse ssl svga tcltk tcpd tetex tiff truetype unicode usb vhosts x86 xinerama xml2 xmms xv zeo zlib"
=======
I'm a recent Gentoo convert. I think it's an excellent improvement on the
traditional Linux distros, and I'd really like to use it on my server,
but as long as this problem with SMP is present, I just can't.
If anyone has any suggestions on what I might be doing wrong and how I can
get a stable gentoo system with full support for SMP (ideally in a 2.4.x
kernel since I need OpenAFS and OpenAFS doesn't work with 2.6.x right
now; nor for the near future say the developers of OAFS), I would really
appreciate getting your thoughts.
Thanks in advance.
--
-Kevin
PS. FWIW, I'll add that I have a very vague memory while watching text fly
up the screen during bootstrap.sh or emerge system (this was a stage 1
install before I installed SuSE over it) of seeing some warning about
something being unsafe with SMP. Do I need to have some setting or other
turned off for some parts of a stage 1 install with a dual CPU system?
--
gentoo-dev@gentoo.org mailing list
* Re: [gentoo-dev] Major MCE problem with SMP on Gentoo kernels
2004-05-11 18:07 [gentoo-dev] Major MCE problem with SMP on Gentoo kernels Kevin
@ 2004-05-11 18:46 ` Greg KH
2004-05-11 18:55 ` Kevin
2004-05-12 2:42 ` [gentoo-dev] Major MCE problem with SMP on Gentoo kernels Josh Glover
1 sibling, 1 reply; 41+ messages in thread
From: Greg KH @ 2004-05-11 18:46 UTC (permalink / raw
To: Kevin; +Cc: Gentoo Dev
On Tue, May 11, 2004 at 02:07:58PM -0400, Kevin wrote:
> In summary, my problem is this: of those that I've tried, I can't get any
> Gentoo kernel to handle SMP operation during major CPU activity (like
> emerging packages) for more than about 5 or 10 minutes. Invariably,
> during such activity, I get a kernel panic---most often with words on the
> console about Machine Check Exception 000000...004 (this number from
> memory so it may be off).
This means you have bad hardware (memory, cpu, overheating, etc.). It's
the hardware saying that something bad just happened, not anything the OS
or distro did wrong.
I suggest you track down that exact error message and determine what it
means (there are tables and a tool that does that, sorry I can't
remember what it is...)
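To make "determine what it means" a little more concrete, here is a minimal sketch of decoding such a number by hand. It rests on an assumption: that the value printed after "Machine Check Exception:" is the machine-check global status register (IA32_MCG_STATUS), which is what i386 kernels of this era appear to print on that line. The bit positions are the ones documented in the Intel manuals; `decode_mcg_status` is a hypothetical helper name, and the input is the number Kevin quoted from memory, so it may not be the real value.

```python
# Bit positions in IA32_MCG_STATUS, per the Intel architecture manuals.
MCG_STATUS_BITS = {
    "RIPV": 0,  # restart IP valid: execution can resume at the saved IP
    "EIPV": 1,  # error IP valid: the saved IP points at the faulting instruction
    "MCIP": 2,  # machine check in progress
}


def decode_mcg_status(value):
    """Return a dict of flag name -> bool for a raw MCG_STATUS value."""
    return {name: bool((value >> bit) & 1) for name, bit in MCG_STATUS_BITS.items()}


# The number from the original post, as remembered (an assumption, not a log).
status = decode_mcg_status(int("0000000000000004", 16))
print(status)
```

If that assumption holds, a value of 4 only says that a machine check is in progress and that it is not restartable. The per-bank status lines that usually follow on the console are the ones that identify the failing unit, and they need a separate decoder.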
Good luck,
greg k-h
* Re: [gentoo-dev] Major MCE problem with SMP on Gentoo kernels
2004-05-11 18:46 ` Greg KH
@ 2004-05-11 18:55 ` Kevin
2004-05-11 19:04 ` Greg KH
2004-05-11 19:38 ` Paul de Vrieze
0 siblings, 2 replies; 41+ messages in thread
From: Kevin @ 2004-05-11 18:55 UTC (permalink / raw
To: gentoo-dev
On Tuesday 11 May 2004 14:46, Greg KH wrote:
> On Tue, May 11, 2004 at 02:07:58PM -0400, Kevin wrote:
> > In summary, my problem is this: of those that I've tried, I can't
> > get any Gentoo kernel to handle SMP operation during major CPU
> > activity (like emerging packages) for more than about 5 or 10
> > minutes. Invariably, during such activity, I get a kernel
> > panic---most often with words on the console about Machine Check
> > Exception 000000...004 (this number from memory so it may be off).
>
> This means you have bad hardware (memory, cpu, overheating, etc.)
> It's the hardware saying that something bad just happened, nothing
> the OS or distro did wrong here.
Thanks for your reply, Greg. Although what you say here may be true in
some circumstances, I think you're wrong in this case. You may have
stopped reading after the above paragraph, but in the rest of my post,
I describe how a SuSE9 distro installed on this same hardware has no
problems doing all of the things that failed in Gentoo. That's a
pretty strong indication that there are no hardware problems, isn't it?
-Kevin
* Re: [gentoo-dev] Major MCE problem with SMP on Gentoo kernels
2004-05-11 18:55 ` Kevin
@ 2004-05-11 19:04 ` Greg KH
2004-05-11 19:38 ` Kevin
2004-05-11 19:38 ` Paul de Vrieze
1 sibling, 1 reply; 41+ messages in thread
From: Greg KH @ 2004-05-11 19:04 UTC (permalink / raw
To: Kevin; +Cc: gentoo-dev
On Tue, May 11, 2004 at 02:55:47PM -0400, Kevin wrote:
> On Tuesday 11 May 2004 14:46, Greg KH wrote:
> > On Tue, May 11, 2004 at 02:07:58PM -0400, Kevin wrote:
> > > In summary, my problem is this: of those that I've tried, I can't
> > > get any Gentoo kernel to handle SMP operation during major CPU
> > > activity (like emerging packages) for more than about 5 or 10
> > > minutes. Invariably, during such activity, I get a kernel
> > > panic---most often with words on the console about Machine Check
> > > Exception 000000...004 (this number from memory so it may be off).
> >
> > This means you have bad hardware (memory, cpu, overheating, etc.)
> > It's the hardware saying that something bad just happened, nothing
> > the OS or distro did wrong here.
>
> Thanks for your reply, Greg. Although what you say here may be true in
> some circumstances, I think you're wrong in this case. You may have
> stopped reading after the above paragraph, but in the rest of my post,
> I describe how a SuSE9 distro installed on this same hardware has no
> problems doing all of the things that failed in Gentoo. That's a
> pretty strong indication that there are no hardware problems, isn't it?
Not at all. Different compilers/kernels/programs exercise hardware in
very different ways. It could be that your compiler settings on Gentoo
cause different instructions to be used than for the same program on SuSE.
Try running memtest86 overnight as a good start to rule out your memory.
Good luck,
greg k-h
* Re: [gentoo-dev] Major MCE problem with SMP on Gentoo kernels
2004-05-11 18:55 ` Kevin
2004-05-11 19:04 ` Greg KH
@ 2004-05-11 19:38 ` Paul de Vrieze
2004-05-11 21:37 ` Kevin
1 sibling, 1 reply; 41+ messages in thread
From: Paul de Vrieze @ 2004-05-11 19:38 UTC (permalink / raw
To: gentoo-dev
On Tuesday 11 May 2004 20:55, Kevin wrote:
> On Tuesday 11 May 2004 14:46, Greg KH wrote:
> > On Tue, May 11, 2004 at 02:07:58PM -0400, Kevin wrote:
> > > In summary, my problem is this: of those that I've tried, I can't
> > > get any Gentoo kernel to handle SMP operation during major CPU
> > > activity (like emerging packages) for more than about 5 or 10
> > > minutes. Invariably, during such activity, I get a kernel
> > > panic---most often with words on the console about Machine Check
> > > Exception 000000...004 (this number from memory so it may be off).
> >
> > This means you have bad hardware (memory, cpu, overheating, etc.)
> > It's the hardware saying that something bad just happened, nothing
> > the OS or distro did wrong here.
>
> Thanks for your reply, Greg. Although what you say here may be true in
> some circumstances, I think you're wrong in this case. You may have
> stopped reading after the above paragraph, but in the rest of my post,
> I describe how a SuSE9 distro installed on this same hardware has no
> problems doing all of the things that failed in Gentoo. That's a
> pretty strong indication that there are no hardware problems, isn't it?
Do you also have errors when you run a vanilla kernel? What if you take the
kernel from SUSE? And which compiler do you use?
Paul
--
Paul de Vrieze
Gentoo Developer
Mail: pauldv@gentoo.org
Homepage: http://www.devrieze.net
* Re: [gentoo-dev] Major MCE problem with SMP on Gentoo kernels
2004-05-11 19:04 ` Greg KH
@ 2004-05-11 19:38 ` Kevin
2004-05-11 20:54 ` Chris Gianelloni
0 siblings, 1 reply; 41+ messages in thread
From: Kevin @ 2004-05-11 19:38 UTC (permalink / raw
To: gentoo-dev
On Tuesday 11 May 2004 15:04, Greg KH wrote:
> On Tue, May 11, 2004 at 02:55:47PM -0400, Kevin wrote:
> > Thanks for your reply, Greg. Although what you say here may be
> > true in some circumstances, I think you're wrong in this case. You
> > may have stopped reading after the above paragraph, but in the rest
> > of my post, I describe how a SuSE9 distro installed on this same
> > hardware has no problems doing all of the things that failed in
> > Gentoo. That's a pretty strong indication that there are no
> > hardware problems, isn't it?
>
> Not at all. Different compilers/kernels/programs exercise hardware
> in very different ways. It could be that your compiler settings on
> Gentoo cause different instructions to be used than for the same
> program on SuSE.
>
> Try running memtest86 overnight as a good start to rule out your
> memory.
Ok. Thanks for the suggestion. But what about this: Dell has a utility
partition and some programs for doing exhaustive testing of all the
hardware in the server. If I run the most thorough set of tests
available in this utility partition and I get a clean bill of health,
is that a reliable indication that there are no hardware problems? Or
does memtest86 do testing that's more exhaustive than most such utility
suites?
If the utility partition testing says all is well (I've done it several
times in the last month or so, though maybe not the most extensive
tests), what's the next place to look for an explanation of why this
MCE is happening in Gentoo but not in SuSE?
-Kevin
* Re: [gentoo-dev] Major MCE problem with SMP on Gentoo kernels
2004-05-11 19:38 ` Kevin
@ 2004-05-11 20:54 ` Chris Gianelloni
2004-05-11 21:31 ` Kevin
0 siblings, 1 reply; 41+ messages in thread
From: Chris Gianelloni @ 2004-05-11 20:54 UTC (permalink / raw
To: Kevin; +Cc: gentoo-dev
On Tue, 2004-05-11 at 15:38, Kevin wrote:
> Ok. Thanks for the suggestion. But what about this: Dell has a utility
> partition and some programs for doing exhaustive testing of all the
> hardware in the server. If I run the most thorough set of tests
> available in this utility partition and I get a clean bill of health,
> is that a reliable indication that there are no hardware problems? Or
> does memtest86 do testing that's more exhaustive than most such utility
> suites?
I think the Dell suite would be more extensive.
> If the utility partition testing says all is well (I've done it several
> times in the last month or so, though maybe not the most extensive
> tests), what's the next place to look for an explanation of why this
> MCE is happening in Gentoo but not in SuSE?
Are you sure that it isn't MCE *causing* these problems? Have you tried
turning it off and seeing if you still have the same kinds of problems?
--
Chris Gianelloni
Developer, Gentoo Linux
Games Team
Is your power animal a penguin?
* Re: [gentoo-dev] Major MCE problem with SMP on Gentoo kernels
2004-05-11 20:54 ` Chris Gianelloni
@ 2004-05-11 21:31 ` Kevin
0 siblings, 0 replies; 41+ messages in thread
From: Kevin @ 2004-05-11 21:31 UTC (permalink / raw
To: gentoo-dev
On Tuesday 11 May 2004 16:54, Chris Gianelloni wrote:
> On Tue, 2004-05-11 at 15:38, Kevin wrote:
> > Ok. Thanks for the suggestion. But what about this: Dell has a
> > utility partition and some programs for doing exhaustive testing of
> > all the hardware in the server. If I run the most thorough set of
> > tests available in this utility partition and I get a clean bill of
> > health, is that a reliable indication that there are no hardware
> > problems? Or does memtest86 do testing that's more exhaustive than
> > most such utility suites?
>
> I think the Dell suite would be more extensive.
Thanks for saying so, Chris.
>
> > If the utility partition testing says all is well (I've done it
> > several times in the last month or so, though maybe not the most
> > extensive tests), what's the next place to look for an explanation
> > of why this MCE is happening in Gentoo but not in SuSE?
>
> Are you sure that it isn't MCE *causing* these problems? Have you
> tried turning it off and seeing if you still have the same kinds of
> problems?
I'm not sure I understand what you mean by that. The first time I got a
kernel panic and MCE, I believe that the kernel I was running had no
configured capability to deal with MCE errors (though I'm not sure of
that). I had never seen an MCE before, but after this first time, with
any other kernels I built, I searched through the .config file options
for handlers of MCE errors and built them into the kernel where they
were available. IIRC, then when I got a kernel panic with those
kernels, I had some more information (apparently generated by the
kernel) on the console than I did with the first MCE. I add this
information in case it relates to your question or point here, but I'm
really not sure what you mean by, "Have you tried turning it off..."
Where do I turn it off? Do you mean the .config file parameter in the
kernel configuration process that builds (or not) a handler for the MCE
errors? Or do you mean something else?
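For reference, one way to see which machine-check options a given kernel was built with is to scan its .config. This is only a sketch under assumptions: `mce_options` is a hypothetical helper, the sample text is made up for illustration, and the symbol names in it (CONFIG_X86_MCE and friends) are the i386 names as I recall them, so check them against your own tree.

```python
import re


def mce_options(config_text):
    """Map each CONFIG_*MCE* symbol in a kernel .config to its value ('n' if unset)."""
    options = {}
    for raw in config_text.splitlines():
        line = raw.strip()
        set_match = re.match(r"^(CONFIG_\w*MCE\w*)=(\w+)$", line)
        unset_match = re.match(r"^# (CONFIG_\w*MCE\w*) is not set$", line)
        if set_match:
            options[set_match.group(1)] = set_match.group(2)
        elif unset_match:
            options[unset_match.group(1)] = "n"
    return options


# Made-up excerpt for illustration; not taken from any kernel in this thread.
sample = """\
CONFIG_SMP=y
CONFIG_X86_MCE=y
# CONFIG_X86_MCE_NONFATAL is not set
CONFIG_X86_MCE_P4THERMAL=y
"""

found = mce_options(sample)
print(found)
```

To run it against a real tree you would pass in the contents of that tree's .config file. Note that building without the handler (or, as far as I know, booting an i386 kernel with `nomce`) only changes what the kernel does about a machine check; it does not stop the CPU from detecting one.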
Honestly, I'm thinking that I may have somehow built some software
(during the stage 1 installation process) that is causing these
problems, but I followed the Gentoo Handbook for doing a stage 1
installation pretty rigidly, so I'm not sure what I might have done to
cause that. When I did the bootstrap.sh and emerge system, I was
running the kernel that I booted from the boot CD (2004.0 I think, and
probably even the smp kernel that was on that CD---IIRC, the 2004.1
boot CD has some problems that prevent the use of the smp kernel on
that CD).
In fact, now that I think of it, I'm pretty sure I didn't get any MCE
kernel panics until after I finished emerge system and other tasks and
then rebooted my new Gentoo system. Perhaps this helps isolate the
cause of the problems. While I was doing the bootstrap.sh and emerge
system, it's definitely true that I was stressing the system out with
lots of compile jobs (which is what has been triggering my MCEs), but
I'm pretty sure I did not get any MCE failures during those steps.
Does this help someone figure out what's going on in my case?
Are there some compiler flags or other configurable settings that, if
set to certain values during the bootstrap.sh or emerge system steps,
could end up generating software (perhaps when I built my own gcc?)
that would cause these MCEs to be thrown?
Like I said in my PS in my first post, I have this vague memory of
seeing something that said, such-and-such is not smp safe. Have no
clue what that might have been now, though, or even if it's an accurate
memory. Some of this work was done in the wee hours...
Thanks for the replies and any other suggestions.
-Kevin
* Re: [gentoo-dev] Major MCE problem with SMP on Gentoo kernels
2004-05-11 19:38 ` Paul de Vrieze
@ 2004-05-11 21:37 ` Kevin
2004-05-12 1:02 ` Georgi Georgiev
0 siblings, 1 reply; 41+ messages in thread
From: Kevin @ 2004-05-11 21:37 UTC (permalink / raw
To: gentoo-dev
On Tuesday 11 May 2004 15:38, Paul de Vrieze wrote:
> On Tuesday 11 May 2004 20:55, Kevin wrote:
> >
> > Thanks for your reply, Greg. Although what you say here may be
> > true in some circumstances, I think you're wrong in this case. You
> > may have stopped reading after the above paragraph, but in the rest
> > of my post, I describe how a SuSE9 distro installed on this same
> > hardware has no problems doing all of the things that failed in
> > Gentoo. That's a pretty strong indication that there are no
> > hardware problems, isn't it?
>
> Do you also have errors when you run a vanilla kernel?
Yes, I mentioned that in my first post. Or by vanilla do you mean a
kernel from kernel.org (as opposed to the Gentoo vanilla-sources
kernel---that's the one I tried; isn't that identical to the kernel
from kernel.org?)
> What if you
> take the kernel from SUSE,
I haven't tried installing Gentoo with my SuSE kernel running. Huh...
what a concept. With all the modularity of those default distro
kernels, would that even work? Maybe I'd need the kernel, the
System.map, and the /lib/modules/`uname -r` directory?
> which compiler do you use?
I built the standard compiler that you get with
ACCEPT_KEYWORDS="x86" (stable). gcc and friends. Whatever is the
standard stable ebuild is the one I built.
Thanks.
-Kevin
* Re: [gentoo-dev] Major MCE problem with SMP on Gentoo kernels
2004-05-11 21:37 ` Kevin
@ 2004-05-12 1:02 ` Georgi Georgiev
2004-05-12 10:23 ` [gentoo-dev] [OT] SuSE kernel on gentoo system (Was: Re: Major MCE problem with SMP on Gentoo kernels) sf
0 siblings, 1 reply; 41+ messages in thread
From: Georgi Georgiev @ 2004-05-12 1:02 UTC (permalink / raw
To: gentoo-dev
maillog: 11/05/2004-17:37:39(-0400): Kevin types
> On Tuesday 11 May 2004 15:38, Paul de Vrieze wrote:
> > On Tuesday 11 May 2004 20:55, Kevin wrote:
> > >
> > > Thanks for your reply, Greg. Although what you say here may be
> > > true in some circumstances, I think you're wrong in this case. You
> > > may have stopped reading after the above paragraph, but in the rest
> > > of my post, I describe how a SuSE9 distro installed on this same
> > > hardware has no problems doing all of the things that failed in
> > > Gentoo. That's a pretty strong indication that there are no
> > > hardware problems, isn't it?
> >
> > Do you also have errors when you run a vanilla kernel?
>
> Yes, I mentioned that in my first post. Or by vanilla do you mean a
> kernel from kernel.org (as opposed to the Gentoo vanilla-sources
> kernel---that's the one I tried; isn't that identical to the kernel
> from kernel.org?)
You didn't try development-sources, according to your original post. You only
mention 2.6.5 gentoo-dev-sources. What about a vanilla 2.6.5 (or 2.6.6 by now)?
> > What if you
> > take the kernel from SUSE,
>
> I haven't tried installing Gentoo with my SuSE kernel running. Huh...
> what a concept. With all the modularity of those default distro
> kernels, would that even work? Maybe I'd need the kernel, the
> System.map, and the /lib/modules/`uname -r` directory?
You don't even need the System.map.
--
*- Georgi Georgiev *- By golly, I'm beginning to think Linux *-
-* chutz@gg3.net -* really *is* the best thing since sliced -*
*- +81(90)6266-1163 *- bread. *-
* Re: [gentoo-dev] Major MCE problem with SMP on Gentoo kernels
2004-05-11 18:07 [gentoo-dev] Major MCE problem with SMP on Gentoo kernels Kevin
2004-05-11 18:46 ` Greg KH
@ 2004-05-12 2:42 ` Josh Glover
2004-05-12 9:31 ` Dan Podeanu
2004-05-12 11:24 ` Kevin
1 sibling, 2 replies; 41+ messages in thread
From: Josh Glover @ 2004-05-12 2:42 UTC (permalink / raw
To: Gentoo Dev
Quoth Kevin (Tue 2004-05-11 02:07:58PM -0400):
> In summary, my problem is this: of those that I've tried, I can't get any
> Gentoo kernel to handle SMP operation during major CPU activity (like
> emerging packages) for more than about 5 or 10 minutes. Invariably,
> during such activity, I get a kernel panic---most often with words on the
> console about Machine Check Exception 000000...004 (this number from
> memory so it may be off).
[...]
> The only way that I can get reliable, stable operation with a Gentoo
> kernel and distribution is if I build a kernel without support for SMP.
[...]
> The machine is a Dell PowerEdge1600SC with a PERC-3/SC SCSI RAID
> controller (using AMI megaraid2 driver) and a LSI Logic Corp controller
> (using Fusion MPT base driver) for the SCSI DAT and with dual 2.4GHz Xeon
> processors, each having a 512KB L2 Cache.
Running Gentoo with a 2.6.5 SMP kernel on a Dell PowerEdge 400SC:
: jmglov@jglover; uname -a
Linux jglover 2.6.5-gentoo-r1 #1 SMP Fri Apr 30 17:37:18 EDT 2004 i686 Intel(R) Pentium(R) 4 CPU 2.40GHz GenuineIntel GNU/Linux
> Output from /proc/cpuinfo is:
<snip>
: jmglov@jglover; cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 15
model : 2
model name : Intel(R) Pentium(R) 4 CPU 2.40GHz
stepping : 9
cpu MHz : 2395.027
cache size : 512 KB
physical id : 0
siblings : 2
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe cid
bogomips : 4718.59
processor : 1
vendor_id : GenuineIntel
cpu family : 15
model : 2
model name : Intel(R) Pentium(R) 4 CPU 2.40GHz
stepping : 9
cpu MHz : 2395.027
cache size : 512 KB
physical id : 0
siblings : 2
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe cid
bogomips : 4767.74
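A side note on reading this output: the "physical id" field groups logical processors by package, so it shows whether two "processors" are hyperthreaded siblings or separate chips. The sketch below is illustrative only. `logical_cpus_per_package` is a hypothetical helper, and the two strings are trimmed copies of the listing above and of the SuSE listing from the original post.

```python
from collections import Counter


def logical_cpus_per_package(cpuinfo_text):
    """Count logical processors reported under each 'physical id'."""
    counts = Counter()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "physical id":
            counts[value.strip()] += 1
    return dict(counts)


# Trimmed from the PowerEdge 400SC output above.
poweredge_400sc = """\
processor : 0
physical id : 0
siblings : 2
processor : 1
physical id : 0
siblings : 2
"""

# Trimmed from the SuSE output for the PowerEdge 1600SC in the original post.
poweredge_1600sc = """\
processor : 0
physical id : 0
processor : 1
physical id : 0
processor : 2
physical id : 2
processor : 3
physical id : 2
"""

single = logical_cpus_per_package(poweredge_400sc)
dual = logical_cpus_per_package(poweredge_1600sc)
print(single)
print(dual)
```

Read this way, the 400SC reports one package with two hyperthreaded siblings, while the 1600SC reports two packages with two siblings each, so the two machines exercise SMP somewhat differently.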
> I'm a recent Gentoo convert. I think it's an excellent improvement on the
> traditional Linux distros, and I'd really like to use it on my server,
> but as long as this problem with SMP is present, I just can't.
I really do not think it is a Gentoo issue. I have run Gentoo on quite a
few SMP boxen over the past several years, and never had problems like
you describe. Sounds like a hardware issue to me, unless you are using
(or were using) some really bogus CFLAGS.
> PS. FWIW, I'll add that I have a very vague memory while watching text fly
> up the screen during bootstrap.sh or emerge system (this was a stage 1
> install before I installed SuSE over it) of seeing some warning about
> something being unsafe with SMP. Do I need to have some setting or other
> turned off for some parts of a stage 1 install with a dual CPU system?
Nope.
Quoth Kevin (Tue, 11 May 2004 15:38:35 -0400):
> Ok. Thanks for the suggestion. But what about this: Dell has a utility
> partition and some programs for doing exhaustive testing of all the
> hardware in the server. If I run the most thorough set of tests
> available in this utility partition and I get a clean bill of health,
> is that a reliable indication that there are no hardware problems?
Nope. Tragically, it usually works the other way around: hardware test
suites are unlikely to give you a false positive, but if your hardware
passes, that does not mean you are safe. Your issue might be heat-
related, and your CPUs have to heat up for quite some time before they
choke. Combine this with some other issue (maybe you optimised a bit
aggressively when building your kernel?), and you have a tricky issue
for a hardware tester to catch.
Quoth Kevin (Tue, 11 May 2004 17:31:32 -0400):
> Honestly, I'm thinking that I may have somehow built some software
> (during the stage 1 installation process) that is causing these
> problems, but I followed the Gentoo Handbook for doing a stage 1
> installation pretty rigidly, so I'm not sure what I might have done to
> cause that.
Why did you do a Stage 1, just out of curiosity? I recommend doing at
least one Stage 1 install for newcomers to Gentoo, just for educational
purposes, but after that, go Stage 3 and use as many binary packages as
you can! There exists a Stage 3 tarball for your architecture--the
Pentium4 one, so why not use that, just to make sure your base system
is solid?
> When I did the bootstrap.sh and emerge system, I was
> running the kernel that I booted from the boot CD (2004.0 I think, and
> probably even the smp kernel that was on that CD---IIRC, the 2004.1
> boot CD has some problems that prevent the use of the smp kernel on
> that CD).
I don't remember that, but I cannot say for certain that I have tried
the 2004.1 universal x86 CD with the SMP kernel.
> Are there some compiler flags or other configurable settings that, if
> set to certain values during the bootstrap.sh or emerge system steps,
> could end up generating software (perhaps when I built my own gcc?)
> that would cause these MCEs to be thrown?
I dunno, why don't you post your CFLAGS and MAKEOPTS from your make.conf
here?
Quoth Kevin (Tue, 11 May 2004 17:37:39 -0400):
> On Tuesday 11 May 2004 15:38, Paul de Vrieze wrote:
>
>> What if you take the kernel from SUSE,
>
> I haven't tried installing Gentoo with my SuSE kernel running. Huh...
> what a concept. With all the modularity of those default distro
> kernels, would that even work? Maybe I'd need the kernel, the
> System.map, and the /lib/modules/`uname -r` directory?
Yes, you can install Gentoo while running *any* kernel. As long as you
can chroot, you can install Gentoo. See my Faketoo for an example:
http://forums.gentoo.org/viewtopic.php?p=1082580
Note that I do not actually build a kernel and setup the bootloader and
so forth, since I do not need to boot my jailed Gentoo installation--it
is just for ebuild development. However, nothing is stopping *you* from
doing it. :)
--
Josh Glover
GPG keyID 0xDE8A3103 (C3E4 FA9E 1E07 BBDB 6D8B 07AB 2BF1 67A1 DE8A 3103)
gpg --keyserver pgp.mit.edu --recv-keys DE8A3103
[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]
* Re: [gentoo-dev] Major MCE problem with SMP on Gentoo kernels
2004-05-12 2:42 ` [gentoo-dev] Major MCE problem with SMP on Gentoo kernels Josh Glover
@ 2004-05-12 9:31 ` Dan Podeanu
2004-05-12 11:26 ` Kevin
2004-05-12 11:24 ` Kevin
1 sibling, 1 reply; 41+ messages in thread
From: Dan Podeanu @ 2004-05-12 9:31 UTC (permalink / raw
To: Gentoo Dev
Some (Greg KH and others) will say I'm crazy, but... have you tried to
compile the kernel with gcc 2.95.3 instead of the latest Gentoo stable, 3.3.2?
In my case it -has- helped on more than a couple of occasions.
Cheers,
Dan.
--
gentoo-dev@gentoo.org mailing list
* [gentoo-dev] [OT] SuSE kernel on gentoo system (Was: Re: Major MCE problem with SMP on Gentoo kernels)
2004-05-12 1:02 ` Georgi Georgiev
@ 2004-05-12 10:23 ` sf
0 siblings, 0 replies; 41+ messages in thread
From: sf @ 2004-05-12 10:23 UTC (permalink / raw
To: gentoo-dev
Georgi Georgiev wrote:
> maillog: 11/05/2004-17:37:39(-0400): Kevin types
...
>>I haven't tried installing Gentoo with my SuSE kernel running. Huh...
>>what a concept. With all the modularity of those default distro
>>kernels, would that even work? Maybe I'd need the kernel, the
>>System.map, and the /lib/modules/`uname -r` directory?
>
>
> You don't even need the System.map.
>
But you most likely will need an initrd.
I am using the custom 2.6.4 kernel from SuSE 9.1 (it has working isdn
support for avm fritzcards with avm's binary modules) on one of my
gentoo systems. This kernel does not have reiserfs compiled in. If you
want to use SuSE's initrd you have to setup udev as well.
Until now everything works perfectly.
Regards
Stephan
--
gentoo-dev@gentoo.org mailing list
* Re: [gentoo-dev] Major MCE problem with SMP on Gentoo kernels
2004-05-12 2:42 ` [gentoo-dev] Major MCE problem with SMP on Gentoo kernels Josh Glover
2004-05-12 9:31 ` Dan Podeanu
@ 2004-05-12 11:24 ` Kevin
2004-05-12 11:48 ` Josh Glover
1 sibling, 1 reply; 41+ messages in thread
From: Kevin @ 2004-05-12 11:24 UTC (permalink / raw
To: Gentoo Dev
Thanks for your reply, Josh.
On Tuesday 11 May 2004 22:42, Josh Glover wrote:
> Quoth Kevin (Tue 2004-05-11 02:07:58PM -0400):
> > The machine is a Dell PowerEdge1600SC with a PERC-3/SC SCSI RAID
> > controller (using AMI megaraid2 driver) and a LSI Logic Corp
> > controller (using Fusion MPT base driver) for the SCSI DAT and with
> > dual 2.4GHz Xeon processors, each having a 512KB L2 Cache.
>
> Running Gentoo with a 2.6.5 SMP kernel on a Dell PowerEdge 400SC:
> : jmglov@jglover; uname -a
>
> Linux jglover 2.6.5-gentoo-r1 #1 SMP Fri Apr 30 17:37:18 EDT 2004 i686
> Intel(R) Pentium(R) 4 CPU 2.40GHz GenuineIntel GNU/Linux
>
> : jmglov@jglover; cat /proc/cpuinfo
>
> processor : 0
> vendor_id : GenuineIntel
> cpu family : 15
> model : 2
> model name : Intel(R) Pentium(R) 4 CPU 2.40GHz
> stepping : 9
> cpu MHz : 2395.027
> cache size : 512 KB
> physical id : 0
> siblings : 2
> fdiv_bug : no
> hlt_bug : no
> f00f_bug : no
> coma_bug : no
> fpu : yes
> fpu_exception : yes
> cpuid level : 2
> wp : yes
> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
> mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe cid
> bogomips : 4718.59
>
> processor : 1
> vendor_id : GenuineIntel
> cpu family : 15
> model : 2
> model name : Intel(R) Pentium(R) 4 CPU 2.40GHz
> stepping : 9
> cpu MHz : 2395.027
> cache size : 512 KB
> physical id : 0
> siblings : 2
> fdiv_bug : no
> hlt_bug : no
> f00f_bug : no
> coma_bug : no
> fpu : yes
> fpu_exception : yes
> cpuid level : 2
> wp : yes
> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
> mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe cid
> bogomips : 4767.74
Have you turned off hyperthreading? Why is it that only two CPUs show up?
It looks like (flags include ht) the CPUs support hyperthreading... or am
I way off base in drawing that conclusion here?
>
> > I'm a recent Gentoo convert. I think it's an excellent improvement
> > on the traditional Linux distros, and I'd really like to use it on my
> > server, but as long as this problem with SMP is present, I just
> > can't.
>
> I really do not think it is a Gentoo issue. I have run Gentoo on quite
> a few SMP boxen over the past several years, and never had problems
> like you describe. Sounds like a hardware issue to me, unless you are
> using (or were using) some really bogus CFLAGS.
Well, last night I did the most exhaustive set of tests available on the
Dell Utility Partition and found zero errors. I did 16 loops of the MATS
test and 15 of TestY, TestA, and ECC coupling. The memory tests ran for
about 8 hours. Again, zero errors.
> Quoth Kevin (Tue, 11 May 2004 17:31:32 -0400):
> > Honestly, I'm thinking that I may have somehow built some software
> > (during the stage 1 installation process) that is causing these
> > problems, but I followed the Gentoo Handbook for doing a stage 1
> > installation pretty rigidly, so I'm not sure what I might have done
> > to cause that.
>
> Why did you do a Stage 1, just out of curiosity?
Some silly sense of pride or accomplishment that every bit of code running
on the box was actually compiled on it. I really like that Gentoo makes
that option a practical one. Silly, I know...
> I recommend doing at
> least one Stage 1 install for newcomers to Gentoo, just for educational
> purposes, but after that, go Stage 3 and use as many binary packages as
> you can! There exists a Stage 3 tarball for your architecture--the
> Pentium4 one, so why not use that, just to make sure your base system
> is solid?
That probably is a good thing for me to try next, and I really would like
to be able to use all the advantages of Gentoo on this server. I'm fed
up with rpm hell and the traditional distros' drawbacks.
>
> > When I did the bootstrap.sh and emerge system, I was
> > running the kernel that I booted from the boot CD (2004.0 I think,
> > and probably even the smp kernel that was on that CD---IIRC, the
> > 2004.1 boot CD has some problems that prevent the use of the smp
> > kernel on that CD).
>
> I don't remember that, but I cannot say for certain that I have tried
> the 2004.1 universal x86 CD with the SMP kernel.
It may have changed (I did this just a few days after the 2004.1 CD came
out, and I think it was the minimal one).
>
> > Are there some compiler flags or other configurable settings that, if
> > set to certain values during the bootstrap.sh or emerge system steps,
> > could end up generating software (perhaps when I built my own gcc?)
> > that would cause these MCEs to be thrown?
>
> I dunno, why don't you post your CFLAGS and MAKEOPTS from your
> make.conf here?
Did already :)
Here they are again:
On Tuesday 11 May 2004 14:07, Kevin wrote:
> Gentoo emerge info output:
> =======
> System uname: 2.4.25-gentoo-r2 i686 Intel(R) Xeon(TM) CPU 2.40GHz
> Gentoo Base System version 1.4.9
> distcc 2.13 i686-pc-linux-gnu (protocols 1 and 2) (default port 3632) [enabled]
> ccache version 2.3 [enabled]
> Autoconf: sys-devel/autoconf-2.58-r1
> Automake: sys-devel/automake-1.8.3
> ACCEPT_KEYWORDS="x86"
> AUTOCLEAN="yes"
> CFLAGS="-O3 -march=pentium4 -pipe -fomit-frame-pointer"
> CHOST="i686-pc-linux-gnu"
> COMPILER="gcc3"
> CONFIG_PROTECT="/etc /usr/X11R6/lib/X11/xkb /usr/kde/2/share/config
> /usr/kde/3.2/share/config /usr/kde/3/share/config
> /usr/lib/mozilla/defaults/pref /usr/share/config
> /usr/share/texmf/dvipdfm/config/ /usr/share/texmf/dvips/config/
> /usr/share/texmf/tex/generic/config/
> /usr/share/texmf/tex/platex/config/ /usr/share/texmf/xdvi/
> /var/qmail/control"
> CONFIG_PROTECT_MASK="/etc/afs/C /etc/afs/afsws /etc/gconf /etc/terminfo /etc/env.d"
> CXXFLAGS="-O3 -march=pentium4 -pipe -fomit-frame-pointer"
> DISTDIR="/usr/portage/distfiles"
> FEATURES="autoaddcvs ccache distcc sandbox"
> GENTOO_MIRRORS="http://128.213.5.34/gentoo/
> http://mirror.datapipe.net/gentoo
> ftp://mirrors.sec.informatik.tu-darmstadt.de/gentoo/
> http://gentoo.eliteitminds.com"
> MAKEOPTS="-j3"
> PKGDIR="/usr/portage/packages"
> PORTAGE_TMPDIR="/var/tmp"
> PORTDIR="/usr/portage"
> PORTDIR_OVERLAY=""
> SYNC="rsync://rsync.namerica.gentoo.org/gentoo-portage"
> USE="X Xaw3d acl acpi afs alsa apache2 apm arts avi berkdb bonobo caps crypt
> cups doc emacs emacs-w3 encode esd ethereal evo firebird flac foomaticdb gdbm
> gif gnome gpm gstreamer gtk gtk2 gtkhtml guile hardened icq imagemagick imap
> imlib innodb ipv6 jabber jack java jikes jpeg kde kerberos krb4 ldap libg++
> libwww mad mcal mikmod motif mozilla mpeg mysql ncurses nls odbc
> oggvorbis opengl oss pam pda pdflib perl plotutils png ppds prelude python qt quicktime
> readline ruby samba sasl sdl slang slp spell sse ssl svga tcltk tcpd tetex tiff
> truetype unicode usb vhosts x86 xinerama xml2 xmms xv zeo zlib"
> =======
>
Thanks again for the thoughtful reply, Josh.
--
-Kevin
--
gentoo-dev@gentoo.org mailing list
* Re: [gentoo-dev] Major MCE problem with SMP on Gentoo kernels
2004-05-12 9:31 ` Dan Podeanu
@ 2004-05-12 11:26 ` Kevin
0 siblings, 0 replies; 41+ messages in thread
From: Kevin @ 2004-05-12 11:26 UTC (permalink / raw
To: Gentoo Dev
On Wednesday 12 May 2004 05:31, Dan Podeanu wrote:
> Some (Greg KH and others) will say I'm crazy, but... have you tried to
> compile the kernel with gcc 2.95.3 instead of the latest Gentoo stable,
> 3.3.2? In my case it -has- helped on more than a couple of occasions.
Haven't tried that, but thanks for the suggestion and the hints of your
experience, Dan.
--
-Kevin
--
gentoo-dev@gentoo.org mailing list
* Re: [gentoo-dev] Major MCE problem with SMP on Gentoo kernels
2004-05-12 11:24 ` Kevin
@ 2004-05-12 11:48 ` Josh Glover
2004-05-12 12:14 ` Ciaran McCreesh
2004-05-12 13:58 ` Kevin
0 siblings, 2 replies; 41+ messages in thread
From: Josh Glover @ 2004-05-12 11:48 UTC (permalink / raw
To: Gentoo Dev
[-- Attachment #1: Type: text/plain, Size: 3654 bytes --]
Quoth Kevin (Wed 2004-05-12 07:24:07AM -0400):
> On Tuesday 11 May 2004 22:42, Josh Glover wrote:
>
> > Quoth Kevin (Tue 2004-05-11 02:07:58PM -0400):
> >
> > > The machine is a Dell PowerEdge1600SC with a PERC-3/SC SCSI RAID
> > > controller (using AMI megaraid2 driver) and a LSI Logic Corp
> > > controller (using Fusion MPT base driver) for the SCSI DAT and with
> > > dual 2.4GHz Xeon processors, each having a 512KB L2 Cache.
> >
> > Running Gentoo with a 2.6.5 SMP kernel on a Dell PowerEdge 400SC:
> > : jmglov@jglover; uname -a
> >
> > Linux jglover 2.6.5-gentoo-r1 #1 SMP Fri Apr 30 17:37:18 EDT 2004 i686
> > Intel(R) Pentium(R) 4 CPU 2.40GHz GenuineIntel GNU/Linux
> >
> > : jmglov@jglover; cat /proc/cpuinfo
> >
> > processor : 0
[...]
> > processor : 1
>
> Have you turned off hyperthreading? Why is it that only two CPUs show up?
> It looks like (flags include ht) the CPUs support hyperthreading... or am
> I way off base in drawing that conclusion here?
No, the only part where you went off base was in assuming that I have more
than one *physical* CPU in the box. I have one, and with hyperthreading
turned on, it looks like two to the kernel.
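The distinction between logical and physical CPUs can be checked from the
/proc/cpuinfo fields quoted above. What follows is only a minimal sketch in
Python, assuming the kernel reports a "physical id" line for each processor
(as the 2.6.5 output above does); the function name and the shortened sample
text are made up for illustration.

```python
def count_cpus(cpuinfo_text):
    """Return (logical, physical) CPU counts from /proc/cpuinfo-style text."""
    logical = 0
    physical_ids = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank line or a line without a "key : value" shape
        key = key.strip()
        value = value.strip()
        if key == "processor":
            logical += 1
        elif key == "physical id":
            physical_ids.add(value)
    # Without any "physical id" lines there is no way to group siblings,
    # so fall back to counting each logical CPU as its own package.
    physical = len(physical_ids) if physical_ids else logical
    return logical, physical


# Shortened sample modelled on the output quoted above (illustrative only).
SAMPLE = """\
processor : 0
physical id : 0
siblings : 2
processor : 1
physical id : 0
siblings : 2
"""

if __name__ == "__main__":
    logical, physical = count_cpus(SAMPLE)
    print(f"{logical} logical CPU(s) on {physical} physical package(s)")
```

On a real box the text would come from reading /proc/cpuinfo instead of the
sample string.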
> Well, last night I did the most exhaustive set of tests available on the
> Dell Utility Partition and found zero errors. I did 16 loops of the MATS
> test and 15 of TestY, TestA, and ECC coupling. The memory tests ran for
> about 8 hours. Again, zero errors.
[...]
> > I recommend doing at
> > least one Stage 1 install for newcomers to Gentoo, just for educational
> > purposes, but after that, go Stage 3 and use as many binary packages as
> > you can! There exists a Stage 3 tarball for your architecture--the
> > Pentium4 one, so why not use that, just to make sure your base system
> > is solid?
>
> That probably is a good thing for me to try next, and I really would like
> to be able to use all the advantages of Gentoo on this server. I'm fed
> up with rpm hell and the traditional distros' drawbacks.
Yes, I really think that if you use a Stage 3, you can feel pretty
confident about your C compiler and libraries being stable. Just be
careful when you compile your kernel!
> > > Are there some compiler flags or other configurable settings that, if
> > > set to certain values during the bootstrap.sh or emerge system steps,
> > > could end up generating software (perhaps when I built my own gcc?)
> > > that would cause these MCEs to be thrown?
> >
> > I dunno, why don't you post your CFLAGS and MAKEOPTS from your
> > make.conf here?
>
> Did already :)
Lost them in the spew, sorry.
> Here they are again:
>
> ACCEPT_KEYWORDS="x86"
This is unnecessary. You only need to use ACCEPT_KEYWORDS with the
unstable keywords, ~x86 in your case.
> CFLAGS="-O3 -march=pentium4 -pipe -fomit-frame-pointer"
Might want to back off to -O2, at least when you compile the kernel if
nothing else. I believe the handbook recommends -O2, and optimising too
highly can lead to some pretty bizarre problems.
> CXXFLAGS="-O3 -march=pentium4
I am not sure if you are setting this, or if this is just emerge, but
you usually want to set CXXFLAGS="${CFLAGS}".
> FEATURES="autoaddcvs ccache distcc sandbox"
autoaddcvs is a feature only for developers, you should not turn it on.
> MAKEOPTS="-j3"
You have four CPUs, so set this to -j5.
> Thanks again for the thoughtful reply, Josh.
Hey, I live to serve. :)
--
Josh Glover
GPG keyID 0xDE8A3103 (C3E4 FA9E 1E07 BBDB 6D8B 07AB 2BF1 67A1 DE8A 3103)
gpg --keyserver pgp.mit.edu --recv-keys DE8A3103
[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]
* Re: [gentoo-dev] Major MCE problem with SMP on Gentoo kernels
2004-05-12 11:48 ` Josh Glover
@ 2004-05-12 12:14 ` Ciaran McCreesh
2004-05-12 13:58 ` Kevin
1 sibling, 0 replies; 41+ messages in thread
From: Ciaran McCreesh @ 2004-05-12 12:14 UTC (permalink / raw
To: gentoo-dev
[-- Attachment #1: Type: text/plain, Size: 737 bytes --]
On Wed, 12 May 2004 07:48:22 -0400 Josh Glover <jmglov@gentoo.org>
wrote:
| > CFLAGS="-O3 -march=pentium4 -pipe -fomit-frame-pointer"
|
| Might want to back off to -O2, at least when you compile the kernel if
| nothing else. I believe the handbook recommends -O2, and optimising
| too highly can lead to some pretty bizarre problems.
Kernel compiles don't use CFLAGS from make.conf. Unless you edit the
kernel makefiles yourself, a sane set will be used. If you *do* edit the
makefiles, don't expect anyone to help when things go horribly wrong.
--
Ciaran McCreesh, Gentoo XMLcracy Member G03X276
(Sparc, MIPS, Vim, si hoc legere scis nimium eruditionis habes)
Mail: ciaranm at gentoo.org
Web: http://dev.gentoo.org/~ciaranm
[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]
* Re: [gentoo-dev] Major MCE problem with SMP on Gentoo kernels
2004-05-12 11:48 ` Josh Glover
2004-05-12 12:14 ` Ciaran McCreesh
@ 2004-05-12 13:58 ` Kevin
2004-05-12 14:44 ` Chris Gianelloni
` (2 more replies)
1 sibling, 3 replies; 41+ messages in thread
From: Kevin @ 2004-05-12 13:58 UTC (permalink / raw
To: Gentoo Dev
On Wednesday 12 May 2004 07:48, Josh Glover wrote:
> Quoth Kevin (Wed 2004-05-12 07:24:07AM -0400):
> >
> > Have you turned off hyperthreading? Why is it that only two CPUs
> > show up? It looks like (flags include ht) the CPUs support
> > hyperthreading... or am I way off base in drawing that conclusion
> > here?
>
> No, the only part where you went off base was in assuming that I have
> more than one *physical* CPU in the box. I have one, and with
> hyperthreading turned on, it looks like two to the kernel.
Oh.... Well, that seems like an important difference between yours and my
arrangements. Are you running Gentoo on a box with more than one
physical CPU? I had no problems running Gentoo on this box until after
installing a second CPU. That's when the weirdness started.
Is anyone here running Gentoo on a dual-physical CPU machine? What
compiler flags are you using?
> > > I dunno, why don't you post your CFLAGS and MAKEOPTS from your
> > > make.conf here?
> >
> > Did already :)
>
> Lost them in the spew, sorry.
>
> > Here they are again:
> >
> > ACCEPT_KEYWORDS="x86"
>
> This is unnecessary. You only need to use ACCEPT_KEYWORDS with the
> unstable keywords, ~x86 in your case.
Ok. Thanks.
>
> > CFLAGS="-O3 -march=pentium4 -pipe -fomit-frame-pointer"
>
> Might want to back off to -O2, at least when you compile the kernel if
> nothing else. I believe the handbook recommends -O2, and optimising too
> highly can lead to some pretty bizarre problems.
K. What CFLAGS are you using? (on both your single physical CPU box you
described above and with any multiple physical CPU boxes)
>
> > CXXFLAGS="-O3 -march=pentium4
>
> I am not sure if you are setting this, or if this is just emerge, but
> you usually want to set CXXFLAGS="${CFLAGS}".
I'll hafta look at that (well, I can't anymore). I thought I did have
CXXFLAGS="${CFLAGS}".
>
> > FEATURES="autoaddcvs ccache distcc sandbox"
Huh. That's odd. When I installed distcc, I changed the FEATURES as
indicated in the docs to list distcc, but I didn't add any of those
others. Wonder where they came from...
>
> autoaddcvs is a feature only for developers, you should not turn it on.
K.
>
> > MAKEOPTS="-j3"
>
> You have four CPUs, so set this to -j5.
I was thinking "number of physical CPUs + 1" here, but ok.
>
> > Thanks again for the thoughtful reply, Josh.
>
> Hey, I live to serve. :)
:) Thanks.
--
-Kevin
--
gentoo-dev@gentoo.org mailing list
* Re: [gentoo-dev] Major MCE problem with SMP on Gentoo kernels
2004-05-12 13:58 ` Kevin
@ 2004-05-12 14:44 ` Chris Gianelloni
2004-05-12 15:17 ` tom_gall
[not found] ` <40A23987.9080104@gentoo.org>
2 siblings, 0 replies; 41+ messages in thread
From: Chris Gianelloni @ 2004-05-12 14:44 UTC (permalink / raw
To: Kevin; +Cc: Gentoo Dev
[-- Attachment #1: Type: text/plain, Size: 1325 bytes --]
On Wed, 2004-05-12 at 09:58, Kevin wrote:
> Oh.... Well, that seems like an important difference between yours and my
> arrangements. Are you running Gentoo on a box with more than one
> physical CPU? I had no problems running Gentoo on this box until after
> installing a second CPU. That's when the weirdness started.
>
> Is anyone here running Gentoo on a dual-physical CPU machine? What
> compiler flags are you using?
I am running Gentoo on several dual-CPU machines, and even one quad-CPU
machine with no troubles at all.
CFLAGS="-march=pentium4 -O2 -pipe -fomit-frame-pointer"
> > > FEATURES="autoaddcvs ccache distcc sandbox"
>
> Huh. That's odd. When I installed distcc, I changed the FEATURES as
> indicated in the docs to list distcc, but I didn't add any of those
> others. Wonder where they came from...
They came from the defaults, which stay in effect unless you set
-autoaddcvs, etc. in your FEATURES. There's no need to try to remove
autoaddcvs, since it just won't work.
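To picture how a default list and a user's FEATURES setting combine, here is
a simplified model in Python. It is not Portage's actual code: the function
name is made up, and the default list is an assumption taken from the emerge
info output quoted earlier minus the distcc entry that was added by hand.

```python
def stack_features(defaults, user_setting):
    """Apply a user's FEATURES string on top of a list of default features.

    A plain token enables a feature; a token with a leading "-" disables it.
    """
    enabled = list(defaults)
    for token in user_setting.split():
        if token.startswith("-"):
            name = token[1:]
            enabled = [feature for feature in enabled if feature != name]
        elif token not in enabled:
            enabled.append(token)
    return enabled


# Assumed defaults, for illustration only.
ASSUMED_DEFAULTS = ["autoaddcvs", "ccache", "sandbox"]

if __name__ == "__main__":
    # Adding only distcc still leaves the defaults in the final list.
    print(sorted(stack_features(ASSUMED_DEFAULTS, "distcc")))
    # A leading "-" is what drops a default again.
    print(sorted(stack_features(ASSUMED_DEFAULTS, "distcc -autoaddcvs")))
```

Under these assumptions, setting only distcc gives the same four entries that
emerge info reported.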
> I was thinking "number of physical CPUs + 1" here, but ok.
You would probably get a speed boost going with -j5 rather than -j3, but
I would bench it myself before trusting anything from an external
source.
--
Chris Gianelloni
Developer, Gentoo Linux
Games Team
Is your power animal a penguin?
[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]
* Re: [gentoo-dev] Major MCE problem with SMP on Gentoo kernels
2004-05-12 13:58 ` Kevin
2004-05-12 14:44 ` Chris Gianelloni
@ 2004-05-12 15:17 ` tom_gall
2004-05-13 11:06 ` Kevin
[not found] ` <40A23987.9080104@gentoo.org>
2 siblings, 1 reply; 41+ messages in thread
From: tom_gall @ 2004-05-12 15:17 UTC (permalink / raw
To: Kevin; +Cc: Gentoo Dev
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Greetings,
Just to give this another perspective.
On Wednesday, May 12, 2004, at 08:58 AM, Kevin wrote:
> On Wednesday 12 May 2004 07:48, Josh Glover wrote:
>> Quoth Kevin (Wed 2004-05-12 07:24:07AM -0400):
>>>
>>> Have you turned off hyperthreading? Why is it that only two CPUs
>>> show up? It looks like (flags include ht) the CPUs support
>>> hyperthreading... or am I way off base in drawing that conclusion
>>> here?
>>
>> No, the only part where you went off base was in assuming that I have
>> more than one *physical* CPU in the box. I have one, and with
>> hyperthreading turned on, it looks like two to the kernel.
>
> Oh.... Well, that seems like an important difference between yours and
> my arrangements. Are you running Gentoo on a box with more than one
> physical CPU? I had no problems running Gentoo on this box until after
> installing a second CPU. That's when the weirdness started.
>
> Is anyone here running Gentoo on a dual-physical CPU machine? What
> compiler flags are you using?
All my ppc64 hardware is SMP, and runs gentoo just fine. I suspect you
have some specific intel-ish problem. Could be BIOS or any other variety
of problems.
>>> CFLAGS="-O3 -march=pentium4 -pipe -fomit-frame-pointer"
>>
>> Might want to back off to -O2, at least when you compile the kernel if
>> nothing else. I believe the handbook recommends -O2, and optimising
>> too highly can lead to some pretty bizarre problems.
>
> K. What CFLAGS are you using? (on both your single physical CPU box
> you described above and with any multiple physical CPU boxes)
CFLAGS="-O3 -mtune=power3" or CFLAGS="-O3 -mtune=g5", or just -mcpu=powerpc64
>>> MAKEOPTS="-j3"
>>
>> You have four CPUs, so set this to -j5.
>
> I was thinking "number of physical CPUs + 1" here, but ok.
I generally set it to number of physical CPUs x 2. For HMT systems (which
include certain ppc64 boxes as well) number of physical CPUs x 4 is
fine.
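The rules of thumb for MAKEOPTS mentioned in this thread can be compared side
by side. The Python sketch below is only illustrative: the function names are
made up, and the example numbers assume two physical hyperthreaded CPUs (four
logical), as discussed above. None of these is an official recommendation, so
benchmarking remains the real test.

```python
def jobs_logical_plus_one(physical_cpus, threads_per_cpu):
    """Logical CPUs + 1 (the reasoning behind -j5 for four logical CPUs)."""
    return physical_cpus * threads_per_cpu + 1


def jobs_physical_plus_one(physical_cpus):
    """Physical CPUs + 1 (the reasoning behind -j3 for two packages)."""
    return physical_cpus + 1


def jobs_physical_times(physical_cpus, factor=2):
    """Physical CPUs x factor (x2 in general, x4 suggested for HMT systems)."""
    return physical_cpus * factor


def makeopts(jobs):
    """Format a job count as a make.conf MAKEOPTS line."""
    return f'MAKEOPTS="-j{jobs}"'


if __name__ == "__main__":
    # Two physical hyperthreaded CPUs, i.e. four logical CPUs.
    for jobs in (jobs_logical_plus_one(2, 2),
                 jobs_physical_plus_one(2),
                 jobs_physical_times(2)):
        print(makeopts(jobs))
```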
Regards,
Tom
Tom Gall
gentoo-ppc64 lead -- God started with stage 1, shouldn't you?
tgall aatt gentoo.org
tgall aatt uberh4x0r.org
tom_gall aatt mac.com
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.3 (Darwin)
iD4DBQFAokAGNM6ZoaBWhQkRAhkFAJdpTF240+JymZBXCkgBuXmvQntBAJ9ovulG
LLYj/Xm4J2rSNQqQ/h1VdQ==
=JRtW
-----END PGP SIGNATURE-----
--
gentoo-dev@gentoo.org mailing list
* [gentoo-dev] memtest86 fails? (was Re: [gentoo-dev] Major MCE problem with SMP on Gentoo kernels)
[not found] ` <40A23987.9080104@gentoo.org>
@ 2004-05-12 16:22 ` Kevin
2004-05-12 16:59 ` Greg KH
2004-05-12 17:15 ` Sven Vermeulen
0 siblings, 2 replies; 41+ messages in thread
From: Kevin @ 2004-05-12 16:22 UTC (permalink / raw
To: Gentoo Dev
Thank you Chris, Bret, and Heiko for your replies, both on- and off-list.
Your replies all look like good suggestions and I'm going to try them, but
first, I have to ask one other thing.
As I said earlier in this thread, I found zero errors in an 8 hour
exhaustive memory test (using the Dell-provided Utility Partition tests)
running 15-16 loops, 4 different tests, and found zero errors from a
complete hardware test (also from the Utility Partition).
But just on a lark, I decided to try memtest86 3.0 and 3.1a as well, and
both are turning up errors all over the place. I'm skeptical that
memtest86 is giving me accurate information because (a) it's finding so
many, (b) I can't seem to get it to turn on ECC mode (yet the Dell
utilities did test this, and the CMOS reports that it is ECC memory), (c)
it runs for only 10 seconds or so and begins finding errors, and (d) it
locks up after about 2-5 minutes. Last time I ran it, the error count
was up to 109 after 5 minutes and then it locked up.
Any thoughts on this?
Is this bad memory (in spite of the Dell tests all turning up flawless) or
is memtest86 getting something wrong (like amount of installed memory to
test)? Any way to tell for sure? I've looked at the docs for memtest86
and they talk about the possibility of memtest86 incorrectly determining
the amount of memory to test and that seems likely in my case. The first
10 or so errors are all at the same address (0003fffdc80) 1023.8MB, then
there are 4 or so at 64.0MB, then 1 or 2 at 0.6MB, then it locks up.
The Dell utilities report testing only up to 1022MB of memory.
Is this a case of memtest86 getting the installed memory count wrong?
When I look at the CMOS/BIOS settings, the System memory is 1024MB, and
that's what the Dell utilities also report initially, but the tests
themselves are only being run on the first 1022MB, according to the test
reports.
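One way to sanity-check the comparison above is to convert the failing hex
address to megabytes and see whether it lies beyond the 1022MB the Dell tests
cover. This is a rough Python sketch with made-up helper names, assuming the
address is a plain byte offset and that "MB" means 2^20 bytes. The hex address
works out to just under 1024MB, while the errors reported at 64.0MB and 0.6MB
would fall inside the range the Dell tests did cover.

```python
MIB = 1024 * 1024

# Failing address as reported by memtest86 in this message.
FAILING_ADDR = int("0003fffdc80", 16)


def addr_to_mib(addr):
    """Convert a byte address to mebibytes."""
    return addr / MIB


def beyond_tested_range(addr, tested_mib):
    """True if the address lies at or above the limit another tool tested."""
    return addr_to_mib(addr) >= tested_mib


if __name__ == "__main__":
    print(f"failing address is at {addr_to_mib(FAILING_ADDR):.2f} MiB")
    print("beyond 1022MB:", beyond_tested_range(FAILING_ADDR, 1022))
    print("64MB beyond 1022MB:", beyond_tested_range(64 * MIB, 1022))
```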
Thanks.
-Kevin
--
gentoo-dev@gentoo.org mailing list
* Re: [gentoo-dev] memtest86 fails? (was Re: [gentoo-dev] Major MCE problem with SMP on Gentoo kernels)
2004-05-12 16:22 ` [gentoo-dev] memtest86 fails? (was Re: [gentoo-dev] Major MCE problem with SMP on Gentoo kernels) Kevin
@ 2004-05-12 16:59 ` Greg KH
2004-05-12 17:18 ` Scott Myron
2004-05-12 17:15 ` Sven Vermeulen
1 sibling, 1 reply; 41+ messages in thread
From: Greg KH @ 2004-05-12 16:59 UTC (permalink / raw
To: Kevin; +Cc: Gentoo Dev
On Wed, May 12, 2004 at 12:22:15PM -0400, Kevin wrote:
>
> Any thoughts on this?
Trust memtest86, it is known to exercise memory quite well, and finds
real errors.
I wouldn't trust the dell "tests" at all, as who knows what they are
really testing...
Sounds like you have hardware problems.
greg k-h
--
gentoo-dev@gentoo.org mailing list
* Re: [gentoo-dev] memtest86 fails? (was Re: [gentoo-dev] Major MCE problem with SMP on Gentoo kernels)
2004-05-12 16:22 ` [gentoo-dev] memtest86 fails? (was Re: [gentoo-dev] Major MCE problem with SMP on Gentoo kernels) Kevin
2004-05-12 16:59 ` Greg KH
@ 2004-05-12 17:15 ` Sven Vermeulen
1 sibling, 0 replies; 41+ messages in thread
From: Sven Vermeulen @ 2004-05-12 17:15 UTC (permalink / raw
To: Gentoo Dev
[-- Attachment #1: Type: text/plain; charset=unknown-8bit, Size: 1182 bytes --]
On Wed, May 12, 2004 at 12:22:15PM -0400, Kevin wrote:
> As I said earlier in this thread, I found zero errors in an 8 hour
> exhaustive memory test (using the Dell-provided Utility Partition tests)
> running 15-16 loops, 4 different tests, and found zero errors from a
> complete hardware test (also from the Utility Partition).
>
> But just on a lark, I decided to try memtest86 3.0 and 3.1a as well, and
> both are turning up errors all over the place.
We had exactly the same error; PowerEdge 1600 SC. I thought it was because of
Dell's memory chips, but when I stuck others in (dunno what vendor though)
I received the same issues (with memtest86) again.
A different motherboard resolved the issue. I never had the MCE errors
though (but then again, it didn't - and still doesn't - run Gentoo).
Just my 2 ¢,
Sven Vermeulen
--
Bent Hindrup Andersen, Danish MEP, about the Software Patent Directive:
The approach of the Commission and Council in this directive is shocking.
They are making full use of all the possibilities of evading democracy that
the current Community Law provides. <http://lwn.net/Articles/84009/>
[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]
* Re: [gentoo-dev] memtest86 fails? (was Re: [gentoo-dev] Major MCE problem with SMP on Gentoo kernels)
2004-05-12 16:59 ` Greg KH
@ 2004-05-12 17:18 ` Scott Myron
0 siblings, 0 replies; 41+ messages in thread
From: Scott Myron @ 2004-05-12 17:18 UTC (permalink / raw
To: gentoo-dev; +Cc: Kevin
Greg KH wrote:
> On Wed, May 12, 2004 at 12:22:15PM -0400, Kevin wrote:
>
>>Any thoughts on this?
>
You may also want to try removing all but one stick of memory, and rerun
memtest. If that fails, replace that with another stick of memory, and
rerun the test again. This will help you find out which stick of memory
is bad. If you only have one stick of memory, borrow some from a friend,
if possible...
You might also want to try testing your memory in a friend's machine, to
verify the results. It's possible that there is a problem with the
traces on the motherboard from the northbridge to the memory
slots (unlikely, but possible).
Scott
--
gentoo-dev@gentoo.org mailing list
* Re: [gentoo-dev] Major MCE problem with SMP on Gentoo kernels
2004-05-12 15:17 ` tom_gall
@ 2004-05-13 11:06 ` Kevin
2004-05-13 11:12 ` Senor Rodgman
` (3 more replies)
0 siblings, 4 replies; 41+ messages in thread
From: Kevin @ 2004-05-13 11:06 UTC (permalink / raw
To: Gentoo Dev
On Wednesday 12 May 2004 11:17, tom_gall@mac.com wrote:
> Greetings,
>
> Just to give this another perspective.
>
[...]
Thanks for your reply, Tom. At least I know it should be doable.
To all who've commented on this thread, thanks again.
I've now tried a stage 3 installation booting the 2.6.1 SMP kernel from a
2004.0 LiveCD (the SMP configs on 2004.1 LiveCDs are all broken---see bug
#49382).
I had no lockup problems while running that kernel, but after rebooting
with my kernel (gentoo-sources, built with Chris's
CFLAGS="-march=pentium4 -O2 -pipe -fomit-frame-pointer") I've already
suffered two lockups. I set MAKEOPTS="-j1" for safety.
Something very weird here. Next, I'm going to try booting from the cd and
chrooting into my system and then doing more extensive testing of the
kernel on the cd, but I'm really running out of options here. I'll
probably also try building another kernel with CFLAGS="-march=pentium3
-O2 -pipe". Any other suggestions?
Does anyone think that my two CPUs having different stepping levels could
have anything to do with this problem? One is level 7 and the other 9.
Greg KH thinks it's bad memory, but I'm skeptical of that because the main
address that fails (some 30 times in a row) is at 1023.8MB and the Dell
Utilities only test up to 1022MB, and because I haven't seen the problem
with the liveCD kernel.
--
-Kevin
* Re: [gentoo-dev] Major MCE problem with SMP on Gentoo kernels
2004-05-13 11:06 ` Kevin
@ 2004-05-13 11:12 ` Senor Rodgman
2004-05-13 13:04 ` Chris Gianelloni
` (2 subsequent siblings)
3 siblings, 0 replies; 41+ messages in thread
From: Senor Rodgman @ 2004-05-13 11:12 UTC (permalink / raw
To: Kevin; +Cc: Gentoo Dev
On Thu, 13 May 2004, Kevin wrote:
> I had no lockup problems while running that kernel, but after rebooting
> with my kernel (gentoo-sources, built with Chris's
> CFLAGS="-march=pentium4 -O2 -pipe -fomit-frame-pointer") I've already
> suffered two lockups. I set MAKEOPTS="-j1" for safety.
> Greg KH thinks it's bad memory, but I'm skeptical of that because the main
> address that fails (some 30 times in a row) is at 1023.8MB and the Dell
> Utilities only test up to 1022MB, and because I haven't seen the problem
> with the liveCD kernel.
I had some similar problems recently (on dual athlon), where it was
running oldish kernels (2.5.4 I think) OK, but 2.6.5 & later wouldn't
boot. Memtest reported bad memory; booting the new kernels with a suitable
mem= parameter confirmed this (they then booted fine, and continue to be
fine with replacement memory). So I recommend checking the memory.
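As a rough illustration of the mem= approach described above, here is a minimal Python sketch that turns the lowest failing address reported by a memory tester into a boot argument that keeps the kernel below it. The helper name mem_limit_arg is hypothetical, and the sketch assumes the fault sits near the top of RAM, since mem= can only cut memory off from the top.

```python
import math


def mem_limit_arg(lowest_failing_mb):
    """Build a mem= boot argument that stops just below a failing address.

    lowest_failing_mb is the lowest failing address in MB as reported by
    the memory tester (for example 1023.8). The kernel is told to use only
    the whole megabytes below that address.
    """
    usable_mb = math.floor(lowest_failing_mb)
    if usable_mb < 1:
        raise ValueError("failing address is too low for mem= to help")
    return "mem={}M".format(usable_mb)


if __name__ == "__main__":
    # Failing address taken from the thread; the result goes on the kernel
    # command line in the boot loader configuration.
    print(mem_limit_arg(1023.8))
```

If the system is stable when booted with the resulting argument and unstable without it, that points at the excluded region. It does not help when failures are also reported at low addresses.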
dave
* Re: [gentoo-dev] Major MCE problem with SMP on Gentoo kernels
2004-05-13 11:06 ` Kevin
2004-05-13 11:12 ` Senor Rodgman
@ 2004-05-13 13:04 ` Chris Gianelloni
2004-05-13 15:04 ` Daniel Drake
2004-05-13 15:54 ` Greg KH
3 siblings, 0 replies; 41+ messages in thread
From: Chris Gianelloni @ 2004-05-13 13:04 UTC (permalink / raw
To: Kevin; +Cc: Gentoo Dev
On Thu, 2004-05-13 at 07:06, Kevin wrote:
> I've now tried a stage 3 installation booting the 2.6.1 SMP kernel from a
> 2004.0 LiveCD (the SMP configs on 2004.1 LiveCDs are all broken---see bug
> #49382).
I know this isn't exactly what you're looking for, but I have a CD
(actually, a GameCD beta) available at
http://dev.gentoo.org/~wolf31o2/x86-ut2004demo-20040420.iso that you
could grab. It has only one kernel, and it is SMP. It has booted and
worked successfully on every machine I have tried it on, and even has
X+fluxbox on it.
> I had no lockup problems while running that kernel, but after rebooting
> with my kernel (gentoo-sources, built with Chris's
> CFLAGS="-march=pentium4 -O2 -pipe -fomit-frame-pointer") I've already
> suffered two lockups. I set MAKEOPTS="-j1" for safety.
Copy the kernel from my CD and the /lib/modules/2.6.5-gentoo-r1 and see
if that kernel works fine on your machine in your build environment.
> Something very weird here. Next, I'm going to try booting from the cd and
> chrooting into my system and then doing more extensive testing of the
> kernel on the cd, but I'm really running out of options here. I'll
> probably also try building another kernel with CFLAGS="-march=pentium3
> -O2 -pipe". Any other suggestions?
Try my CD... it works in SMP. That will help test some of the problems,
especially since the kernel you are "testing" with is not SMP, so you're
not really testing anything.
> Does anyone think that my two CPUs having different stepping levels could
> have anything to do with this problem? One is level 7 and the other 9.
It is possible that is causing the problem. You never really know. I
*doubt* it should be a problem, unless one CPU is running out of spec.
> Greg KH thinks it's bad memory, but I'm skeptical of that because the main
> address that fails (some 30 times in a row) is at 1023.8MB and the Dell
> Utilities only test up to 1022MB, and because I haven't seen the problem
> with the liveCD kernel.
It still could be bad memory. I think I would trust memtest86 before
the Dell utilities. You could also try finding another bootable system
checker. I'm sure there are plenty available.
--
Chris Gianelloni
Developer, Gentoo Linux
Games Team
Is your power animal a penguin?
* Re: [gentoo-dev] Major MCE problem with SMP on Gentoo kernels
2004-05-13 11:06 ` Kevin
2004-05-13 11:12 ` Senor Rodgman
2004-05-13 13:04 ` Chris Gianelloni
@ 2004-05-13 15:04 ` Daniel Drake
2004-05-13 15:54 ` Greg KH
3 siblings, 0 replies; 41+ messages in thread
From: Daniel Drake @ 2004-05-13 15:04 UTC (permalink / raw
To: Kevin; +Cc: Gentoo Dev
Hi Kevin,
Kevin wrote:
> Greg KH thinks it's bad memory, but I'm skeptical of that because the main
> address that fails (some 30 times in a row) is at 1023.8MB and the Dell
> Utilities only test up to 1022MB, and because I haven't seen the problem
> with the liveCD kernel.
Although I've very rarely dealt with SMP systems, I've seen many unstable
systems being diagnosed by various memory testing utilities as OK. As soon as
you run memtest, errors come up, and replacing the faulty memory amazingly
brings system stability again.
If your RAM is always producing errors in the same place (and only in 1 place)
then you might want to google for BadMem/BadRAM. These are two flavours of
kernel patches which allow you to ask the kernel to ignore specific blocks of
memory. You can even get memtest86 to output the exact parameters you need
based on memory faults it finds. This should allow you to ignore the faulty
part of the memory and continue on with the remaining ~1020MB or so.
Daniel
* Re: [gentoo-dev] Major MCE problem with SMP on Gentoo kernels
2004-05-13 11:06 ` Kevin
` (2 preceding siblings ...)
2004-05-13 15:04 ` Daniel Drake
@ 2004-05-13 15:54 ` Greg KH
2004-05-18 8:29 ` Kevin
3 siblings, 1 reply; 41+ messages in thread
From: Greg KH @ 2004-05-13 15:54 UTC (permalink / raw
To: Kevin; +Cc: Gentoo Dev
On Thu, May 13, 2004 at 07:06:12AM -0400, Kevin wrote:
>
> Greg KH thinks it's bad memory,
It's not only me, it's memtest86 saying it :)
> but I'm skeptical of that because the main address that fails (some 30
> times in a row) is at 1023.8MB and the Dell Utilities only test up to
> 1022MB, and because I haven't seen the problem with the liveCD kernel.
Maybe that's the fault of the Dell utilities. Seriously, I trust
memtest86 over any other vendor specific test. If you don't want to
believe it, that's fine, but I would really consider fixing that issue
before trying to point the finger at the kernel or the Gentoo install.
thanks,
greg k-h
* Re: [gentoo-dev] Major MCE problem with SMP on Gentoo kernels
2004-05-13 15:54 ` Greg KH
@ 2004-05-18 8:29 ` Kevin
2004-05-18 10:59 ` Alexander Futasz
` (2 more replies)
0 siblings, 3 replies; 41+ messages in thread
From: Kevin @ 2004-05-18 8:29 UTC (permalink / raw
To: gentoo-dev
Again, thanks to all who have commented on this thread. I've now done
some more testing and have some other interesting (though also confusing)
results to report.
On Thursday 13 May 2004 11:54, Greg KH wrote:
> On Thu, May 13, 2004 at 07:06:12AM -0400, Kevin wrote:
> > Greg KH thinks it's bad memory,
>
> It's not only me, it's memtest86 saying it :)
True. Although it is locking up after only 1-2 minutes of operation.
What conclusion should I draw from that?
>
> > but I'm skeptical of that because the main address that fails (some
> > 30 times in a row) is at 1023.8MB and the Dell Utilities only test up
> > to 1022MB, and because I haven't seen the problem with the liveCD
> > kernel.
>
> Maybe that's the fault of the Dell utilities. Seriously, I trust
> memtest86 over any other vendor specific test. If you don't want to
> believe it, that's fine, but I would really consider fixing that issue
> before trying to point the finger at the kernel or the Gentoo install.
You're right, Greg. I finally took your advice and did some serious
testing with the DIMM sticks. This box has 4 slots, DIMMA-DIMMD, and
here's what I've done:
1) swapped one 512MB stick for the other in DIMMA/DIMMB (reversed their
positions)
2) removed one 512MB stick from DIMMB (configs require filling from DIMMA
up)
3) removed the other 512MB stick (so that now I've tried each stick in
DIMMA all by itself, and no sticks in any of the other slots)
4) completely replaced each 512MB stick with new ones from Dell and did
all of 1-3 above with the new sticks.
In every case, memtest86 v3.0, memtest86 v3.1a, and memtest86+ v1.0 all behave
very similarly. They show 1023.8MB (or 511.8MB if only one stick is
installed) failing repeatedly (some 30 or 40 times), and then follow one of
two patterns:
(a) 304.5MB fails, followed by three more failures at 1023.8MB (or
511.8MB), and then the program locks up; or
(b) three failures at 64.0MB, three more at 1023.8MB (or 511.8MB), one more
at 64.0MB, one at 0.6MB, one more at 1023.8MB (or 511.8MB), and then the
program locks up.
Since I had the extra sticks, I also tried testing with all 4 slots filled
and got very similar results to those described above, except the
repeatedly failing address was 2047.8MB (in all cases, 512MB, 1024MB, and
2048MB, the repeatedly failing address is 0.2MB below the max).
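The "0.2MB below the max" observation can be checked with a few lines of arithmetic. This is only a sketch using the three installed sizes and failing addresses reported above; the names OBSERVATIONS and offsets_from_top are made up for illustration.

```python
# Installed memory (MB) mapped to the repeatedly failing address (MB),
# as reported in the message above.
OBSERVATIONS = {512: 511.8, 1024: 1023.8, 2048: 2047.8}


def offsets_from_top(observations):
    """Return the distance of each failing address from the top of memory."""
    return {
        total: round(total - failing, 1)
        for total, failing in observations.items()
    }


if __name__ == "__main__":
    for total, offset in sorted(offsets_from_top(OBSERVATIONS).items()):
        print("{} MB installed: failure {} MB below the top".format(total, offset))
```

A constant offset from the top of memory, whichever sticks are installed, suggests the failure follows the address layout rather than one particular DIMM.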
There are no intermittent failing addresses---there are two very specific
patterns to the failures, and the program always locks up after following
one pattern or the other.
In all of the memory configurations I tried, the Dell utilities reported
no memory errors (or any other hardware errors).
Although I'm sure there are others here with more experience
troubleshooting such problems, I'm thinking that the above is enough to
base a pretty sound conclusion upon, and the conclusion I would draw is
that hardware and memory are not the cause of these MCE problems. I
welcome anyone contradicting that conclusion because I've never seen
anything like this before and I'm at a loss on how to resolve it. I'm
tempted to try replacing one of the CPUs to see if identical stepping
levels (my CPU0 is stepping level 7 and CPU1 is level 9, but they are
otherwise identical) will resolve the problem.
I also tried getting memtest86 (and variants) to let me turn on the ECC
portion of the tests, to no avail. When I tried sizing the memory, the
"probe" setting returned 1024MB, the "use bios std" setting returned
1024MB, and the "use bios all" setting locked the program up.
I also tried something else that had an enormous positive effect on the
situation---I changed -march=pentium4 to -march=pentium3 in my CFLAGS and
built another kernel with identical .config settings. With that kernel
running, I did some 2-4 hours of solid compiling work, emerging and
re-emerging packages like mysql, cyrus-sasl, cyrus-imapd, mit-krb5,
openafs, etc. But unfortunately, this kernel also ended up freezing
after doing more of the same, and it did so with the same error message
MCE 0000000000000004.
I tried using parsemce.c from
http://www.codemonkey.org.uk/cruft/parsemce.c/. I built it and ran it,
but it wasn't very helpful and I'm not quite sure what I'm supposed to do
with it.
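For what it is worth, the number in the panic message can be read by hand. The following Python sketch assumes that the value printed after "Machine Check Exception" is the global machine-check status register (IA32_MCG_STATUS), which is an assumption about this kernel's output and not something confirmed in the thread. The three low bits of that register are architecturally defined; the function name decode_mcg_status is hypothetical.

```python
# Architecturally defined low bits of IA32_MCG_STATUS.
MCG_STATUS_BITS = {
    0: "RIPV",  # restart IP valid: execution can resume at the saved IP
    1: "EIPV",  # error IP valid: the saved IP is tied to the error
    2: "MCIP",  # machine check in progress
}


def decode_mcg_status(value):
    """Return the names of the known flag bits set in an MCG_STATUS value."""
    return [
        name
        for bit, name in sorted(MCG_STATUS_BITS.items())
        if value & (1 << bit)
    ]


if __name__ == "__main__":
    # Value as reported in the thread: MCE 0000000000000004
    print(decode_mcg_status(int("0000000000000004", 16)))
```

Under that assumption, 0x4 sets only MCIP with RIPV clear: a machine check was raised and the CPU cannot safely resume. The value does not identify the failing unit; that detail lives in the per-bank status registers, which the panic line quoted here does not include.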
Chris, I'm going to try your kernel. Thanks for offering that. I'll
relate whatever I learn from that test.
Again, I really appreciate all the thoughtful replies on what to try next
to resolve this problem. If there are any others, or if anyone has
suggestions on what to try next, I'd love to hear them. Perhaps I could
send my .config file to someone and they could try cross-compiling a
kernel for me to try running?
Thanks again.
--
-Kevin
* Re: [gentoo-dev] Major MCE problem with SMP on Gentoo kernels
2004-05-18 8:29 ` Kevin
@ 2004-05-18 10:59 ` Alexander Futasz
2004-05-18 12:02 ` Josh Glover
2004-05-18 12:46 ` [gentoo-dev] " Daniel Drake
2 siblings, 0 replies; 41+ messages in thread
From: Alexander Futasz @ 2004-05-18 10:59 UTC (permalink / raw
To: gentoo-dev
On Tue, 18 May 2004 04:29:58 -0400, Kevin wrote:
> Again, thanks to all who have commented on this thread. I've now done
> some more testing and have some other interesting (though also
> confusing) results to report.
>
> On Thursday 13 May 2004 11:54, Greg KH wrote:
> > On Thu, May 13, 2004 at 07:06:12AM -0400, Kevin wrote:
> > > Greg KH thinks it's bad memory,
> >
> > It's not only me, it's memtest86 saying it :)
[...]
> You're right, Greg. I finally took your advice and did some serious
> testing with the DIMM sticks. [...]
> In every case, memtest86 v3.0, memtest86 v3.1a, memtest86+ v1.0 all
> behave very similarly. That is [...] the program locks up.
>
> In all of the memory configurations I tried, the Dell utilities
> reported no memory errors (or any other hardware errors).
I think you missed this one reply to your posts:
http://article.gmane.org/gmane.linux.gentoo.devel/17942
* Re: [gentoo-dev] Major MCE problem with SMP on Gentoo kernels
2004-05-18 8:29 ` Kevin
2004-05-18 10:59 ` Alexander Futasz
@ 2004-05-18 12:02 ` Josh Glover
2004-05-19 17:48 ` Kevin
2004-05-18 12:46 ` [gentoo-dev] " Daniel Drake
2 siblings, 1 reply; 41+ messages in thread
From: Josh Glover @ 2004-05-18 12:02 UTC (permalink / raw
To: gentoo-dev
Quoth Kevin (Tue 2004-05-18 04:29:58AM -0400):
> On Thursday 13 May 2004 11:54, Greg KH wrote:
>
> > On Thu, May 13, 2004 at 07:06:12AM -0400, Kevin wrote:
> >
> > > Greg KH thinks it's bad memory,
> >
> > It's not only me, it's memtest86 saying it :)
>
> True. Although it is locking up after only 1-2 minutes of operation.
> What conclusion should I draw from that?
Bad system board. :(
> Although I'm sure there are others here with more experience
> troubleshooting such problems, I'm thinking that the above is enough to
> base a pretty sound conclusion upon, and the conclusion I would draw is
> that hardware and memory are not the cause of these MCE problems.
Wrong. memtest86 giving you errors almost always indicates a hardware
problem. You have changed the memory, but what remained consistent? The
memory bus! Try a new system board.
> I also tried something else that had an enormous positive effect on the
> situation---I changed -march=pentium4 to -march=pentium3 in my CFLAGS
All you have done is turn off SSE2 instructions and possibly a few
others that the P4s have and the P3s do not. If something is wrong with
your system board or CPU, less stress on the CPU is likely not to show
problems as often.
You have bad hardware, Kevin. Try the compile test with one CPU at a
time (i.e. take one out), and if that is not illuminating, replace the
system board.
--
Josh Glover
Gentoo Developer (http://dev.gentoo.org/~jmglov/)
Tokyo Linux Users Group Listmaster (http://www.tlug.jp/)
GPG keyID 0xDE8A3103 (C3E4 FA9E 1E07 BBDB 6D8B 07AB 2BF1 67A1 DE8A 3103)
gpg --keyserver pgp.mit.edu --recv-keys DE8A3103
* Re: [gentoo-dev] Major MCE problem with SMP on Gentoo kernels
2004-05-18 8:29 ` Kevin
2004-05-18 10:59 ` Alexander Futasz
2004-05-18 12:02 ` Josh Glover
@ 2004-05-18 12:46 ` Daniel Drake
2 siblings, 0 replies; 41+ messages in thread
From: Daniel Drake @ 2004-05-18 12:46 UTC (permalink / raw
To: Kevin; +Cc: gentoo-dev
Hi,
Kevin wrote:
> Although I'm sure there are others here with more experience
> troubleshooting such problems, I'm thinking that the above is enough to
> base a pretty sound conclusion upon, and the conclusion I would draw is
> that hardware and memory are not the cause of these MCE problems. I
> welcome anyone contradicting that conclusion because I've never seen
> anything like this before and I'm at a loss on how to resolve it. I'm
> tempted to try replacing one of the CPUs to see if identical stepping
> levels (my CPU0 is stepping level 7 and CPU1 is level 9, but they are
> otherwise identical) will resolve the problem.
I have seen similar behaviour in previous experience (with uniprocessor
boards), i.e. all memory sticks (even confirmed good ones) bring up errors in
the same place when plugged into the board in question.
I've never attempted to look in detail for the cause of the problem, you'd
suspect a faulty memory controller of some sort. In my experience, I've just
replaced the board, and that has solved the problem.
Daniel
* Re: [gentoo-dev] Major MCE problem with SMP on Gentoo kernels
2004-05-18 12:02 ` Josh Glover
@ 2004-05-19 17:48 ` Kevin
2004-05-20 12:19 ` [gentoo-dev] SOLVED: " Kevin
2004-05-20 21:16 ` Kevin
0 siblings, 2 replies; 41+ messages in thread
From: Kevin @ 2004-05-19 17:48 UTC (permalink / raw
To: gentoo-dev
Thanks again for the replies, folks.
Well, I've now replaced the system motherboard, the CPU (I first tried
removing one CPU, and memtest86 behaved the exact same way; then I replaced
the CPU with the new one), and the RAM. Results: memtest86 and friends
all behave the exact same way. Could this still be a hardware problem?
I'm hard-pressed to believe that I have two different motherboards that
just happen to suffer from the same flaw (they are not even the same
exact version: one is version B2 and the other is version C4).
The only things that are common between the system now and the system
before are:
(1) the SCSI controller card (RAID card) (another SCSI controller was
replaced with the m/b),
(2) 2 SCSI hard drives connected to the RAID card,
(3) a PCI hardware controller based modem, and
(4) the SCSI hot-plug backplane.
Could one of these be causing the problem? I haven't tried reproducing my
MCE 0004 error again, but memtest86 shows no difference. Can anyone buy
into the notion now that memtest86 is doing something that it shouldn't be
doing when testing this system? Again, the Dell Utilities are all turning
up flawless.
I've set the configuration in memtest86 to limit the address range it
tests to those addresses below 1022MB of RAM (this is what the Dell
utilities test with 1024MB RAM installed), but it ignores those limits and
tests up to 1024MB anyway, and that's where it's still finding its errors
(1023.8MB). I've configured memtest86 to turn on ECC testing and it refuses
to do so (when I touch (8) for restart tests, the setting returns to off).
What's going on here?
Any thoughts are most welcome. I'll be trying to reproduce my MCE error
with this new hardware, and I'll post results when I have them.
Thanks again for all the replies.
On Tuesday 18 May 2004 08:02, Josh Glover wrote:
> Quoth Kevin (Tue 2004-05-18 04:29:58AM -0400):
[...]
> > True. Although it is locking up after only 1-2 minutes of operation.
> > What conclusion should I draw from that?
>
> Bad system board. :(
I just replaced it. Still does the same thing.
>
> > Although I'm sure there are others here with more experience
> > troubleshooting such problems, I'm thinking that the above is enough
> > to base a pretty sound conclusion upon, and the conclusion I would
> > draw is that hardware and memory are not the cause of these MCE
> > problems.
>
> Wrong. memtest86 giving you errors almost always indicates a hardware
> problem. You have changed the memory, but what remained consistent? The
> memory bus! Try a new system board.
New system board includes a new memory bus. Still get the same results.
>
> > I also tried something else that had an enormous positive effect on
> > the situation---I changed -march=pentium4 to -march=pentium3 in my
> > CFLAGS
>
> All you have done is turn off SSE2 instructions and possibly a few
> others that the P4s have and the P3s do not. If something is wrong with
> your system board or CPU, less stress on the CPU is likely not to show
> problems as often.
That's a good point. I'll try reproducing the MCE now with the new
hardware.
>
> You have bad hardware, Kevin. Try the compile test with one CPU at a
> time (i.e. take one out), and if that is not illuminating, replace the
> system board.
Thanks again gents!
-Kevin
* Re: [gentoo-dev] SOLVED: Major MCE problem with SMP on Gentoo kernels
2004-05-19 17:48 ` Kevin
2004-05-20 12:19 ` [gentoo-dev] SOLVED: " Kevin
@ 2004-05-20 21:16 ` Kevin
2004-05-20 21:32 ` Greg KH
` (2 more replies)
1 sibling, 3 replies; 41+ messages in thread
From: Kevin @ 2004-05-20 21:16 UTC (permalink / raw
To: gentoo-dev
Hi All-
A final note to this thread.
After many hours of high-intensity CPU activity (like emerging
many packages---which is what used to cause the MCE) since replacing my
stepping level 7 Xeon with a stepping level 9 Xeon (so that I now have
two identical CPUs, even in stepping level, which was not true
before), I have been unable to reproduce my MCE 0004 error. I even did
this with the kernel compiled with -march=pentium4 CFLAGS (the setup that
caused an MCE after 5 or 10 minutes of emerging mysql with the stepping
level 7 and stepping level 9 CPUs installed).
Naturally, I'm delighted by this; however, the whole experience has been
somewhat confusing (although enlightening in many respects).
Memtest86 still behaves exactly as it did before the hardware replacement.
Any thoughts on why it behaves this way with this hardware (unable to set
address range limits, unable to force ECC testing on, program locks after
2 minutes of operation)? I suppose that's a question for another thread
in another forum.
I seem to have suffered from no hardware failures on the M/B, the CPUs
(one of the old CPUs is still present---I replaced the other), or the RAM
(although I suppose the stepping level 7 Xeon might have had some
incredibly subtle flaw that only showed up with another CPU present).
The replacement hardware seems to suffer no problems at all, in spite of
what Memtest86 does (fails at 1023.8MB 30 or 40 times and then freezes).
I really appreciate all of the suggestions here. You guys convinced me
that it was hardware, which is why I replaced everything, and that
ultimately solved the problem, although it's not clear that there was
really a hardware problem. The lesson I've learned (though I'm not sure
this is really the root issue) is that when doing multi-processor
computing, make sure that both processors are identical in every way.
Any thoughts on the accuracy of this rule?
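A quick way to check that rule on a running box is to compare the family, model and stepping fields that Linux reports per CPU in /proc/cpuinfo. The Python sketch below is illustrative only: the function names are made up, and the SAMPLE text is hypothetical apart from the stepping values 7 and 9 reported in this thread.

```python
def cpu_signatures(cpuinfo_text):
    """Return one (family, model, stepping) tuple per CPU block."""
    signatures = []
    # /proc/cpuinfo separates the per-CPU blocks with a blank line.
    for block in cpuinfo_text.strip().split("\n\n"):
        fields = {}
        for line in block.splitlines():
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        if "processor" in fields:
            signatures.append(
                (fields.get("cpu family"), fields.get("model"), fields.get("stepping"))
            )
    return signatures


def cpus_identical(cpuinfo_text):
    """True when every CPU reports the same family, model and stepping."""
    return len(set(cpu_signatures(cpuinfo_text))) <= 1


# Hypothetical excerpt; only the stepping values come from the thread.
SAMPLE = """processor\t: 0
cpu family\t: 15
model\t\t: 2
stepping\t: 7

processor\t: 1
cpu family\t: 15
model\t\t: 2
stepping\t: 9
"""

if __name__ == "__main__":
    print(cpu_signatures(SAMPLE))
    print(cpus_identical(SAMPLE))
```

On a live system the same functions can be fed the contents of /proc/cpuinfo instead of SAMPLE.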
But the bizarre thing is that I couldn't reproduce this MCE at all using
another distribution on the same (pre-replacement) hardware. Does Gentoo
push the hardware much harder than other distros? Perhaps because I'm
compiling the code for my particular hardware instead of running code that
was built to run on many different sets of hardware (less aggressive CFLAGS
et al.)? I'm at a loss to explain this.
Again, many thanks for all the help here.
-Kevin
* Re: [gentoo-dev] SOLVED: Major MCE problem with SMP on Gentoo kernels
2004-05-20 21:16 ` Kevin
@ 2004-05-20 21:32 ` Greg KH
2004-05-20 23:08 ` Robin H. Johnson
2004-05-21 13:05 ` Chris Gianelloni
2 siblings, 0 replies; 41+ messages in thread
From: Greg KH @ 2004-05-20 21:32 UTC (permalink / raw
To: Kevin; +Cc: gentoo-dev
On Thu, May 20, 2004 at 05:16:25PM -0400, Kevin wrote:
> The lesson I've learned (though I'm not sure
> this is really the root issue) is that when doing multi-processor
> computing, make sure that both processors are identical in every way.
> Any thoughts on the accuracy of this rule?
That is a _very_ good rule to stick with, I know many problems go away
if you follow it. You never mentioned that this was the case with your
hardware, or I would have mentioned it earlier, sorry.
Glad it's all working for you, and you can stick with Gentoo :)
thanks,
greg k-h
* Re: [gentoo-dev] SOLVED: Major MCE problem with SMP on Gentoo kernels
2004-05-20 21:16 ` Kevin
2004-05-20 21:32 ` Greg KH
@ 2004-05-20 23:08 ` Robin H. Johnson
2004-05-20 23:16 ` Hasse Hagen Johansen
2004-05-21 13:05 ` Chris Gianelloni
2 siblings, 1 reply; 41+ messages in thread
From: Robin H. Johnson @ 2004-05-20 23:08 UTC (permalink / raw
To: Gentoo Developers
On Thu, May 20, 2004 at 05:16:25PM -0400, Kevin wrote:
> Hi All-
>
> A final note to this thread.
>
> After trying for many hours of high-intensity cpu activity (like emerging
> many packages---which is what used to cause the MCE), since replacing my
> stepping level 7 Xeon with a stepping level 9 Xeon (so that I now have
> two identical cpus, even in stepping levels (whereas this was not true
> before), I have been unable to reproduce my MCE 0004 error. I even did
> this with the kernel compiled with -march=pentium4 CFLAGS (that caused an
> MCE after 5 or 10 minutes of emerging mysql with stepping level 7 and
> stepping level 9 cpus installed).
If you'd pointed out your cpus were different in the first place, that
would have been the first thing to change.
The term is SMP, _Symmetric_ Multi-Processing, on purpose: the CPUs
need to be identical. I'm surprised it even worked to the degree it did.
> Does Gentoo push the hardware much harder than other distros?
Yup, running with gentoo optimizations is a LOT harder on machines.
> Perhaps because I'm compiling the code for my particular hardware vice
> running code that was built to run on many different sets of hardware
> (less aggressive CFLAGS et. al.)? I'm at a loss to explain this.
As an example of this, with GCC maxed out on '-O3 -march=pentium4
-fomit-frame-pointer', and trying the same CFLAGS to compile MySQL, I
can crash GCC with some frequency on certain hardware. Yet 100%
identical hardware in an adjacent server compiles the same code fine. Both
machines are 1U Intel servers (single 2.66GHz P4 Xeon CPU, dual-CPU
board, 1GB RAM, 3ware RAID1 - 40GB), from the same batch (sequential
serial numbers).
It boils down to the fact that the hardware shipped out is good enough
to withstand the burn-in tests, but the acceptance point of the burn-in
tests is lower than the stress placed on the machine by Gentoo.
--
Robin Hugh Johnson
E-Mail : robbat2@orbis-terrarum.net
Home Page : http://www.orbis-terrarum.net/?l=people.robbat2
ICQ# : 30269588 or 41961639
GnuPG FP : 11AC BA4F 4778 E3F6 E4ED F38E B27B 944E 3488 4E85
* Re: [gentoo-dev] SOLVED: Major MCE problem with SMP on Gentoo kernels
2004-05-20 23:08 ` Robin H. Johnson
@ 2004-05-20 23:16 ` Hasse Hagen Johansen
2004-05-21 2:46 ` Kevin
0 siblings, 1 reply; 41+ messages in thread
From: Hasse Hagen Johansen @ 2004-05-20 23:16 UTC (permalink / raw
To: gentoo-dev
>>>>> "Robin" == Robin H Johnson <robbat2@gentoo.org> writes:
Robin> If you'd pointed out your cpus were different in the first
Robin> place, that would have been the first thing to change. the
Robin> term is SMP - _Symmetrical_ Multi-Processing on purpose,
Robin> the CPUs need to be identical. I'm surprised it even worked
Robin> to the degree it did.
He did point it out early on :-)
/Hasse
* Re: [gentoo-dev] SOLVED: Major MCE problem with SMP on Gentoo kernels
2004-05-20 23:16 ` Hasse Hagen Johansen
@ 2004-05-21 2:46 ` Kevin
0 siblings, 0 replies; 41+ messages in thread
From: Kevin @ 2004-05-21 2:46 UTC (permalink / raw
To: gentoo-dev
On Thursday 20 May 2004 19:16, Hasse Hagen Johansen wrote:
> >>>>> "Robin" == Robin H Johnson <robbat2@gentoo.org> writes:
>
> Robin> If you'd pointed out your cpus were different in the first
> Robin> place, that would have been the first thing to change. the
> Robin> term is SMP - _Symmetrical_ Multi-Processing on purpose,
> Robin> the CPUs need to be identical. I'm surprised it even worked
> Robin> to the degree it did.
>
> He did point it out early on :-)
Thanks, Hasse. Glad somebody noticed. :-)
Greg/Robin, in my defense, I feel I must point out:
It was in my first post (/proc/cpuinfo output),
again here:
On Thursday 13 May 2004 07:06, Kevin wrote:
> Does anyone think that my two CPUs having different stepping levels
> could have anything to do with this problem? One is level 7 and the
> other 9.
and again here:
On Tuesday 18 May 2004 04:29, Kevin wrote:
> anything like this before and I'm at a loss on how to resolve it. I'm
> tempted to try replacing one of the CPUs to see if identical stepping
> levels (my CPU0 is stepping level 7 and CPU1 is level 9, but they are
> otherwise identical) will resolve the problem.
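For anyone who wants to check a box for the same condition, the stepping levels can be compared straight from /proc/cpuinfo. What follows is only a minimal Python sketch: it assumes the x86 layout (one blank-line-separated block per processor, each with a "stepping" field), the function names are made up for illustration, and SAMPLE is a hypothetical two-CPU excerpt, not output from the machine discussed in this thread.

```python
# Hypothetical two-CPU excerpt in /proc/cpuinfo layout (illustrative values only).
SAMPLE = """\
processor\t: 0
model name\t: Example CPU
stepping\t: 7

processor\t: 1
model name\t: Example CPU
stepping\t: 9
"""


def parse_cpuinfo(text):
    """Split /proc/cpuinfo-style text into one dict of fields per processor."""
    cpus = []
    for block in text.strip().split("\n\n"):
        fields = {}
        for line in block.splitlines():
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        if fields:
            cpus.append(fields)
    return cpus


def steppings_by_cpu(cpus):
    """Map each processor number to its reported stepping level."""
    return {cpu["processor"]: cpu["stepping"]
            for cpu in cpus
            if "processor" in cpu and "stepping" in cpu}


def has_mixed_steppings(cpus):
    """True when the processors do not all report the same stepping."""
    return len(set(steppings_by_cpu(cpus).values())) > 1


if __name__ == "__main__":
    cpus = parse_cpuinfo(SAMPLE)
    print(steppings_by_cpu(cpus))
    print("mixed steppings:", has_mixed_steppings(cpus))
```

On a real Linux system the text would come from open("/proc/cpuinfo").read() instead of SAMPLE. A mismatch only flags the condition; it does not by itself prove that it causes machine check exceptions.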
Thanks again for all the help, folks, and I too am extremely delighted
that I can stay with Gentoo. Over the past 24+ hours as I've been
pushing this box to the limit with emerge this and emerge that, upgrading
major packages at the snap of a finger, having two different versions of
some packages installed in different slots, and building everything from
source, it's easy to remember why I struggled so hard to stay with
Gentoo. It really does represent a terrific improvement on the standard
distros.
--
-Kevin
* Re: [gentoo-dev] SOLVED: Major MCE problem with SMP on Gentoo kernels
2004-05-20 21:16 ` Kevin
2004-05-20 21:32 ` Greg KH
2004-05-20 23:08 ` Robin H. Johnson
@ 2004-05-21 13:05 ` Chris Gianelloni
2 siblings, 0 replies; 41+ messages in thread
From: Chris Gianelloni @ 2004-05-21 13:05 UTC (permalink / raw
To: Kevin; +Cc: gentoo-dev
On Thu, 2004-05-20 at 17:16, Kevin wrote:
> But the bizarre thing is that I couldn't reproduce this MCE at all using
> another distribution on the same (pre-replacement) hardware. Does Gentoo
> push the hardware much harder than other distros? Perhaps because I'm
> compiling the code for my particular hardware vice running code that was
> built to run on many different sets of hardware (less aggressive CFLAGS
> et. al.)? I'm at a loss to explain this.
Simply... Yes. Even at -march=pentium3 you are using the MMX and SSE
portions of the chip, which may not be used at all on another
distribution (compiled with -march=i586).
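To see whether a given CPU even advertises those instruction sets, the "flags" line of /proc/cpuinfo can be inspected. The following is a rough Python sketch, not a diagnostic tool: the helper names are invented for illustration, SAMPLE_FLAGS is a hypothetical flags line, and the default feature list ("mmx", "sse") simply mirrors the two extensions named above.

```python
# Hypothetical "flags" line in /proc/cpuinfo layout (illustrative only).
SAMPLE_FLAGS = "flags\t\t: fpu tsc cmov mmx fxsr sse\n"


def cpu_flags(cpuinfo_text):
    """Return the feature flags from the first 'flags' line, as a set."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def missing_features(cpuinfo_text, wanted=("mmx", "sse")):
    """List the wanted features that the CPU does not advertise."""
    flags = cpu_flags(cpuinfo_text)
    return [name for name in wanted if name not in flags]


if __name__ == "__main__":
    missing = missing_features(SAMPLE_FLAGS)
    if missing:
        print("CPU does not advertise:", ", ".join(missing))
    else:
        print("CPU advertises MMX and SSE")
```

This only reports what the CPU claims to support. Whether a particular binary actually executes MMX or SSE instructions depends on how it was compiled.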
CFLAGS, as both the Gnome and KDE teams can attest, can make a world of
difference on how things come out in the end.
As for the memtest86 problem, who knows... ask the memtest86 guys.
They'd probably be really interested in your findings.
--
Chris Gianelloni
Developer
Games/LiveCD Teams
Gentoo Linux
Is your power animal a penguin?