public inbox for gentoo-dev@lists.gentoo.org
* [gentoo-dev] Major MCE problem with SMP on Gentoo kernels
@ 2004-05-11 18:07 Kevin
  2004-05-11 18:46 ` Greg KH
  2004-05-12  2:42 ` [gentoo-dev] Major MCE problem with SMP on Gentoo kernels Josh Glover
  0 siblings, 2 replies; 41+ messages in thread
From: Kevin @ 2004-05-11 18:07 UTC
  To: Gentoo Dev

Hi All-

I'm writing here before reporting a bug because I may be missing
something important, and because I'm not sure which details a bug report
would need: I don't know whether the problem lies with the Gentoo
kernels, with gcc, or with something else.  If I am missing something,
however, I'm not the only Gentoo user who is, so I think that's
unlikely.  In March I saw a thread on lkml from somebody else running
Gentoo in extremely similar, though not identical, circumstances.  He
thought it was a kernel bug, but I don't think so; see
http://groups.google.com/groups?hl=en&lr=&ie=UTF-8&oe=ISO-8859-1&threadm=1yyJD-8mD-11%40gated-at.bofh.it&rnum=6&prev=/groups%3Fq%3Dgroup:linux.kernel%2Bsmp%2Bgentoo%26hl%3Den%26lr%3D%26ie%3DUTF-8%26oe%3DISO-8859-1%26sa%3DG%26scoring%3Dd

or search for "group:linux.kernel smp gentoo" on Google Groups, or see
the lkml thread "SMP + Hyperthreading / Asus PCDL Deluxe / Kernel 2.4.x
2.6.x / Crash/Freeze".

Instead, I think the most likely explanation for my problem is a bug in 
some Gentoo code somewhere, perhaps related to building kernels, but 
maybe not... Maybe related to building gcc itself?  Not sure.

In summary, my problem is this: none of the Gentoo kernels I've tried
can handle SMP operation under heavy CPU load (like emerging packages)
for more than about 5 or 10 minutes.  Invariably, during such activity,
I get a kernel panic, most often with a console message about Machine
Check Exception 000000...004 (this number is from memory, so it may be
off).

The only way I can get reliable, stable operation with a Gentoo kernel
and distribution is to build the kernel without SMP support.  Such a
kernel is stable with or without hyperthreading enabled in CMOS.  Over
the last week or so, I've tried every combination of the CMOS
hyperthreading setting (enabled and disabled) and kernel SMP support
(enabled and disabled) with the latest stable ebuilds of the following
Gentoo kernels: vanilla-sources, gentoo-sources, gentoo-dev-sources, and
gs-sources (actually, I couldn't even get gs-sources to build; see bug
#48973 and the thread "gs-sources problems: device mapper: dm.o has
undeclared identifiers").  Going by memory, I tried kernel versions
2.4.25 (vanilla?), 2.4.26 (gentoo?), and 2.6.5 (gentoo-dev?).

In all of the above configurations, when running a kernel with SMP
support and emerging packages (some of them pretty small, so only about
3 to 5 minutes of compiling), the machine would lock up with a kernel
panic and need a hard reset.

The machine is a Dell PowerEdge 1600SC with dual 2.4GHz Xeon processors,
each having a 512KB L2 cache.  It has a PERC-3/SC SCSI RAID controller
(using the AMI megaraid2 driver) and an LSI Logic Corp controller (using
the Fusion MPT base driver) for the SCSI DAT.

Output from /proc/cpuinfo is:
=======
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 15
model           : 2
model name      : Intel(R) Xeon(TM) CPU 2.40GHz
stepping        : 7
cpu MHz         : 2392.127
cache size      : 512 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge 
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe cid
bogomips        : 4771.02

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 15
model           : 2
model name      : Intel(R) Xeon(TM) CPU 2.40GHz
stepping        : 7
cpu MHz         : 2392.127
cache size      : 512 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge 
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe cid
bogomips        : 4771.02

processor       : 2
vendor_id       : GenuineIntel
cpu family      : 15
model           : 2
model name      : Intel(R) Xeon(TM) CPU 2.40GHz
stepping        : 9
cpu MHz         : 2392.127
cache size      : 512 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge 
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe cid
bogomips        : 4771.02

processor       : 3
vendor_id       : GenuineIntel
cpu family      : 15
model           : 2
model name      : Intel(R) Xeon(TM) CPU 2.40GHz
stepping        : 9
cpu MHz         : 2392.127
cache size      : 512 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge 
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe cid
bogomips        : 4771.02
=======
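
For readers who want to summarize a listing like the one above, here is a
minimal Python sketch that splits /proc/cpuinfo-style text into one record
per logical processor and counts processors by stepping.  The function name
parse_cpuinfo and the abbreviated SAMPLE text are illustrative assumptions,
not part of the original report; SAMPLE keeps only the "processor" and
"stepping" fields from the listing.

```python
from collections import Counter


def parse_cpuinfo(text):
    """Split /proc/cpuinfo-style text into one dict per logical processor.

    Hypothetical helper for illustration: blank lines separate processors,
    and a line without a colon is treated as the continuation of the
    previous value (e.g. a "flags" line wrapped by a mail client).
    """
    cpus, current, last_key = [], {}, None
    for line in text.splitlines():
        if not line.strip():
            if current:
                cpus.append(current)
            current, last_key = {}, None
        elif ":" in line:
            key, _, value = line.partition(":")
            last_key = key.strip()
            current[last_key] = value.strip()
        elif last_key is not None:
            current[last_key] += " " + line.strip()
    if current:
        cpus.append(current)
    return cpus


# Abbreviated from the listing above (assumption: only two fields kept).
SAMPLE = """\
processor       : 0
stepping        : 7

processor       : 1
stepping        : 7

processor       : 2
stepping        : 9

processor       : 3
stepping        : 9
"""

cpus = parse_cpuinfo(SAMPLE)
by_stepping = Counter(cpu["stepping"] for cpu in cpus)
print(len(cpus), dict(by_stepping))  # 4 {'7': 2, '9': 2}
```

The sketch only tabulates what the listing reports (four logical
processors, two at stepping 7 and two at stepping 9); it does not by itself
say anything about the cause of the panics.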

I wrote up some details about this problem on gentoo-user in the thread
"2004.1 and SMP Problems", but I have done a lot more testing since then.

The reason I think this is a Gentoo thing and not a kernel thing is that
today I finished installing SuSE 9 on this same machine with the CMOS
hyperthreading setting enabled, and the CPUs have been wailing away for
hours doing simultaneous builds of several different source tarballs
(bind9, kde3.2.2, mysql 4.0.18) without a single problem.  During these
tests, I was running the SuSE kernel 2.4.21-215-smp4G.

In SuSE, the output of /proc/cpuinfo is close to, but not exactly the
same as, the output above.  The SuSE output adds "physical id" and
"siblings" fields, lacks the pbe and cid flags, and reports slightly
different cpu MHz and bogomips values (use diff for specifics).

SuSE /proc/cpuinfo:

=======
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 15
model           : 2
model name      : Intel(R) Xeon(TM) CPU 2.40GHz
stepping        : 7
cpu MHz         : 2392.795
cache size      : 512 KB
physical id     : 0
siblings        : 2
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm
bogomips        : 4718.59

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 15
model           : 2
model name      : Intel(R) Xeon(TM) CPU 2.40GHz
stepping        : 7
cpu MHz         : 2392.795
cache size      : 512 KB
physical id     : 0
siblings        : 2
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm
bogomips        : 4767.74

processor       : 2
vendor_id       : GenuineIntel
cpu family      : 15
model           : 2
model name      : Intel(R) Xeon(TM) CPU 2.40GHz
stepping        : 9
cpu MHz         : 2392.795
cache size      : 512 KB
physical id     : 2
siblings        : 2
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm
bogomips        : 4767.74

processor       : 3
vendor_id       : GenuineIntel
cpu family      : 15
model           : 2
model name      : Intel(R) Xeon(TM) CPU 2.40GHz
stepping        : 9
cpu MHz         : 2392.795
cache size      : 512 KB
physical id     : 2
siblings        : 2
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm
bogomips        : 4767.74
=======
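
As a rough illustration of the flag comparison, the Python sketch below
takes the "flags" value from processor 0 in each of the two listings and
reports which flags appear in only one of them.  The names GENTOO_FLAGS,
SUSE_FLAGS and flag_diff are hypothetical, chosen for this example; the flag
strings are copied from the listings with the mail-client line wrap
rejoined.

```python
# "flags" value for processor 0 under the Gentoo kernel (from the first listing).
GENTOO_FLAGS = (
    "fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge "
    "mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe cid"
)

# "flags" value for processor 0 under the SuSE kernel (from the second listing).
SUSE_FLAGS = (
    "fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge "
    "mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm"
)


def flag_diff(first, second):
    """Return (flags only in first, flags only in second), each sorted."""
    a, b = set(first.split()), set(second.split())
    return sorted(a - b), sorted(b - a)


only_gentoo, only_suse = flag_diff(GENTOO_FLAGS, SUSE_FLAGS)
print("only under Gentoo kernel:", only_gentoo)  # ['cid', 'pbe']
print("only under SuSE kernel:  ", only_suse)    # []
```

A difference in the flags line may simply reflect which flags each kernel
version reports, so it is not by itself evidence of a hardware or
configuration difference.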

Gentoo emerge info output:
=======
System uname: 2.4.25-gentoo-r2 i686 Intel(R) Xeon(TM) CPU 2.40GHz 
Gentoo Base System version 1.4.9 
distcc 2.13 i686-pc-linux-gnu (protocols 1 and 2) (default port 3632) [enabled] 
ccache version 2.3 [enabled] 
Autoconf: sys-devel/autoconf-2.58-r1 
Automake: sys-devel/automake-1.8.3 
ACCEPT_KEYWORDS="x86" 
AUTOCLEAN="yes" 
CFLAGS="-O3 -march=pentium4 -pipe -fomit-frame-pointer" 
CHOST="i686-pc-linux-gnu" 
COMPILER="gcc3" 
CONFIG_PROTECT="/etc /usr/X11R6/lib/X11/xkb /usr/kde/2/share/config /usr/kde/3.2/share/config /usr/kde/3/share/config /usr/lib/mozilla/defaults/pref /usr/share/config /usr/share/texmf/dvipdfm/config/ /usr/share/texmf/dvips/config/ /usr/share/texmf/tex/generic/config/ /usr/share/texmf/tex/platex/config/ /usr/share/texmf/xdvi/ /var/qmail/control" 
CONFIG_PROTECT_MASK="/etc/afs/C /etc/afs/afsws /etc/gconf /etc/terminfo /etc/env.d" 
CXXFLAGS="-O3 -march=pentium4 -pipe -fomit-frame-pointer" 
DISTDIR="/usr/portage/distfiles" 
FEATURES="autoaddcvs ccache distcc sandbox" 
GENTOO_MIRRORS="http://128.213.5.34/gentoo/ 
http://mirror.datapipe.net/gentoo 
ftp://mirrors.sec.informatik.tu-darmstadt.de/gentoo/ 
http://gentoo.eliteitminds.com" 
MAKEOPTS="-j3" 
PKGDIR="/usr/portage/packages" 
PORTAGE_TMPDIR="/var/tmp" 
PORTDIR="/usr/portage" 
PORTDIR_OVERLAY="" 
SYNC="rsync://rsync.namerica.gentoo.org/gentoo-portage" 
USE="X Xaw3d acl acpi afs alsa apache2 apm arts avi berkdb bonobo caps crypt 
cups doc emacs emacs-w3 encode esd ethereal evo firebird flac foomaticdb gdbm 
gif gnome gpm gstreamer gtk gtk2 gtkhtml guile hardened icq imagemagick imap 
imlib innodb ipv6 jabber jack java jikes jpeg kde kerberos krb4 ldap libg++ 
libwww mad mcal mikmod motif mozilla mpeg mysql ncurses nls odbc oggvorbis 
opengl oss pam pda pdflib perl plotutils png ppds prelude python qt quicktime 
readline ruby samba sasl sdl slang slp spell sse ssl svga tcltk tcpd tetex tiff 
truetype unicode usb vhosts x86 xinerama xml2 xmms xv zeo zlib" 
=======

I'm a recent Gentoo convert.  I think it's an excellent improvement on the 
traditional Linux distros, and I'd really like to use it on my server, 
but as long as this problem with SMP is present, I just can't.

If anyone has any suggestions about what I might be doing wrong and how I
can get a stable Gentoo system with full support for SMP, I would really
appreciate your thoughts.  Ideally this would be with a 2.4.x kernel,
since I need OpenAFS, and OpenAFS doesn't work with 2.6.x right now and,
according to its developers, won't in the near future.

Thanks in advance.

-- 
-Kevin

PS. FWIW, I have a very vague memory of seeing a warning about something
being unsafe with SMP while text flew up the screen during bootstrap.sh
or emerge system (this was a stage 1 install, before I installed SuSE
over it).  Do I need to turn some setting off for parts of a stage 1
install on a dual-CPU system?


--
gentoo-dev@gentoo.org mailing list



end of thread, other threads:[~2004-05-26  5:18 UTC | newest]

Thread overview: 41+ messages
2004-05-11 18:07 [gentoo-dev] Major MCE problem with SMP on Gentoo kernels Kevin
2004-05-11 18:46 ` Greg KH
2004-05-11 18:55   ` Kevin
2004-05-11 19:04     ` Greg KH
2004-05-11 19:38       ` Kevin
2004-05-11 20:54         ` Chris Gianelloni
2004-05-11 21:31           ` Kevin
2004-05-11 19:38     ` Paul de Vrieze
2004-05-11 21:37       ` Kevin
2004-05-12  1:02         ` Georgi Georgiev
2004-05-12 10:23           ` [gentoo-dev] [OT] SuSE kernel on gentoo system (Was: Re: Major MCE problem with SMP on Gentoo kernels) sf
2004-05-12  2:42 ` [gentoo-dev] Major MCE problem with SMP on Gentoo kernels Josh Glover
2004-05-12  9:31   ` Dan Podeanu
2004-05-12 11:26     ` Kevin
2004-05-12 11:24   ` Kevin
2004-05-12 11:48     ` Josh Glover
2004-05-12 12:14       ` Ciaran McCreesh
2004-05-12 13:58       ` Kevin
2004-05-12 14:44         ` Chris Gianelloni
2004-05-12 15:17         ` tom_gall
2004-05-13 11:06           ` Kevin
2004-05-13 11:12             ` Senor Rodgman
2004-05-13 13:04             ` Chris Gianelloni
2004-05-13 15:04             ` Daniel Drake
2004-05-13 15:54             ` Greg KH
2004-05-18  8:29               ` Kevin
2004-05-18 10:59                 ` Alexander Futasz
2004-05-18 12:02                 ` Josh Glover
2004-05-19 17:48                   ` Kevin
2004-05-20 12:19                     ` [gentoo-dev] SOLVED: " Kevin
2004-05-20 21:16                     ` Kevin
2004-05-20 21:32                       ` Greg KH
2004-05-20 23:08                       ` Robin H. Johnson
2004-05-20 23:16                         ` Hasse Hagen Johansen
2004-05-21  2:46                           ` Kevin
2004-05-21 13:05                       ` Chris Gianelloni
2004-05-18 12:46                 ` [gentoo-dev] " Daniel Drake
     [not found]         ` <40A23987.9080104@gentoo.org>
2004-05-12 16:22           ` [gentoo-dev] memtest86 fails? (was Re: [gentoo-dev] Major MCE problem with SMP on Gentoo kernels) Kevin
2004-05-12 16:59             ` Greg KH
2004-05-12 17:18               ` Scott Myron
2004-05-12 17:15             ` Sven Vermeulen
