Date: Wed, 27 Jul 2005 19:07:17 +0200 (CEST)
From: Jean Borsenberger <Jean.Borsenberger@obspm.fr>
To: gentoo-amd64@lists.gentoo.org
Subject: Re: [gentoo-amd64] Re: Re: amd64 and kernel configuration

Well, maybe it's SUMO, but when we switched on the NUMA option in the
kernel of our quad-processor, 16GB Opteron box, it did speed up the
OpenMP benchmarks by 20% to 30% (depending on the program considered).

Note: OpenMP is an extension of Fortran (and C/C++) in which you add
parallelisation directives, without worrying about the implementation
details, using a single address space for all instances of the user
program.

Jean Borsenberger                       tel: +33 (0)1 45 07 76 29
Observatoire de Paris Meudon
5 place Jules Janssen
92195 Meudon
France

On Wed, 27 Jul 2005, Duncan wrote:

> Drew Kirkpatrick posted <81469e8e0507270346445f4363@mail.gmail.com>,
> excerpted below, on Wed, 27 Jul 2005 05:46:28 -0500:
>
> > Just to point out, AMD was calling the Opterons and such more of a
> > SUMO configuration (Sufficiently Uniform Memory Organization, not
> > joking here) instead of NUMA. While technically it clearly is a
> > NUMA system, the differences in latency when accessing memory from
> > a bank attached to another processor's memory controller are very
> > small. Small enough to be largely ignored, and treated like the
> > uniform memory access latencies of an SMP system. Sorta in between
> > SMP unified-style memory access and NUMA. This holds for up to 3
> > hypertransport link hops, or up to 8 chips/sockets. Once you add
> > hypertransport switches to scale beyond 8 chips/sockets, it'll
> > most likely be a different story...
>
> I wasn't aware of the AMD "SUMO" moniker, but it /does/ make sense,
> given the design of the hardware. They have a very good point: while
> it's physically NUMA, the latency variances are so close to unified
> that in many ways it's indistinguishable -- except for the fact that
> keeping it NUMA means two different apps running on two different
> CPUs can access their own memory independently and in parallel,
> rather than one having to wait for the other, as they would if the
> memory were interleaved and unified (as it would be for quad-channel
> access, if that were enabled).
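To make my note about OpenMP above a bit more concrete, here is a
minimal sketch, written in C rather than the Fortran we actually use;
the directive works the same way in both (it is spelled
!$OMP PARALLEL DO in Fortran). The arrays and the loop body are
invented purely for illustration:

  #include <stdio.h>
  #include <omp.h>

  #define N 10000000

  static double a[N], b[N];

  int main(void)
  {
      /* The directive splits the loop iterations among the available
       * threads; all of them share the single address space, so no
       * explicit data distribution is needed. */
      #pragma omp parallel for
      for (long i = 0; i < N; i++)
          a[i] = 2.0 * b[i] + 1.0;

      printf("ran with up to %d OpenMP threads\n", omp_get_max_threads());
      return 0;
  }

Built with the compiler's OpenMP flag and run on the quad Opteron,
each thread chews on its own slice of the arrays; with the NUMA option
in the kernel, the pages a thread touches first tend to be allocated
from the memory bank attached to the CPU it runs on, which is
presumably where our 20-30% came from.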
>
> > What I've always wondered is: the NUMA code in the Linux kernel,
> > is this for handling traditional NUMA, like in a large computer
> > system (big iron) where NUMA memory access latencies will vary
> > greatly, or is it simply for optimizing the memory usage across
> > the memory banks? Keeping data in the memory of the processor
> > using it, etc, etc. Of course none of this matters for single
> > chip/socket AMD systems, as dual cores as well as single cores
> > share a memory controller. Hmm, maybe I should drink some coffee
> > and shut up until I'm awake...
>
> Well, yeah, for single-socket/dual-core, but what about dual socket
> (either single core or dual core)? Your questions make sense there,
> and that's what I'm running (single core, tho upgrading to dual core
> for a quad-core-total board sometime next year would be very nice,
> and just might be within the limits of my budget), so yes, I'm
> rather interested!
>
> The answer to your question on how the kernel deals with it, by my
> understanding, is this: the Linux kernel SMP/NUMA architecture
> allows for "CPU affinity grouping". In earlier kernels it was all
> automated, but things are now getting advanced enough to allow
> deliberate manual splitting of the various groups, and, combined
> with userspace control applications, it will ultimately be possible
> to dynamically assign processes to one or more CPU groups of various
> sizes, controlling the CPU and memory resources available to
> individual processes. So, yes, I guess that means it's developing
> some pretty "big iron" qualities, altho many of them are still in
> flux and won't be stable, at least in mainline, for another six
> months or a year, at minimum.
>
> Let's refocus now on the implementation and the smaller picture, to
> examine these "CPU affinity zones" in a bit more detail. The
> following is according to the writeups I've seen, mostly on LWN's
> weekly kernel pages. (Jon Corbet, LWN editor, does a very good job
> of balancing the technical kernel-hacker-level stuff with the
> middle-ground, not-too-technical kernel-follower stuff, good enough
> that I find the site worth subscribing to, even tho I could get even
> the premium content a week later for free. Yes, that's an
> endorsement of the site, because it's where a lot of my info comes
> from, and I'm certainly not one to try to keep my knowledge
> exclusive!)
>
> Anyway... from mainly that source... CPU affinity zones work with
> sets and supersets of processors. An Intel hyperthreading pair of
> virtual processors on the same physical processor will be at the
> highest affinity level (the lowest, aka strongest, grouping in the
> hierarchy), because they share the same cache memory all the way up
> to L1 itself, and the Linux kernel can switch processes between the
> two virtual CPUs of a hyperthreaded CPU with zero cost or loss in
> performance, so it only has to take into account the relative
> balance of processes on each of the hyperthread virtual CPUs.
>
> At the next affinity level down, we'd have the dual-core AMDs: same
> chip, same memory controller, same local memory, same hypertransport
> interfaces to the chipset, other CPUs and the rest of the world, and
> very tightly cooperative, but with separate L2 and of course
> separate L1 caches.
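Coming back to the userspace control of those affinity groups: you can
already pin a process onto a chosen CPU by hand, and on a NUMA kernel
that also tends to keep its data in the local memory bank, since pages
are normally allocated on the node where they are first touched. Here
is a minimal sketch, assuming a 2.6 kernel and a glibc recent enough
to provide the cpu_set_t interface; the choice of CPU is arbitrary and
just for illustration:

  #define _GNU_SOURCE
  #include <sched.h>
  #include <stdio.h>
  #include <stdlib.h>

  int main(int argc, char **argv)
  {
      /* CPU number to pin ourselves to; taken from the command line,
       * defaulting to CPU 0. */
      int cpu = (argc > 1) ? atoi(argv[1]) : 0;
      cpu_set_t set;

      CPU_ZERO(&set);
      CPU_SET(cpu, &set);

      /* pid 0 means "the calling process". From here on the scheduler
       * keeps us on that CPU, and memory we touch afterwards is
       * normally allocated from that CPU's own bank. */
      if (sched_setaffinity(0, sizeof(set), &set) != 0) {
          perror("sched_setaffinity");
          return 1;
      }

      printf("pinned to CPU %d\n", cpu);
      /* ... the real work would go here ... */
      return 0;
  }

The same pinning can be done without touching the program at all, via
the taskset utility (something like "taskset -c 2 ./mybench", where
mybench stands for whatever you want to run).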
> There's a slight performance penalty when switching processes
> between the two cores of such a dual-core chip, due to the cache
> flushing it would entail, but it's only very slight, so the thread
> imbalance between the two cores doesn't have to get bad at all
> before it's worth switching CPUs to maintain balance, even at the
> cost of that cache flush.
>
> At a slightly lower level of affinity would be the Intel dual cores,
> since they aren't quite so tightly coupled and don't share all the
> same interfaces to the outside world. In practice, since only one of
> these, the Intel dual core or the AMD dual core, will normally be
> encountered in real life, they can be treated at the same level,
> with possibly a small internal tweak to the relative weighting of
> thread imbalance vs performance loss for switching CPUs, based on
> which one is actually in place.
>
> Here things get interesting, because of the different
> implementations available. AMD's 2-way through 8-way Opterons
> configured for unified memory access would come first, because,
> again, their dedicated inter-CPU hypertransport links let them
> cooperate more closely than conventional multi-socket CPUs would.
> Beyond that, it's a toss-up between Intel's unified-memory
> multi-processors and AMD's NUMA/SUMO-memory Opterons. I'd still say
> the Opterons cooperate more closely, even in NUMA/SUMO mode, than
> Intel chips do with unified memory, due to that SUMO aspect. At the
> same time, they have the parallel memory access advantages of NUMA.
>
> Beyond that, there are several levels of clustering: local/board;
> off-board but accessible over short fat pipes (using technologies
> such as PCI interconnect, fibre channel, and that SGI interconnect
> tech whose name I don't recall at the moment); conventional (and
> Beowulf?) type clustering; and remote clustering. At each of these
> levels, as with the above, the cost of switching processes between
> peers at the same affinity level gets higher and higher, so the
> process imbalance necessary to trigger a switch likewise gets higher
> and higher, until at the extreme of remote clustering it's almost
> only done manually, or anyway at the level of a user-level
> application managing the transfers rather than the kernel directly
> (since, after all, with remote clustering, each remote group, if not
> each individual machine within it, is probably running its own
> kernel).
>
> So, the point of all that is that the kernel sees a hierarchical
> grouping of CPUs, and is designed with more flexibility to balance
> processes and memory use at the high-affinity end, and more
> hesitation to balance them, due to the higher cost involved, at the
> extremely low affinity end. The main writeup I read on the subject
> dealt with thread/process CPU switching, not memory switching, but
> within the context of NUMA the principles become so intertwined that
> it's impossible to separate them, and the writeup very clearly made
> the point that the memory issues involved in making a transfer were
> included in the cost accounting as well.
>
> I'm not sure whether this addressed the point you were trying to
> make, or landed beside it, but anyway, it was fun trying to put into
> text, for the first time since I read about it, the principles in
> that writeup, along with other facts I've merged in along the way.
> My dad's a teacher, and I remember him many times making the point
> that the best way to learn something is to attempt to teach it.
> He used that principle in his own classes, having the students help
> each other, and I remember him making the point about himself as
> well, at one point: when he found himself teaching a high school
> class on basic accounting, he had to do it based only on a textbook
> and the single college intro-level class he had himself taken years
> before. The principle is certainly true: explaining the affinity
> clustering principles here has forced me to make sure they form a
> reasonable and self-consistent structure in my own head, in order to
> be able to explain them in this post. So, anyway, thanks for the
> intellectual stimulation!
>
> --
> Duncan - List replies preferred. No HTML msgs.
> "Every nonfree program has a lord, a master --
> and if you use the program, he is your master." Richard Stallman in
> http://www.linuxdevcenter.com/pub/a/linux/2004/12/22/rms_interview.html
>
> --
> gentoo-amd64@gentoo.org mailing list
>
-- 
gentoo-amd64@gentoo.org mailing list