To: gentoo-amd64@lists.gentoo.org
From: Duncan <1i5t5.duncan@cox.net>
Subject: [gentoo-amd64] Re: Re: amd64 and kernel configuration
Date: Wed, 27 Jul 2005 08:42:14 -0700
Organization: Sometimes
References: <20050727062947.73020.qmail@mail.mng.mn>
 <25f58b7910e09fd5453bb3ec534330d1@xsmail.com>
 <20050727075012.79549.qmail@mail.mng.mn>
 <81469e8e0507270346445f4363@mail.gmail.com>
Reply-to: gentoo-amd64@lists.gentoo.org
User-Agent: Pan/0.14.2.91 (As She Crawled Across the Table)

Drew Kirkpatrick posted <81469e8e0507270346445f4363@mail.gmail.com>,
excerpted below, on Wed, 27 Jul 2005 05:46:28 -0500:

> Just to point out, AMD was calling the Opterons and such more of a
> SUMO configuration (Sufficiently Uniform Memory Organization, not
> joking here), instead of NUMA.  While technically it clearly is a
> NUMA system, the difference in latency when accessing memory from a
> bank attached to another processor's memory controller is very
> small.  Small enough to be largely ignored, and treated like the
> uniform memory access latencies of an SMP system.  Sorta in between
> SMP unified-style memory access and NUMA.  This holds for up to 3
> HyperTransport link hops, or up to 8 chips/sockets.  Add
> HyperTransport switches to scale beyond 8 chips/sockets, and it'll
> most likely be a different story...

I wasn't aware of the AMD "SUMO" moniker, but it /does/ make sense,
given the design of the hardware.  They have a very good point: while
it's physically NUMA, the latency variances are so close to uniform
that in many ways it's indistinguishable -- except for the fact that
keeping it NUMA means two different apps running on two different CPUs
can each access their own memory independently and in parallel, rather
than one having to wait for the other, as they would if the memory
were interleaved and unified (as it would be for quad-channel access,
if that were enabled).
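Just to make that difference concrete -- purely as a rough, untested
sketch of my own, assuming the numactl/libnuma userspace library is
installed and you link with -lnuma -- here's what the two placement
styles look like from userspace:

  /* NUMA-local vs. interleaved allocation via libnuma.
     Build: gcc numa-sketch.c -o numa-sketch -lnuma */
  #include <numa.h>
  #include <stdio.h>
  #include <string.h>

  int main(void)
  {
      size_t len = 64 * 1024 * 1024;   /* 64 MB working set */
      void *local, *spread;

      if (numa_available() < 0) {
          fprintf(stderr, "no NUMA support in this kernel\n");
          return 1;
      }

      /* "SUMO/NUMA" placement: pages come from the bank attached to
         the memory controller of the node we're running on. */
      local = numa_alloc_local(len);

      /* "Unified" style: pages are spread round-robin across all the
         nodes, so every access sees the averaged latency instead. */
      spread = numa_alloc_interleaved(len);

      if (!local || !spread) {
          fprintf(stderr, "allocation failed\n");
          return 1;
      }

      memset(local, 0, len);    /* touch the pages to place them */
      memset(spread, 0, len);

      numa_free(local, len);
      numa_free(spread, len);
      return 0;
  }

With the first allocation, two processes on two sockets can each
stream out of their own bank in parallel; with the second, everybody
shares the averaged latency of all the banks, which is exactly the
tradeoff described above.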
> What I've always wondered is: the NUMA code in the Linux kernel, is
> this for handling traditional NUMA, as in a large computer system
> (big iron) where NUMA memory access latencies will vary greatly, or
> is it simply for optimizing memory usage across the memory banks,
> keeping data in the memory of the processor using it, etc, etc?  Of
> course none of this matters for single-chip/socket AMD systems, as
> dual cores as well as single cores share a memory controller.  Hmm,
> maybe I should drink some coffee and shut up until I'm awake...

Well, yeah, for single-socket/dual-core, but what about dual socket
(either single or dual core)?  Your questions make sense there, and
that's what I'm running (single core, tho upgrading to dual core for
four cores total on the board sometime next year would be very nice,
and just might be within the limits of my budget), so yes, I'm rather
interested!

The answer to your question on how the kernel deals with it, by my
understanding, is this:  The Linux kernel SMP/NUMA architecture allows
for "CPU affinity grouping".  In earlier kernels it was all automated,
but things are now getting advanced enough to allow deliberate manual
splitting of the various groups, which, combined with userspace
control applications, will ultimately allow dynamically assigning
processes to one or more CPU groups of various sizes, controlling the
CPU and memory resources available to individual processes.  So, yes,
I guess that means it's developing some pretty "big iron" qualities,
altho many of them are still in flux and won't be stable, in mainline
at least, for another six months to a year.

Let's refocus on the implementation and the smaller picture, to
examine these "CPU affinity zones" in a bit more detail.  The
following is according to the writeups I've seen, mostly on LWN's
weekly kernel pages.  (Jon Corbet, LWN editor, does a very good job of
balancing the technical kernel-hacker-level stuff with the
middle-ground, not-too-technical kernel-follower stuff, well enough
that I find the site worth subscribing to, even tho I could get even
the premium content a week later for free.  Yes, that's an endorsement
of the site, because it's where a lot of my info comes from, and I'm
certainly not one to try to keep my knowledge exclusive!)  Anyway...
from mainly that source...

CPU affinity zones work with sets and supersets of processors.  An
Intel hyperthreading pair of virtual processors on the same physical
processor will be at the highest affinity -- the lowest, most tightly
bound level of the hierarchy -- because they share the same cache
memory all the way up to L1 itself, so the Linux kernel can switch
processes between the two virtual CPUs of a hyperthreaded CPU at
essentially zero cost, taking into account only the relative balance
of processes on each of the two virtual CPUs.

At the next affinity level down, we'd have the dual-core AMDs: same
chip, same memory controller, same local memory, same HyperTransport
interfaces to the chipset, the other CPUs, and the rest of the world,
very tightly cooperative, but with separate L2 and of course separate
L1 caches.  There's a slight performance penalty to switching
processes between these two CPUs, due to the cache flushing and
refilling it entails, but it's only very slight, so the thread
imbalance between the two cores doesn't have to get bad at all before
it's worth switching CPUs to maintain balance, even at that cost.
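As an aside, the userspace hook that already exists for doing that
sort of pinning by hand is sched_setaffinity().  A minimal, untested
sketch (the CPU numbering is purely an assumption for illustration;
which physical resources logical CPUs 0 and 1 actually share --
hyperthread siblings, two cores on one chip, or two sockets -- depends
entirely on the box):

  /* Pin the calling process to logical CPUs 0 and 1. */
  #define _GNU_SOURCE
  #include <sched.h>
  #include <stdio.h>
  #include <unistd.h>

  int main(void)
  {
      cpu_set_t mask;

      CPU_ZERO(&mask);
      CPU_SET(0, &mask);
      CPU_SET(1, &mask);

      /* pid 0 means "the calling process" */
      if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
          perror("sched_setaffinity");
          return 1;
      }

      /* From here on, the scheduler only balances this process (and
         its children, which inherit the mask) between CPUs 0 and 1. */
      printf("pinned pid %d to CPUs 0-1\n", (int)getpid());
      return 0;
  }

Anyway, back to walking down the hierarchy...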
At a slightly lower level of affinity would be the Intel dual cores,
since they aren't quite so tightly coupled and don't share all the
same interfaces to the outside world.  In practice, since only one of
the two, Intel or AMD dual core, will normally be found in any given
system, they can be treated at the same level, with possibly a small
internal tweak to the relative weighting of thread imbalance vs. the
performance cost of switching CPUs, based on which one is actually in
place.

Here things get interesting, because of the different implementations
available.  AMD's 2-way thru 8-way Opterons configured for unified
memory access would come next, because again, their dedicated
inter-CPU HyperTransport links let them cooperate more closely than
conventional multi-socket CPUs would.  Beyond that, it's a tossup
between Intel's unified-memory multi-processors and AMD's
NUMA/SUMO-memory Opterons.  I'd still say the Opterons cooperate more
closely, even in NUMA/SUMO mode, than Intel chips do with unified
memory, due to that SUMO aspect, while at the same time they keep the
parallel memory access advantages of NUMA.

Beyond that, there are several levels of clustering: local/on-board;
off-board but reachable over a short fat pipe (using technologies such
as PCI interconnect, Fibre Channel, and that SGI interconnect tech
whose name I don't recall at the moment); conventional (and Beowulf?)
clustering; and remote clustering.  At each of these levels, as with
the ones above, the cost of switching processes between peers at the
same affinity level gets higher and higher, so the process imbalance
necessary to trigger a switch likewise gets higher and higher, until
at the extreme of remote clustering it's almost only done manually, or
anyway at the level of a userspace application managing the transfers
rather than the kernel directly (since, after all, with remote
clustering, each remote group, if not each individual machine within
the group, is probably running its own kernel).

So, the point of all that is that the kernel sees a hierarchical
grouping of CPUs, and is designed with more freedom to balance
processes and memory use at the high-affinity end, and more hesitation
to balance, due to the higher cost involved, at the low-affinity end.
The main writeup I read on the subject dealt with thread/process CPU
switching, not memory switching, but within the context of NUMA the
principles become so intertwined it's impossible to separate them, and
the writeup very clearly made the point that the memory issues
involved in making the transfer were included in the cost accounting
as well.
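On the memory side of that cost accounting, the knob userspace gets is
mbind(), which ties a buffer's pages to a particular node.  Again,
this is only a rough, untested sketch of my own, with node 0 purely as
an example, assuming libnuma's <numaif.h> wrappers and linking with
-lnuma:

  /* Bind one anonymous mapping's pages to node 0. */
  #define _GNU_SOURCE
  #include <numaif.h>
  #include <sys/mman.h>
  #include <stdio.h>
  #include <string.h>

  int main(void)
  {
      size_t len = 16 * 1024 * 1024;
      unsigned long nodemask = 1UL << 0;   /* node 0 only */
      void *buf;

      buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      if (buf == MAP_FAILED) {
          perror("mmap");
          return 1;
      }

      if (mbind(buf, len, MPOL_BIND, &nodemask,
                sizeof(nodemask) * 8, 0) != 0) {
          perror("mbind");
          return 1;
      }

      memset(buf, 0, len);   /* first touch: pages land on node 0 */

      munmap(buf, len);
      return 0;
  }

Whether it's then worth migrating the process away from the node its
memory is bound to is exactly the kind of cost that gets weighed
against the imbalance.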
I'm not sure whether this addressed the point you were trying to make
or hit beside it, but anyway, it was fun trying to put the principles
in that writeup, along with other facts I've merged along the way,
into text for the first time since I read about it.  My dad's a
teacher, and I remember him many times making the point that the best
way to learn something is to attempt to teach it.  He used that
principle in his own classes, having the students help each other, and
I remember him making the point about himself as well, at one point,
as he struggled to teach basic accounting principles when he found
himself teaching a high school class on the subject, armed only with
the textbook and the single college intro-level class he had himself
taken years before.  The principle is certainly true: explaining the
affinity clustering principles here has forced me to make sure they
form a reasonable and self-consistent picture in my own head, in order
to be able to explain them in this post.

So, anyway, thanks for the intellectual stimulation!

-- 
Duncan - List replies preferred.  No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."
Richard Stallman in
http://www.linuxdevcenter.com/pub/a/linux/2004/12/22/rms_interview.html

-- 
gentoo-amd64@gentoo.org mailing list