Date: Wed, 27 Jul 2005 19:07:17 +0200 (CEST)
From: Jean Borsenberger <Jean.Borsenberger@obspm.fr>
To: gentoo-amd64@lists.gentoo.org
Subject: Re: [gentoo-amd64] Re: Re: amd64 and kernel configuration

Well, maybe it's SUMO, but when we switched on the NUMA option in the
kernel of our quad-processor, 16GB Opteron box, it did speed up the
OpenMP benchmarks by 20% to 30% (depending on the program considered).

Note: OpenMP is an extension of Fortran (and C/C++) in which you add
parallelisation directives, without worrying about the implementation
details, using a single address space for all instances of the user
program.

Jean Borsenberger                       tel: +33 (0)1 45 07 76 29
Observatoire de Paris Meudon
5 place Jules Janssen
92195 Meudon
France

On Wed, 27 Jul 2005, Duncan wrote:

> Drew Kirkpatrick posted <81469e8e0507270346445f4363@mail.gmail.com>,
> excerpted below, on Wed, 27 Jul 2005 05:46:28 -0500:
>
> > Just to point out, AMD was calling the Opterons and such more of a
> > SUMO configuration (Sufficiently Uniform Memory Organization, not
> > joking here) instead of NUMA. While technically it clearly is a
> > NUMA system, the differences in latency when accessing memory from
> > a bank attached to another processor's memory controller are very
> > small. Small enough to be largely ignored, and treated like the
> > uniform memory access latencies of an SMP system. Sorta in between
> > SMP unified-style memory access and NUMA. This holds for up to 3
> > hypertransport link hops, or up to 8 chips/sockets. Once you add
> > hypertransport switches to scale beyond 8 chips/sockets, it'll
> > most likely be a different story...
>
> I wasn't aware of the AMD "SUMO" moniker, but it /does/ make sense,
> given the design of the hardware. They have a very good point: while
> it's physically NUMA, the latency variances are so close to unified
> that in many ways it's indistinguishable -- except for the fact that
> keeping it NUMA means two different apps running on two different
> CPUs can access their own memory independently and in parallel,
> rather than one having to wait for the other, as they would if the
> memory were interleaved and unified (as it would be for quad-channel
> access, if that were enabled).
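To make my note about OpenMP above a bit more concrete, here is a
minimal sketch, written in C rather than the Fortran we actually use;
the directive works the same way in both (it is spelled
!$OMP PARALLEL DO in Fortran). The arrays and the loop body are
invented purely for illustration:

  #include <stdio.h>
  #include <omp.h>

  #define N 10000000

  static double a[N], b[N];

  int main(void)
  {
      /* The directive splits the loop iterations among the available
       * threads; all of them share the single address space, so no
       * explicit data distribution is needed. */
      #pragma omp parallel for
      for (long i = 0; i < N; i++)
          a[i] = 2.0 * b[i] + 1.0;

      printf("ran with up to %d OpenMP threads\n", omp_get_max_threads());
      return 0;
  }

Built with the compiler's OpenMP flag and run on the quad Opteron,
each thread chews on its own slice of the arrays; with the NUMA option
in the kernel, the pages a thread touches first tend to be allocated
from the memory bank attached to the CPU it runs on, which is
presumably where our 20-30% came from.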
>
> > What I've always wondered is: the NUMA code in the Linux kernel,
> > is this for handling traditional NUMA, like in a large computer
> > system (big iron) where NUMA memory access latencies will vary
> > greatly, or is it simply for optimizing the memory usage across
> > the memory banks? Keeping data in the memory of the processor
> > using it, etc, etc. Of course none of this matters for single
> > chip/socket AMD systems, as dual cores as well as single cores
> > share a memory controller. Hmm, maybe I should drink some coffee
> > and shut up until I'm awake...
>
> Well, yeah, for single-socket/dual-core, but what about dual socket
> (either single core or dual core)? Your questions make sense there,
> and that's what I'm running (single core, tho upgrading to dual core
> for a quad-core-total board sometime next year would be very nice,
> and just might be within the limits of my budget), so yes, I'm
> rather interested!
>
> The answer to your question on how the kernel deals with it, by my
> understanding, is this: the Linux kernel SMP/NUMA architecture
> allows for "CPU affinity grouping". In earlier kernels it was all
> automated, but things are now getting advanced enough to allow
> deliberate manual splitting of the various groups, and, combined
> with userspace control applications, it will ultimately be possible
> to dynamically assign processes to one or more CPU groups of various
> sizes, controlling the CPU and memory resources available to
> individual processes. So, yes, I guess that means it's developing
> some pretty "big iron" qualities, altho many of them are still in
> flux and won't be stable, at least in mainline, for another six
> months or a year, at minimum.
>
> Let's refocus now on the implementation and the smaller picture, to
> examine these "CPU affinity zones" in a bit more detail. The
> following is according to the writeups I've seen, mostly on LWN's
> weekly kernel pages. (Jon Corbet, LWN editor, does a very good job
> of balancing the technical kernel-hacker-level stuff with the
> middle-ground, not-too-technical kernel-follower stuff, good enough
> that I find the site worth subscribing to, even tho I could get even
> the premium content a week later for free. Yes, that's an
> endorsement of the site, because it's where a lot of my info comes
> from, and I'm certainly not one to try to keep my knowledge
> exclusive!)
>
> Anyway... from mainly that source... CPU affinity zones work with
> sets and supersets of processors. An Intel hyperthreading pair of
> virtual processors on the same physical processor will be at the
> highest affinity level (the lowest, aka strongest, grouping in the
> hierarchy), because they share the same cache memory all the way up
> to L1 itself, and the Linux kernel can switch processes between the
> two virtual CPUs of a hyperthreaded CPU with zero cost or loss in
> performance, so it only has to take into account the relative
> balance of processes on each of the hyperthread virtual CPUs.
>
> At the next affinity level down, we'd have the dual-core AMDs: same
> chip, same memory controller, same local memory, same hypertransport
> interfaces to the chipset, other CPUs and the rest of the world, and
> very tightly cooperative, but with separate L2 and of course
> separate L1 caches.
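Coming back to the userspace control of those affinity groups: you can
already pin a process onto a chosen CPU by hand, and on a NUMA kernel
that also tends to keep its data in the local memory bank, since pages
are normally allocated on the node where they are first touched. Here
is a minimal sketch, assuming a 2.6 kernel and a glibc recent enough
to provide the cpu_set_t interface; the choice of CPU is arbitrary and
just for illustration:

  #define _GNU_SOURCE
  #include <sched.h>
  #include <stdio.h>
  #include <stdlib.h>

  int main(int argc, char **argv)
  {
      /* CPU number to pin ourselves to; taken from the command line,
       * defaulting to CPU 0. */
      int cpu = (argc > 1) ? atoi(argv[1]) : 0;
      cpu_set_t set;

      CPU_ZERO(&set);
      CPU_SET(cpu, &set);

      /* pid 0 means "the calling process". From here on the scheduler
       * keeps us on that CPU, and memory we touch afterwards is
       * normally allocated from that CPU's own bank. */
      if (sched_setaffinity(0, sizeof(set), &set) != 0) {
          perror("sched_setaffinity");
          return 1;
      }

      printf("pinned to CPU %d\n", cpu);
      /* ... the real work would go here ... */
      return 0;
  }

The same pinning can be done without touching the program at all, via
the taskset utility (something like "taskset -c 2 ./mybench", where
mybench stands for whatever you want to run).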
> There's a slight performance penalty when switching processes
> between the two cores of such a dual-core chip, due to the cache
> flushing it would entail, but it's only very slight, so the thread
> imbalance between the two cores doesn't have to get bad at all
> before it's worth switching CPUs to maintain balance, even at the
> cost of that cache flush.
>
> At a slightly lower level of affinity would be the Intel dual cores,
> since they aren't quite so tightly coupled and don't share all the
> same interfaces to the outside world. In practice, since only one of
> these, the Intel dual core or the AMD dual core, will normally be
> encountered in real life, they can be treated at the same level,
> with possibly a small internal tweak to the relative weighting of
> thread imbalance vs performance loss for switching CPUs, based on
> which one is actually in place.
>
> Here things get interesting, because of the different
> implementations available. AMD's 2-way through 8-way Opterons
> configured for unified memory access would come first, because,
> again, their dedicated inter-CPU hypertransport links let them
> cooperate more closely than conventional multi-socket CPUs would.
> Beyond that, it's a toss-up between Intel's unified-memory
> multi-processors and AMD's NUMA/SUMO-memory Opterons. I'd still say
> the Opterons cooperate more closely, even in NUMA/SUMO mode, than
> Intel chips do with unified memory, due to that SUMO aspect. At the
> same time, they have the parallel memory access advantages of NUMA.
>
> Beyond that, there are several levels of clustering: local/board;
> off-board but accessible over short fat pipes (using technologies
> such as PCI interconnect, fibre channel, and that SGI interconnect
> tech whose name I don't recall at the moment); conventional (and
> Beowulf?) type clustering; and remote clustering. At each of these
> levels, as with the above, the cost of switching processes between
> peers at the same affinity level gets higher and higher, so the
> process imbalance necessary to trigger a switch likewise gets higher
> and higher, until at the extreme of remote clustering it's almost
> only done manually, or anyway at the level of a user-level
> application managing the transfers rather than the kernel directly
> (since, after all, with remote clustering, each remote group, if not
> each individual machine within it, is probably running its own
> kernel).
>
> So, the point of all that is that the kernel sees a hierarchical
> grouping of CPUs, and is designed with more flexibility to balance
> processes and memory use at the high-affinity end, and more
> hesitation to balance them, due to the higher cost involved, at the
> extremely low affinity end. The main writeup I read on the subject
> dealt with thread/process CPU switching, not memory switching, but
> within the context of NUMA the principles become so intertwined that
> it's impossible to separate them, and the writeup very clearly made
> the point that the memory issues involved in making a transfer were
> included in the cost accounting as well.
>
> I'm not sure whether this addressed the point you were trying to
> make, or landed beside it, but anyway, it was fun trying to put into
> text, for the first time since I read about it, the principles in
> that writeup, along with other facts I've merged in along the way.
> My dad's a teacher, and I remember him many times making the point
> that the best way to learn something is to attempt to teach it.
> He used that principle in his own classes, having the students help
> each other, and I remember him making the point about himself as
> well, at one point: when he found himself teaching a high school
> class on basic accounting, he had to do it based only on a textbook
> and the single college intro-level class he had himself taken years
> before. The principle is certainly true: explaining the affinity
> clustering principles here has forced me to make sure they form a
> reasonable and self-consistent structure in my own head, in order to
> be able to explain them in this post. So, anyway, thanks for the
> intellectual stimulation!
>
> --
> Duncan - List replies preferred. No HTML msgs.
> "Every nonfree program has a lord, a master --
> and if you use the program, he is your master." Richard Stallman in
> http://www.linuxdevcenter.com/pub/a/linux/2004/12/22/rms_interview.html
>
> --
> gentoo-amd64@gentoo.org mailing list
>
-- 
gentoo-amd64@gentoo.org mailing list