To: gentoo-amd64@lists.gentoo.org
From: Duncan <1i5t5.duncan@cox.net>
Subject: [gentoo-amd64] Re: Re: amd64 and kernel configuration
Date: Wed, 27 Jul 2005 08:42:14 -0700
Organization: Sometimes
References: <20050727062947.73020.qmail@mail.mng.mn>
 <25f58b7910e09fd5453bb3ec534330d1@xsmail.com>
 <20050727075012.79549.qmail@mail.mng.mn>
 <81469e8e0507270346445f4363@mail.gmail.com>
Reply-to: gentoo-amd64@lists.gentoo.org
User-Agent: Pan/0.14.2.91 (As She Crawled Across the Table)

Drew Kirkpatrick posted <81469e8e0507270346445f4363@mail.gmail.com>,
excerpted below, on Wed, 27 Jul 2005 05:46:28 -0500:

> Just to point out, AMD was calling the Opterons and such more of a
> SUMO configuration (Sufficiently Uniform Memory Organization, not
> joking here), instead of NUMA.  While technically it clearly is a
> NUMA system, the difference in latency when accessing memory from a
> bank attached to another processor's memory controller is very
> small.  Small enough to be largely ignored, and treated like the
> uniform memory access latencies of an SMP system.  Sorta in between
> SMP unified-style memory access and NUMA.  This holds for up to 3
> HyperTransport link hops, or up to 8 chips/sockets.  Add
> HyperTransport switches to scale beyond 8 chips/sockets, and it'll
> most likely be a different story...

I wasn't aware of the AMD "SUMO" moniker, but it /does/ make sense,
given the design of the hardware.  They have a very good point: while
it's physically NUMA, the latency variances are so close to uniform
that in many ways it's indistinguishable -- except for the fact that
keeping it NUMA means two different apps running on two different CPUs
can each access their own memory independently and in parallel, rather
than one having to wait for the other, as they would if the memory
were interleaved and unified (as it would be for quad-channel access,
if that were enabled).
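Just to make that difference concrete -- purely as a rough, untested
sketch of my own, assuming the numactl/libnuma userspace library is
installed and you link with -lnuma -- here's what the two placement
styles look like from userspace:

  /* NUMA-local vs. interleaved allocation via libnuma.
     Build: gcc numa-sketch.c -o numa-sketch -lnuma */
  #include <numa.h>
  #include <stdio.h>
  #include <string.h>

  int main(void)
  {
      size_t len = 64 * 1024 * 1024;   /* 64 MB working set */
      void *local, *spread;

      if (numa_available() < 0) {
          fprintf(stderr, "no NUMA support in this kernel\n");
          return 1;
      }

      /* "SUMO/NUMA" placement: pages come from the bank attached to
         the memory controller of the node we're running on. */
      local = numa_alloc_local(len);

      /* "Unified" style: pages are spread round-robin across all the
         nodes, so every access sees the averaged latency instead. */
      spread = numa_alloc_interleaved(len);

      if (!local || !spread) {
          fprintf(stderr, "allocation failed\n");
          return 1;
      }

      memset(local, 0, len);    /* touch the pages to place them */
      memset(spread, 0, len);

      numa_free(local, len);
      numa_free(spread, len);
      return 0;
  }

With the first allocation, two processes on two sockets can each
stream out of their own bank in parallel; with the second, everybody
shares the averaged latency of all the banks, which is exactly the
tradeoff described above.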
> What I've always wondered is: the NUMA code in the Linux kernel, is
> this for handling traditional NUMA, as in a large computer system
> (big iron) where NUMA memory access latencies will vary greatly, or
> is it simply for optimizing memory usage across the memory banks,
> keeping data in the memory of the processor using it, etc, etc?  Of
> course none of this matters for single-chip/socket AMD systems, as
> dual cores as well as single cores share a memory controller.  Hmm,
> maybe I should drink some coffee and shut up until I'm awake...

Well, yeah, for single-socket/dual-core, but what about dual socket
(either single or dual core)?  Your questions make sense there, and
that's what I'm running (single core, tho upgrading to dual core for
four cores total on the board sometime next year would be very nice,
and just might be within the limits of my budget), so yes, I'm rather
interested!

The answer to your question on how the kernel deals with it, by my
understanding, is this:  The Linux kernel SMP/NUMA architecture allows
for "CPU affinity grouping".  In earlier kernels it was all automated,
but things are now getting advanced enough to allow deliberate manual
splitting of the various groups, which, combined with userspace
control applications, will ultimately allow dynamically assigning
processes to one or more CPU groups of various sizes, controlling the
CPU and memory resources available to individual processes.  So, yes,
I guess that means it's developing some pretty "big iron" qualities,
altho many of them are still in flux and won't be stable, in mainline
at least, for another six months to a year.

Let's refocus on the implementation and the smaller picture, to
examine these "CPU affinity zones" in a bit more detail.  The
following is according to the writeups I've seen, mostly on LWN's
weekly kernel pages.  (Jon Corbet, LWN editor, does a very good job of
balancing the technical kernel-hacker-level stuff with the
middle-ground, not-too-technical kernel-follower stuff, well enough
that I find the site worth subscribing to, even tho I could get even
the premium content a week later for free.  Yes, that's an endorsement
of the site, because it's where a lot of my info comes from, and I'm
certainly not one to try to keep my knowledge exclusive!)  Anyway...
from mainly that source...

CPU affinity zones work with sets and supersets of processors.  An
Intel hyperthreading pair of virtual processors on the same physical
processor will be at the highest affinity -- the lowest, most tightly
bound level of the hierarchy -- because they share the same cache
memory all the way up to L1 itself, so the Linux kernel can switch
processes between the two virtual CPUs of a hyperthreaded CPU at
essentially zero cost, taking into account only the relative balance
of processes on each of the two virtual CPUs.

At the next affinity level down, we'd have the dual-core AMDs: same
chip, same memory controller, same local memory, same HyperTransport
interfaces to the chipset, the other CPUs, and the rest of the world,
very tightly cooperative, but with separate L2 and of course separate
L1 caches.  There's a slight performance penalty to switching
processes between these two CPUs, due to the cache flushing and
refilling it entails, but it's only very slight, so the thread
imbalance between the two cores doesn't have to get bad at all before
it's worth switching CPUs to maintain balance, even at that cost.
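As an aside, the userspace hook that already exists for doing that
sort of pinning by hand is sched_setaffinity().  A minimal, untested
sketch (the CPU numbering is purely an assumption for illustration;
which physical resources logical CPUs 0 and 1 actually share --
hyperthread siblings, two cores on one chip, or two sockets -- depends
entirely on the box):

  /* Pin the calling process to logical CPUs 0 and 1. */
  #define _GNU_SOURCE
  #include <sched.h>
  #include <stdio.h>
  #include <unistd.h>

  int main(void)
  {
      cpu_set_t mask;

      CPU_ZERO(&mask);
      CPU_SET(0, &mask);
      CPU_SET(1, &mask);

      /* pid 0 means "the calling process" */
      if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
          perror("sched_setaffinity");
          return 1;
      }

      /* From here on, the scheduler only balances this process (and
         its children, which inherit the mask) between CPUs 0 and 1. */
      printf("pinned pid %d to CPUs 0-1\n", (int)getpid());
      return 0;
  }

Anyway, back to walking down the hierarchy...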
At a slightly lower level of affinity would be the Intel dual cores,
since they aren't quite so tightly coupled and don't share all the
same interfaces to the outside world.  In practice, since only one of
the two, Intel or AMD dual core, will normally be found in any given
system, they can be treated at the same level, with possibly a small
internal tweak to the relative weighting of thread imbalance vs. the
performance cost of switching CPUs, based on which one is actually in
place.

Here things get interesting, because of the different implementations
available.  AMD's 2-way thru 8-way Opterons configured for unified
memory access would come next, because again, their dedicated
inter-CPU HyperTransport links let them cooperate more closely than
conventional multi-socket CPUs would.  Beyond that, it's a tossup
between Intel's unified-memory multi-processors and AMD's
NUMA/SUMO-memory Opterons.  I'd still say the Opterons cooperate more
closely, even in NUMA/SUMO mode, than Intel chips do with unified
memory, due to that SUMO aspect, while at the same time they keep the
parallel memory access advantages of NUMA.

Beyond that, there are several levels of clustering: local/on-board;
off-board but reachable over a short fat pipe (using technologies such
as PCI interconnect, Fibre Channel, and that SGI interconnect tech
whose name I don't recall at the moment); conventional (and Beowulf?)
clustering; and remote clustering.  At each of these levels, as with
the ones above, the cost of switching processes between peers at the
same affinity level gets higher and higher, so the process imbalance
necessary to trigger a switch likewise gets higher and higher, until
at the extreme of remote clustering it's almost only done manually, or
anyway at the level of a userspace application managing the transfers
rather than the kernel directly (since, after all, with remote
clustering, each remote group, if not each individual machine within
the group, is probably running its own kernel).

So, the point of all that is that the kernel sees a hierarchical
grouping of CPUs, and is designed with more freedom to balance
processes and memory use at the high-affinity end, and more hesitation
to balance, due to the higher cost involved, at the low-affinity end.
The main writeup I read on the subject dealt with thread/process CPU
switching, not memory switching, but within the context of NUMA the
principles become so intertwined it's impossible to separate them, and
the writeup very clearly made the point that the memory issues
involved in making the transfer were included in the cost accounting
as well.
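On the memory side of that cost accounting, the knob userspace gets is
mbind(), which ties a buffer's pages to a particular node.  Again,
this is only a rough, untested sketch of my own, with node 0 purely as
an example, assuming libnuma's <numaif.h> wrappers and linking with
-lnuma:

  /* Bind one anonymous mapping's pages to node 0. */
  #define _GNU_SOURCE
  #include <numaif.h>
  #include <sys/mman.h>
  #include <stdio.h>
  #include <string.h>

  int main(void)
  {
      size_t len = 16 * 1024 * 1024;
      unsigned long nodemask = 1UL << 0;   /* node 0 only */
      void *buf;

      buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      if (buf == MAP_FAILED) {
          perror("mmap");
          return 1;
      }

      if (mbind(buf, len, MPOL_BIND, &nodemask,
                sizeof(nodemask) * 8, 0) != 0) {
          perror("mbind");
          return 1;
      }

      memset(buf, 0, len);   /* first touch: pages land on node 0 */

      munmap(buf, len);
      return 0;
  }

Whether it's then worth migrating the process away from the node its
memory is bound to is exactly the kind of cost that gets weighed
against the imbalance.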
I'm not sure whether this addressed the point you were trying to make
or hit beside it, but anyway, it was fun trying to put the principles
in that writeup, along with other facts I've merged along the way,
into text for the first time since I read about it.  My dad's a
teacher, and I remember him many times making the point that the best
way to learn something is to attempt to teach it.  He used that
principle in his own classes, having the students help each other, and
I remember him making the point about himself as well, at one point,
as he struggled to teach basic accounting principles when he found
himself teaching a high school class on the subject, armed only with
the textbook and the single college intro-level class he had himself
taken years before.  The principle is certainly true: explaining the
affinity clustering principles here has forced me to make sure they
form a reasonable and self-consistent picture in my own head, in order
to be able to explain them in this post.

So, anyway, thanks for the intellectual stimulation!

-- 
Duncan - List replies preferred.  No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."
Richard Stallman in
http://www.linuxdevcenter.com/pub/a/linux/2004/12/22/rms_interview.html

-- 
gentoo-amd64@gentoo.org mailing list