public inbox for gentoo-amd64@lists.gentoo.org
From: Jean.Borsenberger@obspm.fr
To: gentoo-amd64@lists.gentoo.org
Subject: Re: [gentoo-amd64]  Re: Re: amd64 and kernel configuration
Date: Wed, 27 Jul 2005 19:07:17 +0200 (CEST)	[thread overview]
Message-ID: <Pine.LNX.4.58.0507271900500.14425@siolinh.obspm.fr> (raw)
In-Reply-To: <pan.2005.07.27.15.42.13.414249@cox.net>

	Well, maybe it's SUMO, but when we switched on the NUMA option in
the kernel of our quad-processor, 16 GB Opteron machine, it sped up the
OpenMP benchmarks by 20% to 30% (depending on the program considered).

Note: OpenMP is a set of parallelisation directives (for Fortran, and
also C/C++) that you add to a program without worrying about the
implementation details; all instances of the user program share a single
address space.

Jean Borsenberger
tel: +33 (0)1 45 07 76 29
Observatoire de Paris Meudon
5 place Jules Janssen
92195 Meudon France

On Wed, 27 Jul 2005, Duncan wrote:

> Drew Kirkpatrick posted <81469e8e0507270346445f4363@mail.gmail.com>,
> excerpted below,  on Wed, 27 Jul 2005 05:46:28 -0500:
>
> > Just to point out, AMD was calling the Opterons and such more of a SUMO
> > configuration (Sufficiently Uniform Memory Organization, not joking here),
> > instead of NUMA. Whereas technically it clearly is a NUMA system, the
> > differences in latency when accessing memory from a bank attached to
> > another processor's memory controller are very small. Small enough to be
> > largely ignored, and treated like uniform memory access latencies in an SMP
> > system. Sorta in between SMP unified-style memory access and NUMA. This
> > holds for up to 3 HyperTransport link hops, or up to 8 chips/sockets. If you
> > add HyperTransport switches to scale beyond 8 chips/sockets, it'll most
> > likely be a different story...
>
> I wasn't aware of the AMD "SUMO" moniker, but it /does/ make sense, given
> the design of the hardware.  They have a very good point, that while it's
> physically NUMA, the latency variances are so close to uniform that in
> many ways it's indistinguishable -- except for the fact that keeping it
> NUMA means allowing independent access of two different apps running on
> two different CPUs, to their own memory in parallel, rather than one
> having to wait for the other, if the memory were interleaved and unified
> (as it would be for quad channel access, if that were enabled).
>
> > What I've always wondered is, the NUMA code in the linux kernel, is this
> > for handling traditional NUMA, like in a large computer system (big iron)
> > where NUMA memory access latencies will vary greatly, or is it simply for
> > optimizing the memory usage across the memory banks. Keeping data in the
> > memory of the processor using it, etc, etc. Of course none of this matters
> > for single chip/socket amd systems, as dual cores as well as single cores
> > share a memory controller. Hmm, maybe I should drink some coffee and
> > shutup until I'm awake...
>
> Well, yeah, for single-socket/dual-core, but what about dual socket
> (either single core or dual core)?  Your questions make sense there, and
> that's what I'm running (single core, tho upgrading to dual core for a
> quad-core total board sometime next year, would be very nice, and just
> might be within the limits of my budget), so yes, I'm rather interested!
>
> The answer to your question on how the kernel deals with it, by my
> understanding, is this:  The Linux kernel SMP/NUMA architecture allows for
> "CPU affinity grouping".  In earlier kernels, it was all automated, but
> they are actually getting advanced enough now to allow deliberate manual
> splitting of various groups, and combined with userspace control
> applications, will ultimately be able to dynamically assign processes to
> one or more CPU groups of various sizes, controlling the CPU and memory
> resources available to individual processes.  So, yes, I guess that means
> it's developing some pretty "big iron" qualities, altho many of them are
> still in flux and won't be stable at least in mainline for another six
> months or a year, at minimum.
>
> Let's refocus now back on the implementation and the smaller picture once
> again, to examine these "CPU affinity zones" in a bit more detail.  The
> following is according to the writeups I've seen, mostly on LWN's weekly
> kernel pages.   (Jon Corbet, LWN editor, does a very good job of balancing
> the technical kernel hacker level stuff with the middle-ground
> not-too-technical kernel follower stuff, good enough that I find the site
> useful enough to subscribe, even tho I could get even the premium content
> a week later for free.  Yes, that's an endorsement of the site, because
> it's where a lot of my info comes from, and I'm certainly not one to try
> to keep my knowledge exclusive!)
>
> Anyway... from mainly that source...  CPU affinity zones work with sets
> and supersets of processors.  An Intel hyperthreading pair of virtual
> processors on the same physical processor will be at the highest affinity
> level, the lowest level aka strongest grouping in the hierarchy, because
> they share the same cache memory all the way up to L1 itself, and the
> Linux kernel can switch processes between the two virtual CPUs of a
> hyperthreaded CPU with zero cost or loss in performance, therefore only
> taking into account the relative balance of processes on each of the
> hyperthread virtual CPUs.
>
> At the next lowest level affinity, we'd have the dual-core AMDs, same
> chip, same memory controller, same local memory, same hypertransport
> interfaces to the chipset, other CPUs and the rest of the world, and very
> tightly cooperative, but with separate L2 and of course separate L1 cache.
> There's a slight performance penalty in switching processes between
> these CPUs, due to the cache flushing it entails, but it's only very
> slight and quite speedy, so the thread imbalance between the two cores
> doesn't have to get bad at all before it's worth migrating a process to
> maintain balance, even at the cost of that cache flush.
>
> At a slightly lower level of affinity would be the Intel dual cores, since
> they aren't quite so tightly coupled, and don't share all the same
> interfaces to the outside world.  In practice, since only one of these,
> the Intel dual core or the AMD dual core, will normally be encountered in
> real life, they can be treated at the same level, with possibly a small
> internal tweak to the relative weighting of thread imbalance vs
> performance loss for switching CPUs, based on which one is actually in
> place.
>
> Here things get interesting, because of the different implementations
> available.  AMD's 2-way thru 8-way Opterons configured for unified memory
> access would be first, because again, their dedicated inter-CPU
> hypertransport links let them cooperate closer than conventional
> multi-socket CPUs would.  Beyond that, it's a tossup between Intel's
> unified memory multi-processors and AMD's NUMA/SUMO memory Opterons.  I'd
> still say the Opterons cooperate closer, even in NUMA/SUMO mode, than
> Intel chips will with unified memory, due to that SUMO aspect.  At the
> same time, they have the parallel memory access advantages of NUMA.
>
> Beyond that, there's several levels of clustering, local/board, off-board
> but short-fat-pipe accessible (using technologies such as PCI
> interconnect, fibre-channel, and that SGI interconnect tech whose name I
> don't recall at the moment), conventional (and Beowulf?) type clustering, and remote
> clustering. At each of these levels, as with the above, the cost to switch
> processes between peers at the same affinity level gets higher and higher,
> so the corresponding process imbalance necessary to trigger a switch
> likewise gets higher and higher, until at the extreme of remote
> clustering, it's almost done manually only, or anyway at the level of a
> user level application managing the transfers, rather than the kernel,
> directly (since, after all, with remote clustering, each remote group is
> probably running its own kernel, if not individual machines within that
> group).
>
> So, the point of all that is that the kernel sees a hierarchical grouping
> of CPUs, and is designed with more flexibility to balance processes and
> memory use at the extreme affinity end, and more hesitation to balance it
> due to the higher cost involved, at the extremely low affinity end.  The
> main writeup I read on the subject dealt with thread/process CPU
> switching, not memory switching, but within the context of NUMA, the
> principles become so intertwined it's impossible to separate them, and the
> writeup very clearly made the point that the memory issues involved in
> making the transfer were included in the cost accounting as well.
>
> I'm not sure whether this addressed the point you were trying to make, or
> hit beside it, but anyway, it was fun trying to put into text for the
> first time since I read about it, the principles in that writeup, along
> with other facts I've merged along the way.  My dad's a teacher, and I
> remember him many times making the point that the best way to learn
> something is to attempt to teach it.  He used that principle in his own
> classes, having the students help each other, and I remember him making
> the point about himself as well, at one point, as he struggled to teach
> basic accounting principles based only on a textbook and the single
> college intro level class he had himself taken years before, when he found
> himself teaching a high school class on the subject.  The principle is
> certainly true, as by explaining the affinity clustering principles here,
> it has forced me to ensure they form a reasonable and self-consistent
> infrastructure in my own head, in order to be able to explain it in the
> post.  So, anyway, thanks for the intellectual stimulation!  <g>
>
> --
> Duncan - List replies preferred.   No HTML msgs.
> "Every nonfree program has a lord, a master --
> and if you use the program, he is your master."  Richard Stallman in
> http://www.linuxdevcenter.com/pub/a/linux/2004/12/22/rms_interview.html
>
>
-- 
gentoo-amd64@gentoo.org mailing list




Thread overview: 18+ messages
2005-07-27  6:29 [gentoo-amd64] amd64 and kernel configuration Dulmandakh Sukhbaatar
2005-07-27  6:10 ` [gentoo-amd64] " Duncan
2005-07-27  6:19   ` NY Kwok
2005-07-27  7:50     ` Dulmandakh Sukhbaatar
2005-07-27  7:04       ` Michal Žeravík
2005-07-27  9:58         ` netpython
2005-07-27 12:30           ` Brett Johnson
2005-07-27 15:58             ` [gentoo-amd64] " Duncan
2005-07-27 10:02         ` Duncan
2005-07-27 10:13       ` [gentoo-amd64] " Duncan
2005-07-27 10:27         ` Paolo Ripamonti
2005-07-27 14:19           ` [gentoo-amd64] " Duncan
2005-07-27 14:31             ` Paolo Ripamonti
2005-07-27 16:16               ` [gentoo-amd64] " Duncan
2005-07-27 10:46         ` [gentoo-amd64] " Drew Kirkpatrick
2005-07-27 15:42           ` [gentoo-amd64] " Duncan
2005-07-27 17:07             ` Jean.Borsenberger [this message]
2005-07-27 10:18     ` Duncan
