From: Robert Myers on
On Nov 8, 3:18 pm, "Andy \"Krazy\" Glew" <ag-n...(a)patten-glew.net>
wrote:
> n...(a)cam.ac.uk wrote:
> > In article <2009Nov8.192...(a)mips.complang.tuwien.ac.at>,
> > Anton Ertl <an...(a)mips.complang.tuwien.ac.at> wrote:
> >> Bernd Paysan <bernd.pay...(a)gmx.de> writes:
> >>> Anton Ertl wrote:
> >>>> So, at least for this benchmark setup, hyperthreading is a significant
> >>>> loss on the Atom.
> >>> Probably not a real surprise.  The Atom is in-order, and SMT probably
> >>> helps when you have many cache misses.  Cache misses in the LaTeX
> >>> benchmark should be rare.
> >> They certainly are on the machines (IIRC Athlon 64 and Pentium 4)
> >> where I measured cache misses.  Ideally SMT would also help when the
> >> functional units are not completely utilized even with loads hitting
> >> the D-cache (which is probably quite frequent on an in-order machine),
> >> but I don't know if that's the case for the Atom.
>
> >> In any case, no speedup from SMT is one thing, but a significant
> >> slowdown is pretty disappointing.  Unless you know that you run lots
> >> of code that benefits from SMT, it's probably better to disable SMT on
> >> the Atom.
>
> > And not just on the Atom.  I ran some tests on the Core i7, and got
> > a degradation of throughput by using more threads.  My limited
> > experience is that applies to virtually anything where the bottleneck
> > is memory accesses.  There MAY be some programs where SMT helps with
> > cache misses, but I haven't seen them.
>
> > Where I think that it helps is with heterogeneous process mixtures;
> > e.g. one is heavy on floating-point, another on memory accesses, and
> > another on branching.  I could be wrong, as that's based on as much
> > guesswork as knowledge, but it matches what I know.
>
> This is interesting.  What Nick says about heterogeneous workloads is certainly
> true - e.g. a compute-intensive, non-cache-missing thread to switch to
> when a memory-intensive thread cache misses.
> (Or, rather, one that is always running, and which keeps running when the
> memory-intensive thread cache misses.)
>
> However, in theory two memory-intensive threads should be able to coexist,
> each computing while the other is stalled.  E.g. two cache-missing, pointer-chasing
> threads should be able to practically double throughput.
> (I've usually been on the other side of this argument, since as comp.arch
> knows I am the leading exponent of single threaded MLP architectures.
> My opponents in industry would usually say "Can't you just get MLP from TLP?"
> and I would have to say "Yes, but...".)
>
> That so many people find threading a lossage for memory intensive workloads
> (and it is not just these comp.arch posters - most people in the supercomputer
> community disable hyperthreading) implies
>
> a) workloads that are already highly MLP, e.g. throughput limited workloads
>
> b) lousy threading microarchitectures. Which is typical - so many Intel processors
> arbitrarily split the instruction window in half, giving half to the compute-intensive
> thread, which does not need the window, and only half to the cache-missing thread,
> which could use more.
>
> c) contention between threads - e.g. thrashing out of useful D$ state.
>
> It's ironic: take one long-latency L3 cache miss to DRAM, and the chances of taking
> more go up - because the other threads, which may only be taking L1 misses to L2,
> are thrashing your state out of the caches.   Positive feedback.
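
The quoted claim about two pointer-chasing threads is at least testable.
Here is a minimal sketch (mine, not Glew's; pthreads assumed, sizes
illustrative) of the kind of microbenchmark that would test it: each
thread walks its own randomly permuted cyclic list, so nearly every load
misses, and if the SMT hardware overlaps the two threads' misses, the
combined rate should approach twice the single-thread rate.

/* Build one big random cycle per thread and chase pointers around it.
   Time the chase with one thread, then with two threads pinned to the
   two hyperthread siblings of one core, and compare total throughput. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define NODES (1 << 24)    /* 128 MB of size_t: far larger than any cache */
#define STEPS (1L << 26)

static size_t *make_cycle(void)
{
    size_t *next = malloc(NODES * sizeof *next);
    for (size_t i = 0; i < NODES; i++)
        next[i] = i;
    /* Sattolo's shuffle: guarantees one single cycle.  rand() is crude
       but adequate for a sketch. */
    for (size_t i = NODES - 1; i > 0; i--) {
        size_t j = rand() % i, t = next[i];
        next[i] = next[j];
        next[j] = t;
    }
    return next;
}

static void *chase(void *arg)
{
    size_t *next = arg, i = 0;
    for (long s = 0; s < STEPS; s++)
        i = next[i];          /* serially dependent: one miss in flight */
    return (void *)i;         /* defeat dead-code elimination */
}

int main(void)
{
    size_t *a = make_cycle(), *b = make_cycle();
    pthread_t t1, t2;
    /* time this region once with one thread, once with both */
    pthread_create(&t1, NULL, chase, a);
    pthread_create(&t2, NULL, chase, b);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    free(a);
    free(b);
    return 0;
}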

I don't know how you can discuss hyper-threading without discussing
the scheduler. There is a recent discussion on lkml.org, which seems,
well, primitive

http://lkml.org/lkml/2009/10/28/287

It refers to an Intel document

http://software.intel.com/sites/oss/pdfs/mclinux.pdf

which also seems primitive.

As to "memory-intensive." Does someone really mean "memory-bound?"
If something is memory-bound, which many HPC applications are, that's
it. Either you optimally use bandwidth or you don't. If a single
thread is memory-bound, then SMT is a loser. If a single thread on a
single core is memory bound, then using more than one core is a loser,
too.

Robert.
From: Andrew Reilly on
On Sun, 08 Nov 2009 19:01:21 -0800, Robert Myers wrote:

> I don't know how you can discuss hyper-threading without discussing the
> scheduler.

Why is that? I thought that schedulers were largely ignorant of SMT
threads, other than, perhaps, treating them as pairs of cores with a
fully shared cache. Should the scheduler take notice of the uber-NUMA
characteristics of the pair of shared virtual processors and schedule
only appropriately matched processes on each? I think that there is a
certain amount of NUMA awareness in most modern (Unix) schedulers, but
no doubt there could be more. I haven't heard of any that (for
example) opt to schedule a process with active FPU state and one
without on the same physical CPU. Could be interesting? It seems to
me from this discussion that it's not at all clear what
characteristics would ideally be selected for in making such a
decision. [*] Have threads from the same process share an SMT core,
on the grounds that they might also share hot cache lines and save
some fetches, or have them use separate cores, on the grounds that
they want to work on separate data, and more cache is better?

Seems like an intractable problem to me.

Maybe we could add some sort of "progress made good" hint that
applications could provide to the OS, so that it would have a better
chance at scheduling them stochastically?
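
Short of such a hint interface, the placement experiment can at least
be run by hand on Linux today: pin two threads either onto the two
hardware threads of one core or onto two separate cores, and compare
run times. A rough sketch (the CPU numbers are an assumption - check
/sys/devices/system/cpu/cpuN/topology/thread_siblings_list for the
real pairing on a given machine):

/* Sketch: pin two worker threads either to SMT siblings of one core
   or to two separate cores, and compare timings of the two layouts. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static void pin_to(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof set, &set);
}

static void *worker(void *arg)
{
    pin_to((int)(long)arg);
    /* ... run the workload under test here ... */
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    /* same physical core (e.g. CPUs 0 and 2 on many Intel boxes)
       vs. different cores (e.g. CPUs 0 and 1): edit and re-run */
    pthread_create(&t1, NULL, worker, (void *)0L);
    pthread_create(&t2, NULL, worker, (void *)2L);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}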

[*] We often hear of loads that perform worse with SMT. Are there
equivalent rules of thumb for load classes that *do* show improvement
with HyperThreading turned on?

Cheers,

--
Andrew
From: nmm1 on
In article <87aaywfvwo.fsf(a)ami-cg.GraySage.com>,
Chris Gray <cg(a)graysage.com> wrote:
>
>I didn't do performance stuff on the Tera MTA, but I'm thinking that
>from this discussion you could view it as having 128-way SMT, with
>only one memory interface. However, that one memory interface could
>have an outstanding fetch for each of the 128 threads, so maybe that
>means it had 128 memory interfaces for this purpose. Things did
>speed up with more threads running. Perhaps the relative costs of
>the various activities were so different that the comparison doesn't
>work?

Yes. That is why the effects are somewhat puzzling.

>Excuse my ignorance here - are today's memory systems limited to one
>outstanding fetch per CPU memory interface?

No. But the rules are non-trivial.


Regards,
Nick Maclaren.
From: Robert Myers on
On Nov 9, 2:52 am, Andrew Reilly <areilly...(a)bigpond.net.au> wrote:
> On Sun, 08 Nov 2009 19:01:21 -0800, Robert Myers wrote:
> > I don't know how you can discuss hyper-threading without discussing the
> > scheduler.
>
> Why is that?  I thought that schedulers were largely ignorant of SMT
> threads, other than, perhaps, treating them as pairs of cores with a
> fully shared cache.  Should the scheduler take notice of the uber-NUMA
> characteristics of the pair of shared virtual processors and schedule
> only appropriately matched processes on each?  I think that there is a
> certain amount of NUMA awareness in most modern (Unix) schedulers, but
> no doubt there could be more.  I haven't heard of any that (for
> example) opt to schedule a process with active FPU state and one
> without on the same physical CPU.  Could be interesting?  It seems to
> me from this discussion that it's not at all clear what
> characteristics would ideally be selected for in making such a
> decision.  [*] Have threads from the same process share an SMT core,
> on the grounds that they might also share hot cache lines and save
> some fetches, or have them use separate cores, on the grounds that
> they want to work on separate data, and more cache is better?
>
> Seems like an intractable problem to me.
>
> Maybe we could add some sort of a notion of "progress made good" hint
> that applications could provide to the OS, so that it could have a better
> chance at scheduling them stochastically?

The current Linux scheduler is SMT-aware. It knows which "processors"
are on the same core and will load balance so that two CPU-hungry
threads won't compete on the same physical core.
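
The kernel gets that sibling map from the hardware and exports it, so
you can see exactly what the scheduler sees. A small sketch (Linux
sysfs assumed, gap-free CPU numbering assumed) that just prints it:

/* Print each logical CPU's SMT sibling list as Linux sees it, from
   /sys/devices/system/cpu/cpuN/topology/thread_siblings_list.  This
   is the same topology information the SMT-aware scheduler uses. */
#include <stdio.h>

int main(void)
{
    char path[128], buf[64];
    for (int cpu = 0; ; cpu++) {
        snprintf(path, sizeof path,
                 "/sys/devices/system/cpu/cpu%d/topology/thread_siblings_list",
                 cpu);
        FILE *f = fopen(path, "r");
        if (!f)
            break;                          /* no more CPUs */
        if (fgets(buf, sizeof buf, f))
            printf("cpu%d siblings: %s", cpu, buf);  /* buf keeps its '\n' */
        fclose(f);
    }
    return 0;
}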

I can imagine all kinds of possibilities that would monitor activity
in more detail and attempt to place threads accordingly, but I've
heard no one who would be up to the task propose such a thing.

> [*] We often hear of loads that perform worse with SMT.  Are there
> equivalent rules of thumb for load classes that *do* show improvement
> with HyperThreading turned on?

The best results I saw for the P4 were as much as a 35% improvement
for a chess-playing program. Lots of pointer chasing?

Robert.

From: James Van Buskirk on
"Robert Myers" <rbmyersusa(a)gmail.com> wrote in message
news:1f518c33-55c6-4b2c-8ef9-5229534cfbd2(a)j4g2000yqe.googlegroups.com...

> On Nov 9, 2:52 am, Andrew Reilly <areilly...(a)bigpond.net.au> wrote:

> > [*] We often hear of loads that perform worse with SMT. Are there
> > equivalent rules of thumbs for load classes that *do* show improvement
> > with HyperThreading turned on?

> The best results I saw for the P4 were as much as a 35% improvement
> for a chess-playing program. Lots of pointer chasing?

http://www.mikusite.de/pages/x86.htm

Scroll down to the last table of results and compare Intel Dual Xeon
Nocona 2800 MHz with HT on/off in the FPU speed column: 320.813 vs.
177.028 million iterations/second - roughly an 81% gain with HT on.

--
write(*,*) transfer((/17.392111325966148d0,6.5794487871554595D-85, &
6.0134700243160014d-154/),(/'x'/)); end