From: nmm1 on
In article <4AF727AA.20207(a)patten-glew.net>,
Andy "Krazy" Glew <ag-news(a)patten-glew.net> wrote:
>>
>> And not just on the Atom. I ran some tests on the Core i7, and got
>> a degradation of throughput by using more threads. My limited
>> experience is that applies to virtually anything where the bottleneck
>> is memory accesses. There MAY be some programs where SMT helps with
>> cache misses, but I haven't seen them.
>>
>> Where I think that it helps is with heterogeneous process mixtures;
>> e.g. one is heavy on floating-point, another on memory accesses, and
>> another on branching. I could be wrong, as that's based on as much
>> guesswork as knowledge, but it matches what I know.
>
>This is interesting. What Nick says about heterogeneous workloads is certainly
>true - e.g. a compute-intensive, non-cache-missing thread to switch to
>when a memory-intensive thread cache misses.
>(Or, rather, that is always running, and which keeps running when the memory
>intensive thread cache misses.)
>
>However, in theory two memory intensive threads should be able to coexist
>- computing when the other thread is idle. E.g. two cache missing pointer chasing
>threads should be able to practically double throughput.

Yes. I am puzzled by the slowdowns I have seen, and which have been
reported to me by reliable sources, but none of us have had the time
to investigate the matter in depth. The issue is certainly rather
more complicated than the simplistic analyses make out.

It is quite possible that my description above is also simplistic,
and assigns the cause incorrectly.
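
For concreteness, a minimal sketch of the kind of pointer-chasing kernel
under discussion -- illustrative only, not the code any of us actually
ran, and all the names and sizes are made up:

/* Pointer-chasing kernel: each thread walks a randomly permuted ring
   of cache-line-sized nodes, so every load is a likely miss that
   depends on the previous one.  Build with "cc -O2 -pthread", time
   1 vs. 2 threads under time(1), and pin both threads to the two
   contexts of one core (e.g. with taskset; sibling numbering varies)
   to exercise SMT.  Sketch only; error handling omitted. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define NODES (1 << 22)   /* 4M nodes * 64 bytes = 256 MiB, far beyond cache */
#define STEPS (1 << 24)

struct node { struct node *next; char pad[56]; };  /* one 64-byte line */

static void *chase(void *arg)
{
    struct node *p = arg;
    for (long i = 0; i < STEPS; i++)
        p = p->next;            /* serial dependence: one miss at a time */
    return p;                   /* keep the chase from being optimized away */
}

int main(void)
{
    struct node *a = malloc(NODES * sizeof *a);
    if (!a) return 1;
    /* Sattolo's algorithm on the next pointers yields one random
       cycle through all the nodes, defeating hardware prefetchers. */
    for (size_t i = 0; i < NODES; i++)
        a[i].next = &a[i];
    for (size_t i = NODES - 1; i > 0; i--) {
        size_t j = random() % i;
        struct node *t = a[i].next; a[i].next = a[j].next; a[j].next = t;
    }
    pthread_t t1, t2;
    pthread_create(&t1, NULL, chase, &a[0]);
    pthread_create(&t2, NULL, chase, &a[1]); /* second chaser; drop for baseline */
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    puts("done");
    return 0;
}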


Regards,
Nick Maclaren.
From: Bernd Paysan on
Andy "Krazy" Glew wrote:
> However, in theory two memory intensive threads should be able to
> coexist
> - computing when the other thread is idle. E.g. two cache missing
> pointer chasing threads should be able to practically double
> throughput.

How? Let's assume there is only one memory interface: the pointer
chasing threads will ask for new memory data as soon as they have got
their previous cache line, and therefore just compete for the same
resource. There can be an advantage when there are two memory
interfaces: the two threads land on the same interface half the time
(1x throughput) and on different ones the other half (2x), so in SMT
mode throughput should go up to about 150% (or even more with three
memory interfaces). However, as the Core i7 is already a quad-core,
four native threads already compete for three memory channels, and
therefore adding SMT threads can't possibly help for this kind of
stuff.

If there's a moderate cache miss rate, and the thread still does useful
work between the memory requests, so that memory bandwidth is only
about 50% utilized (and the misses account for about 50% of the
execution time), then SMT should help.
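
To make that concrete, here is a drop-in variant of chase() from the
sketch up-thread (the constant and alu_iters are guesses of mine; tune
alu_iters until one thread keeps the interface about half busy):

/* Each miss is followed by a dependent ALU chain that takes roughly
   one miss latency, so a single thread leaves ~50% of the memory
   interface idle for a second SMT thread to fill. */
static long chase_with_work(struct node *p, long steps, int alu_iters)
{
    long acc = 0;
    for (long i = 0; i < steps; i++) {
        p = p->next;                            /* one likely cache miss */
        for (int k = 0; k < alu_iters; k++)     /* dependent ALU work */
            acc = acc * 6364136223846793005L + 1;
    }
    return acc + (long)(p != NULL);             /* keep both chains live */
}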

--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/
From: Niels Jørgen Kruse on
Anton Ertl <anton(a)mips.complang.tuwien.ac.at> wrote:

> Bernd Paysan <bernd.paysan(a)gmx.de> writes:
> >Anton Ertl wrote:
> >> So, at least for this benchmark setup, hyperthreading is a significant
> >> loss on the Atom.
> >
> >Probably not a real surprise. The Atom is in-order, and SMT probably
> >helps when you have many cache misses. Cache misses in the LaTeX
> >benchmark should be rare.
>
> They certainly are on the machines (IIRC Athlon 64 and Pentium 4)
> where I measured cache misses. Ideally SMT would also help when the
> functional units are not completely utilized even with loads hitting
> the D-cache (which is probably quite frequent on an in-order machine),
> but I don't know if that's the case for the Atom.
>
> In any case, no speedup from SMT is one thing, but a significant
> slowdown is pretty disappointing. Unless you know that you run lots
> of code that benefits from SMT, it's probably better to disable SMT on
> the Atom.

Running 'yes' may be quite L1 unfriendly, depending on the size of the
IO buffer. Perhaps 4 copies of LaTeX would run better.

--
Mvh./Regards, Niels Jørgen Kruse, Vanløse, Denmark
From: Bernd Paysan on
Niels Jørgen Kruse wrote:
> Running 'yes' may be quite L1 unfriendly, depending on the size of the
> IO buffer. Perhaps 4 copies of LaTeX would run better.

Or for Anton, something like

gforth -e ": endless begin again ; endless"

which would just branch in an endless loop (no memory resources used,
cache footprint minimal, just one slot of the branch target prediction).

--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/
From: Chris Gray on
Bernd Paysan <bernd.paysan(a)gmx.de> writes:

> How? Let's assume there is only one memory interface: the pointer
> chasing threads will ask for new memory data as soon as they got their
> previous cache-line, and therefore just compete for the same resource.
> There can be an advantage when there are two memory interfaces, so the
> chance of having them both busy is 50% - the throughput then should go
> up to 150% in SMT mode (or even more with three memory interfaces).
> However, as the Core i7 is already a quad-core, four native threads
> already compete for three memory channels, and therefore, adding SMT
> threads can't possibly help for this kind of stuff.

I didn't do performance stuff on the Tera MTA, but I'm thinking that
from this discussion you could view it as having 128-way SMT, with
only one memory interface. However, that one memory interface could
have an outstanding fetch for each of the 128 threads, so maybe that
means it had 128 memory interfaces for this purpose. Things did
speed up with more threads running. Perhaps the relative costs of
the various activities were so different that the comparison doesn't
work?

Excuse my ignorance here - are today's memory systems limited to one
outstanding fetch per CPU memory interface?
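
My guess is no: non-blocking caches have allowed several misses in
flight per core for quite a while. One way to probe it from user code,
reusing the node ring from the sketch up-thread (the function and the
limit of 16 chains are mine):

/* Chase K independent chains in one thread.  If only one fetch could
   be outstanding, time per step would be flat in K; in practice it
   drops until the core's miss buffers saturate.  Distinct starting
   points on the same ring give independent address streams. */
static void chase_k(struct node **start, int K, long steps)
{
    struct node *p[16];
    for (int k = 0; k < K; k++)
        p[k] = start[k];
    for (long i = 0; i < steps; i++)
        for (int k = 0; k < K; k++)
            p[k] = p[k]->next;          /* up to K misses in flight */
    for (int k = 0; k < K; k++)
        if (!p[k]) puts("unreachable"); /* keep the loops live */
}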

--
Experience should guide us, not rule us.

Chris Gray cg(a)GraySage.COM
http://www.Nalug.ORG/ (Lego)
http://www.GraySage.COM/cg/ (Other)