From: Anton Ertl on
We recently got a Zotac IONATX A board with a 1600MHz Atom N330 CPU, which
supports SMT (or "hyperthreading" in Intel's marketingspeak).

We tested it using our LaTeX benchmark
<http://www.complang.tuwien.ac.at/anton/latex-bench/>. It runs in
2.3s-2.4s (in 32-bit mode), about the same speed as a 900MHz Athlon,
a little faster than a 1066MHz PPC 7447A, and about 5 times slower
than a 3GHz Core 2 Duo.

Then we tested the performance when other processes were running.
With 4 hardware threads (two cores with two threads each), we ran three
processes doing "yes >/dev/null" and one process running our LaTeX
benchmark. The results varied, but we saw user times of 5.5s and 6s
for the LaTeX benchmark.

Just for comparison, we turned off hyperthreading in the BIOS, and ran
the same setup again (i.e., 3 yes processes and one latex process).
This time we saw 2.3s-2.4s user time for the latex benchmark and 4.7s
real time for the latex benchmark.

So, at least for this benchmark setup, hyperthreading is a significant
loss on the Atom.

- anton
--
M. Anton Ertl Some things have to be seen to be believed
anton(a)mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html
From: Bernd Paysan on
Anton Ertl wrote:
> So, at least for this benchmark setup, hyperthreading is a significant
> loss on the Atom.

Probably not a real surprise. The Atom is in-order, and SMT probably
helps when you have many cache misses. Cache misses in the LaTeX
benchmark should be rare.

--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/
From: Anton Ertl on
Bernd Paysan <bernd.paysan(a)gmx.de> writes:
>Anton Ertl wrote:
>> So, at least for this benchmark setup, hyperthreading is a significant
>> loss on the Atom.
>
>Probably not a real surprise. The Atom is in-order, and SMT probably
>helps when you have many cache misses. Cache misses in the LaTeX
>benchmark should be rare.

They certainly are on the machines (IIRC Athlon 64 and Pentium 4)
where I measured cache misses. Ideally SMT would also help when the
functional units are not completely utilized even with loads hitting
the D-cache (which is probably quite frequent on an in-order machine),
but I don't know if that's the case for the Atom.

In any case, no speedup from SMT is one thing, but a significant
slowdown is pretty disappointing. Unless you know that you run lots
of code that benefits from SMT, it's probably better to disable SMT on
the Atom.

- anton
--
M. Anton Ertl Some things have to be seen to be believed
anton(a)mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html
From: nmm1 on
In article <2009Nov8.192936(a)mips.complang.tuwien.ac.at>,
Anton Ertl <anton(a)mips.complang.tuwien.ac.at> wrote:
>Bernd Paysan <bernd.paysan(a)gmx.de> writes:
>>Anton Ertl wrote:
>>> So, at least for this benchmark setup, hyperthreading is a significant
>>> loss on the Atom.
>>
>>Probably not a real surprise. The Atom is in-order, and SMT probably
>>helps when you have many cache misses. Cache misses in the LaTeX
>>benchmark should be rare.
>
>They certainly are on the machines (IIRC Athlon 64 and Pentium 4)
>where I measured cache misses. Ideally SMT would also help when the
>functional units are not completely utilized even with loads hitting
>the D-cache (which is probably quite frequent on an in-order machine),
>but I don't know if that's the case for the Atom.
>
>In any case, no speedup from SMT is one thing, but a significant
>slowdown is pretty disappointing. Unless you know that you run lots
>of code that benefits from SMT, it's probably better to disable SMT on
>the Atom.

And not just on the Atom. I ran some tests on the Core i7, and got
a degradation of throughput by using more threads. My limited
experience is that this applies to virtually anything where the bottleneck
is memory accesses. There MAY be some programs where SMT helps with
cache misses, but I haven't seen them.

Where I think that it helps is with heterogeneous process mixtures;
e.g. one is heavy on floating-point, another on memory accesses, and
another on branching. I could be wrong, as that's based on as much
guesswork as knowledge, but it matches what I know.


Regards,
Nick Maclaren.
From: "Andy "Krazy" Glew" on
nmm1(a)cam.ac.uk wrote:
> In article <2009Nov8.192936(a)mips.complang.tuwien.ac.at>,
> Anton Ertl <anton(a)mips.complang.tuwien.ac.at> wrote:
>> Bernd Paysan <bernd.paysan(a)gmx.de> writes:
>>> Anton Ertl wrote:
>>>> So, at least for this benchmark setup, hyperthreading is a significant
>>>> loss on the Atom.
>>> Probably not a real surprise. The Atom is in-order, and SMT probably
>>> helps when you have many cache misses. Cache misses in the LaTeX
>>> benchmark should be rare.
>> They certainly are on the machines (IIRC Athlon 64 and Pentium 4)
>> where I measured cache misses. Ideally SMT would also help when the
>> functional units are not completely utilized even with loads hitting
>> the D-cache (which is probably quite frequent on an in-order machine),
>> but I don't know if that's the case for the Atom.
>>
>> In any case, no speedup from SMT is one thing, but a significant
>> slowdown is pretty disappointing. Unless you know that you run lots
>> of code that benefits from SMT, it's probably better to disable SMT on
>> the Atom.
>
> And not just on the Atom. I ran some tests on the Core i7, and got
> a degradation of throughput by using more threads. My limited
> experience is that this applies to virtually anything where the bottleneck
> is memory accesses. There MAY be some programs where SMT helps with
> cache misses, but I haven't seen them.
>
> Where I think that it helps is with heterogeneous process mixtures;
> e.g. one is heavy on floating-point, another on memory accesses, and
> another on branching. I could be wrong, as that's based on as much
> guesswork as knowledge, but it matches what I know.



This is interesting. What Nick says about heterogeneous workloads is certainly
true - e.g. a compute intensive, non-cache-missing thread to switch to
when a memory intensive thread cache misses.
(Or, rather, one that is always running, and which keeps running when the
memory intensive thread cache misses.)

However, in theory two memory intensive threads should be able to coexist
- each computing while the other is stalled on memory. E.g. two cache-missing
pointer-chasing threads should be able to practically double throughput.
(I've usually been on the other side of this argument, since as comp.arch
knows I am the leading exponent of single threaded MLP architectures.
My opponents in industry would usually say "Can't you just get MLP from TLP?"
and I would have to say "Yes, but...".)

That so many people find threading a lossage for memory intensive workloads
(and it is not just these comp.arch posters - most people in the supercomputer
community disable hyperthreading) implies

a) workloads that are already highly MLP, e.g. throughput limited workloads

b) lousy threading microarchitectures. Which is typical - so many Intel processors
arbitrarily split the instruction window in half, giving half to the compute intensive
thread, which does not need the window, and only half to the cache missing thread,
which could use more.

c) contention between threads - e.g. thrashing out of useful D$ state.

It's ironic: take a long-latency L3 cache miss to DRAM, and the chances of further
such misses increase - because the other threads, which may only be taking L1 misses
to L2, are thrashing your state out of the caches. Positive feedback.