Performance of SMT on Atom [Computer Architecture]


From: Rick Jones on 9 Nov 2009 13:21

nmm1(a)cam.ac.uk wrote:
> And not just on the Atom. I ran some tests on the Core i7, and got
> a degradation of throughput by using more threads. My limited
> experience is that applies to virtually anything where the
> bottleneck is memory accesses.

By that I presume you mean throughput?

> There MAY be some programs where SMT helps with cache misses, but I
> haven't seen them.

Wouldn't they be alluded to in some of the SPECcpu2006 "rate"
benchmarks published with HT on vs off? The "base" rules require that
all benchmarks run the same number of copies, so loss vs gain may be
obscured, but peak allows different numbers of copies for each
benchmark, so one might see copy number changes from base to peak as
suggesting something about the effectiveness of HT for that benchmark.

rick jones
--
oxymoron n, commuter in a gas-guzzling luxury SUV with an American flag
these opinions are mine, all mine; HP might not want them anyway... :)
feel free to post, OR email to rick.jones2 in hp.com but NOT BOTH...

From: nmm1 on 9 Nov 2009 14:20

In article <hd9mjk$use$3(a)usenet01.boi.hp.com>,
Rick Jones <rick.jones2(a)hp.com> wrote:
>
>> And not just on the Atom. I ran some tests on the Core i7, and got
>> a degradation of throughput by using more threads. My limited
>> experience is that applies to virtually anything where the
>> bottleneck is memory accesses.
>
>By that I presume you mean throughput?

Yes.

>> There MAY be some programs where SMT helps with cache misses, but I
>> haven't seen them.
>
>Wouldn't they be alluded to in some of the SPECcpu2006 "rate"
>benchmarks published with HT on vs off? The "base" rules require that
>all benchmarks run the same number of copies, so loss vs gain may be
>obscured, but peak allows different numbers of copies for each
>benchmark, so one might see copy number changes from base to peak as
>suggesting something about the effectiveness of HT for that benchmark.

Yes. As I said, I haven't had time to study this area in depth.

Regards,
Nick Maclaren.

From: Gavin Scott on 9 Nov 2009 14:27

Anton Ertl <anton(a)mips.complang.tuwien.ac.at> wrote:
> So, at least for this benchmark setup, hyperthreading is a significant
> loss on the Atom.

Just for anecdote, on my dual Nehalem E5530 Dell T7500, if I run a
3D rendering test I see linear speedup going from 1->2->4->8 threads,
then about a 20-25% improvement going from 8->16 threads which seems
pretty good to me.
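
To put those numbers in perspective, here is a minimal sketch of the arithmetic. It assumes that "linear speedup" means ideal scaling up to the 8 physical cores (two quad-core E5530 sockets) and that the 20-25% gain is measured relative to the 8-thread result. The function name `smt_scaling` is made up for illustration.

```python
def smt_scaling(cores: int, smt_gain: float) -> tuple[float, float]:
    """Return (speedup over one thread, per-hardware-thread efficiency).

    Assumes ideal linear scaling up to `cores` threads, then a further
    fractional gain `smt_gain` when running two threads per core.
    """
    speedup = cores * (1.0 + smt_gain)
    efficiency = speedup / (2 * cores)  # relative to 2*cores ideal threads
    return speedup, efficiency


CORES = 8  # dual-socket, quad-core E5530 (assumed from the post)

for gain in (0.20, 0.25):
    speedup, efficiency = smt_scaling(CORES, gain)
    print(f"SMT gain {gain:.0%}: {speedup:.1f}x over one thread, "
          f"{efficiency:.1%} per hardware thread")
```

Under those assumptions the 16-thread run is roughly 9.6x to 10x faster than a single thread, so each hardware thread delivers a bit over 60% of what a thread with a core to itself would.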

I haven't tried physically disabling hyperthreading, so this assumes the
Windows (Vista 64) scheduler doesn't suck completely.

G.

From: Andrew Reilly on 9 Nov 2009 17:11

On Mon, 09 Nov 2009 10:04:13 -0700, James Van Buskirk wrote:

> "Robert Myers" <rbmyersusa(a)gmail.com> wrote in message
> news:1f518c33-55c6-4b2c-8ef9-5229534cfbd2(a)j4g2000yqe.googlegroups.com...
>
>> On Nov 9, 2:52 am, Andrew Reilly <areilly...(a)bigpond.net.au> wrote:
>
>> > [*] We often hear of loads that perform worse with SMT. Are there
>> > equivalent rules of thumb for load classes that *do* show
>> > improvement with HyperThreading turned on?
>
>> The best results I saw for the P4 were as much as a 35% improvement for
>> a chess-playing game. Lots of pointer-chasing?
>
> http://www.mikusite.de/pages/x86.htm
>
> Scroll down to the last table of results, compare Intel Dual Xeon Nocona
> 2800 MHz HT on/off in the FPU speed column: 320.813/177.028 million
> iterations/second.

Closer to the top, though, is a pair of Core i7 920 results at 3200MHz
(admittedly already four cores/socket; I don't know how much this benchmark
uses out-of-cache memory). FPU Mill iter/sec drops from 1869244 to 1820197
when HT is turned on, though SSE performance goes up from 4573828 to 5138498.
That suggests that memory isn't an issue, but that the SSE units
are better at being shared than the traditional FPU?
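
As a rough check on the figures quoted in this exchange, the relative changes can be worked out directly. This is only a sketch of the arithmetic: the numbers are the ones cited from the results page, and the helper name `relative_change` is made up for illustration.

```python
def relative_change(before: float, after: float) -> float:
    """Return the fractional change going from `before` to `after`."""
    return (after - before) / before


# Core i7 920 at 3200 MHz, HT off -> HT on (figures as quoted above)
fpu_off, fpu_on = 1869244, 1820197
sse_off, sse_on = 4573828, 5138498

# Dual Xeon 2800 MHz, FPU speed column, quoted as HT on/off
xeon_on, xeon_off = 320.813, 177.028

fpu_change = relative_change(fpu_off, fpu_on)
sse_change = relative_change(sse_off, sse_on)
xeon_change = relative_change(xeon_off, xeon_on)

print(f"Core i7 FPU with HT on:   {fpu_change:+.1%}")
print(f"Core i7 SSE with HT on:   {sse_change:+.1%}")
print(f"Dual Xeon FPU with HT on: {xeon_change:+.1%}")
```

So HT costs the Core i7 under 3% on the FPU test and gains it about 12% on the SSE test, while the older dual Xeon gains about 81% on its FPU test.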

The page is a bit of a blog, with new items at the top. The figures
you've quoted are from July 2006.

Mandelbrot calculation is a benchmark of fairly limited predictive power,
IMO. :-)

Cheers,

--
Andrew

From: James Van Buskirk on 9 Nov 2009 23:50

"Andrew Reilly" <areilly---(a)bigpond.net.au> wrote in message
news:7lricgF3f2h5gU1(a)mid.individual.net...

> On Mon, 09 Nov 2009 10:04:13 -0700, James Van Buskirk wrote:

>> http://www.mikusite.de/pages/x86.htm

>> Scroll down to the last table of results, compare Intel Dual Xeon Nocona
>> 2800 MHz HT on/off in the FPU speed column: 320.813/177.028 million
>> iterations/second.

> Closer to the top, though, is a pair of Core i7 920 results at 3200MHz
> (admittedly already four cores/socket; I don't know how much this benchmark
> uses out-of-cache memory). FPU Mill iter/sec drops from 1869244 to 1820197
> when HT is turned on, though SSE performance goes up from 4573828 to 5138498.
> That suggests that memory isn't an issue, but that the SSE units
> are better at being shared than the traditional FPU?

Actually the benchmark uses no memory. The table closer to the top
is a different benchmark that tries harder to saturate the FPU than
the earliest versions. It doesn't follow that the FPU is saturated
yet because there may be some sequence that prevents the CPU from
reordering instructions (when the CPU is a Core i7).

--
write(*,*) transfer((/17.392111325966148d0,6.5794487871554595D-85, &
6.0134700243160014d-154/),(/'x'/)); end
