From: Anton Ertl
We recently got a new server that uses the Xeon X3460 (2.8GHz
Lynnfield, like a Core i7-860). We use that with an Intel S3420GPLC
server board. The other components relevant for power consumption are
one 2GB DDR3 ECC DIMM (the working configuration will have two 4GB
DIMMs, but that's only going to add a few W of power consumption), two
spinning 3.5" SATA hard disks, an idle DVD-ROM, and a Corsair
CMPSU-400CX 80+ power supply.

Turbo Boost was enabled, but apparently did not work in our setup
(Debian Lenny, Linux 2.6.26): our LaTeX benchmark ran at the same
speed at all loads up to 4.

We also left SMT ("hyperthreading") enabled.

The power consumption of the whole box at different loads (generated
with "yes >/dev/null") is:

load       0    1    2    3    4    5    6    7    8
2800MHz  53W  80W  96W 118W 138W 145W 149W 151W 155W
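
For reproduction purposes, here is a minimal C stand-in for one unit
of this load (a hypothetical sketch; the measurements above used
"yes >/dev/null", which also performs write() system calls):

/* spin.c: one unit of CPU load, i.e. one busy process.
   Start n copies to produce load n.  The volatile counter keeps
   the compiler from optimizing the loop away. */
int main(void)
{
    volatile unsigned long counter = 0;

    for (;;)
        counter++;
}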

It's interesting that using SMT contexts (at loads 5-8) costs
additional power; not as much as running a core, but a measurable
amount.

The other remarkable thing is that the idle power is pretty low
compared to other single-socket servers we have: 103W for a system
with a Xeon 3070 (2.66GHz Core 2 Duo-like), 83W for a system with an
Athlon 64 X2 4400+.

How does SMT affect performance? We varied the number of running
"yes" processes and measured our LaTeX benchmark
<http://www.complang.tuwien.ac.at/anton/latex-bench/>. When we
started the LaTeX benchmark concurrently with 0 or 3 yes processes, we
saw user (and real) times of around 0.484s. When we ran it
concurrently with 4 or 7 yes processes, we saw user (and real) times
of around 0.756s. I.e., we get a slowdown by a factor <1.6
(0.756/0.484 ~= 1.56), whereas without SMT we would have seen a
real-time slowdown by a factor of 2 (but no change in user time). So
SMT gives a significant benefit for this setup. On the Atom the same
setup resulted in a slowdown by a factor >2.3
<2009Sep8.131554(a)mips.complang.tuwien.ac.at>, so there SMT is a
disadvantage for this setup.

As usual, this is just a single data point and your setup will be
different, so YMMV.

- anton
--
M. Anton Ertl Some things have to be seen to be believed
anton(a)mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html
From: Bengt Larsson
anton(a)mips.complang.tuwien.ac.at (Anton Ertl) wrote:

>How does SMT affect performance? We varied the number of running
>"yes" processes and measured our LaTeX benchmark
><http://www.complang.tuwien.ac.at/anton/latex-bench/>. When we
>started the LaTeX benchmark concurrently with 0 or 3 yes processes, we
>saw user (and real) times of around 0.484s. When we ran it
>concurrently with 4 or 7 yes processes, we saw user (and real) times
>of around 0.756s. I.e., we get a slowdown by a factor <1.6
>(0.756/0.484 ~= 1.56), whereas without SMT we would have seen a
>real-time slowdown by a factor of 2 (but no change in user time). So
>SMT gives a significant benefit for this setup. On the Atom the same
>setup resulted in a slowdown by a factor >2.3
><2009Sep8.131554(a)mips.complang.tuwien.ac.at>, so there SMT is a
>disadvantage for this setup.

I don't think "yes >/dev/null" is a good way to test SMT. It's a
process that entirely hogs the CPU, but only the integer units. A
normal process accesses memory and the second-level cache now and
then.

I have an Atom, and I tested with a parallel make (of an editor,
mg2a, in C). With all the files in memory, the make takes 14.4
seconds; with make -j (make -j 3 or 4 seems the most efficient) it
takes 10.7 seconds. That is an improvement of 30-35 percent.

In fact, even for an integer-hogging process you will get different
results depending on how many execution units can be used in
parallel, or which execution units.
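
As an illustration (a hypothetical microbenchmark, not something
measured here): two kernels with the same number of integer adds, but
different amounts of instruction-level parallelism, leave different
headroom for the sibling SMT thread:

/* Compile at -O0 or check the assembly: an optimizer may fold
   either loop into a closed-form result. */

/* One dependent chain: at most one add completes per cycle, so
   an ALU is left free for the sibling SMT thread. */
unsigned long dep_chain(unsigned long n)
{
    unsigned long a = 0;
    while (n--)
        a += 1;              /* each add depends on the previous one */
    return a;
}

/* Two independent chains: both ALUs can be busy every cycle,
   leaving less headroom for the sibling thread. */
unsigned long indep_chains(unsigned long n)
{
    unsigned long a = 0, b = 0;
    while (n--) {
        a += 1;              /* these adds are independent and */
        b += 3;              /* can issue in the same cycle */
    }
    return a + b;
}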

SMT has its uses, but it makes measurement hard. I love that the Atom
has SMT, though; that was a brilliant decision. For example, when
watching YouTube, there are two main threads using the CPU, and 2-3
small ones, so the multithreading really should help. Unfortunately I
can't turn off the SMT, so I can't test that. (This is on an Acer
Aspire One.)
From: Bengt Larsson
Bengt Larsson <bengtl8.net(a)telia.NOSPAMcom> wrote:

>I have an Atom, and I tested with a parallel make (of an editor,
>mg2a, in C). With all the files in memory, the make takes 14.4
>seconds; with make -j (make -j 3 or 4 seems the most efficient) it
>takes 10.7 seconds. That is an improvement of 30-35 percent.

Actually that is a bit stupid, since it improves beyond 2 threads.
With two threads, I get 11.3 seconds, an improvement of 27%.
From: Bengt Larsson
Here is a fun test:

int i, limit = 100000000;   /* iteration count, chosen for timing */
double f = 0.0;             /* use f afterwards (e.g. print it), or
                               the compiler may drop the loop */

for (i = 0; i < limit; i++) {
    f += 1.0;
}

Improvement in throughput with two threads = 100%!

The floating-point add should take 5 cycles on the Atom, and indeed
the loop runs very close to 5 cycles per iteration: each add waits on
the previous one, so most issue slots are idle, and a second SMT
thread can fill them, which is why throughput doubles. This also
illustrates the "limited slip" in the Atom, where integer
instructions can bypass long-running floating-point ones.
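
For completeness, a runnable two-thread version of the test could
look like this (a sketch using POSIX threads; the file name, helper
name, and iteration count are made up for illustration):

/* fpadd.c: run the dependent FP-add loop in each of two threads.
   Build: cc -O2 -pthread fpadd.c
   (no -ffast-math, so the compiler keeps the dependent adds). */
#include <pthread.h>
#include <stdio.h>

#define LIMIT 500000000L     /* illustrative iteration count */

static void *fpadd(void *arg)
{
    double f = 0.0;
    long i;

    for (i = 0; i < LIMIT; i++)
        f += 1.0;            /* dependent adds, one per FP-add latency */
    *(double *)arg = f;      /* publish the result to keep the loop */
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    double r1, r2;

    pthread_create(&t1, NULL, fpadd, &r1);
    pthread_create(&t2, NULL, fpadd, &r2);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("%.0f %.0f\n", r1, r2);
    return 0;
}

If the two-thread run takes about as long as a single-thread run,
throughput has doubled, matching the 100% figure above.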
From: nedbrek
Hello all,

"Bengt Larsson" <bengtl8.net(a)telia.NOSPAMcom> wrote in message
news:a9kqi5tp99eana4uoc2r9d0l998gpuu21g(a)4ax.com...
> Bengt Larsson <bengtl8.net(a)telia.NOSPAMcom> wrote:
>
>>I have an Atom, and I tested with a parallel make (of an editor,
>>mg2a, in C). With all the files in memory, the make takes 14.4
>>seconds; with make -j (make -j 3 or 4 seems the most efficient) it
>>takes 10.7 seconds. That is an improvement of 30-35 percent.
>
> Actually that is a bit stupid, since it improves beyond 2 threads.
> With two threads, I get 11.3 seconds, an improvement of 27%.

I usually do a "make -j N", where N = cores * 1.5 or 2. Compiling
often gets stuck on disk (even if the source is in memory, and the
final output is in memory [ramdisk?], are all the temporary outputs
in memory? what about statically linked libs?).

Ned