From: Torben Ægidius Mogensen on
"nedbrek" <nedbrek(a)yahoo.com> writes:


> The ironic thing (which we demonstrated, and which made us hugely
> unpopular) is that massive many-core burns just as much power as (or more
> than) a smart OOO on anything but grossly parallel applications.

That is assuming you are actually powering those that you don't actively
use. A many-core design should be able to power down individual cores
and power them up very quickly when needed.

Torben
From: nmm1 on
In article <7ztyq6nbt8.fsf(a)ask.diku.dk>,
Torben Ægidius Mogensen <torbenm(a)diku.dk> wrote:
>"nedbrek" <nedbrek(a)yahoo.com> writes:
>
>> The ironic thing, (which we demonstrated, and which made us hugely
>> unpopular) is that massive many-core burns just as much (or more) power than
>> a smart OOO on anything but grossly parallel applications.
>
>That is assuming you are actually powering those that you don't actively
>use. A many-core design should be able to power down individual cores
>and power them up very quickly when needed.

Indeed. And, unlike the other forms of power-saving gimmickry,
that is simple to implement and debug, and does not interfere
with tuning.


Regards,
Nick Maclaren.
From: nedbrek on
Hello all,

"Torben Ægidius Mogensen" <torbenm(a)diku.dk> wrote in message
news:7ztyq6nbt8.fsf(a)ask.diku.dk...
> "nedbrek" <nedbrek(a)yahoo.com> writes:
>> The ironic thing, (which we demonstrated, and which made us hugely
>> unpopular) is that massive many-core burns just as much (or more) power
>> than
>> a smart OOO on anything but grossly parallel applications.
>
> That is assuming you are actually powering those that you don't actively
> use. A many-core design should be able to power down individual cores
> and power them up very quickly when needed.

No, I'm talking about multi-threaded workloads.

MT advocates always talk about core power. Going multi-core can be a win
for core power.

However, supporting a huge number of cores will require more overhead than
supporting fewer (more power-hungry, even less power-efficient) cores.

The problem then becomes one of evaluating the overhead of a more
complicated core, versus the overhead of the support logic for your many
cores.

For example:
1) Even MT friendly jobs rarely scale linearly with number of cores. You
might get as high as 80 or 90% (2 cores give 1.9x). This might be enough to
justify more complicated cores given the next points.
2) More cores use more bandwidth. Bandwidth can be expensive (additional
memory controllers and RAM chips all burn lots of power). You can think of
OOO as a technique to get more performance per memory access.
3) Off chip bandwidth costs pins. Pins are expensive in themselves (and
limited). They also burn lots of power.
4) More cores need a more complicated system. You have more directory
structure, more switches, etc.
5) Often the static power can dominate (spinning hard drives, LCD power - in
addition to chip static power). If latency goes up, you must pay the static
power cost longer (optimal strategy is almost always "go fast, then sleep"
rather than "slow and steady").
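Point 5 can be made concrete with a toy energy model. This is only a sketch:
the f^2 dynamic-energy scaling (voltage tracking frequency), the 10-second
deadline, and all power numbers are assumptions for illustration, not
measurements of any real chip.

```python
# "Race to idle" vs "slow and steady" under platform static power.
def total_energy(freq_ghz, work_gops, p_static_w, k_dyn=1.0, p_sleep_w=0.5):
    """Energy (joules) to finish `work_gops` operations at `freq_ghz`,
    then sleep for the remainder of a fixed 10-second deadline."""
    deadline_s = 10.0
    run_time_s = work_gops / freq_ghz        # seconds of active work
    assert run_time_s <= deadline_s
    p_dyn_w = k_dyn * freq_ghz ** 2          # dynamic power ~ f * V^2, V ~ f
    active_j = (p_dyn_w + p_static_w) * run_time_s
    sleep_j = p_sleep_w * (deadline_s - run_time_s)
    return active_j + sleep_j

work = 10.0  # G-operations to finish within the deadline
for p_static in (0.5, 5.0, 20.0):
    fast = total_energy(2.0, work, p_static)  # go fast, then sleep
    slow = total_energy(1.0, work, p_static)  # stretch work to the deadline
    print(f"static={p_static:5.1f} W  fast={fast:6.1f} J  slow={slow:6.1f} J")
```

With these assumed numbers, "slow and steady" wins only when static power is
small; once drives, display, and leakage dominate, finishing quickly and
sleeping comes out ahead, which is the point being made.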

You can't just say, "MT workloads demand lots of tiny cores". For a given
power budget, every workload will need to be analyzed to find the right
trade-off.

Ned


From: nmm1 on
In article <hsr6qe$i8f$1(a)news.eternal-september.org>,
nedbrek <nedbrek(a)yahoo.com> wrote:
>"Torben Ægidius Mogensen" <torbenm(a)diku.dk> wrote in message
>news:7ztyq6nbt8.fsf(a)ask.diku.dk...
>>
>>> The ironic thing, (which we demonstrated, and which made us hugely
>>> unpopular) is that massive many-core burns just as much (or more) power
>>> than a smart OOO on anything but grossly parallel applications.
>>
>> That is assuming you are actually powering those that you don't actively
>> use. A many-core design should be able to power down individual cores
>> and power them up very quickly when needed.
>
>No, I'm talking about multi-threaded workloads.
>
>MT advocates always talk about core power. Going multi-core can be a win
>for core power.
>
>However, supporting a huge number of cores will require more overhead than
>supporting fewer (more power-hungry, even less power-efficient) cores.

With reservations, agreed.

>The problem then becomes one of evaluating the overhead of a more
>complicated core, versus the overhead of the support logic for your many
>cores.

That is true.

>For example:
>1) Even MT friendly jobs rarely scale linearly with number of cores. You
>might get as high as 80 or 90% (2 cores give 1.9x). This might be enough to
>justify more complicated cores given the next points.

A lot of such jobs scale fairly well - say, sqrt(N) or better. But even
if a lot of them don't do that well, the comparison is misleading.

The performance VERY rarely scales well with the complexity of cores,
and exceeding log(N) is rare. Oh, yes, there are occasional jobs where
you can remove a bottleneck, but that's nothing to do with scalability
(i.e. it's a one-off).
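The contrast between the two regimes is easy to see numerically. The sqrt(N)
and log(N) exponents below are taken from the post above; everything else is
an illustrative assumption, not measured data.

```python
import math

def many_core_speedup(n_cores):
    """Throughput from N simple cores, assumed to scale as sqrt(N)."""
    return math.sqrt(n_cores)

def complex_core_speedup(budget):
    """Throughput from one core made `budget` times more complex,
    assumed to scale as log2(N)."""
    return max(1.0, math.log2(budget))

for n in (64, 256, 1024):
    print(f"N={n:4d}  many-core {many_core_speedup(n):5.1f}x  "
          f"one complex core {complex_core_speedup(n):5.1f}x")
```

Under these assumed exponents the curves diverge quickly: at N=1024, spending
the budget on cores gives 32x while spending it on complexity gives 10x.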

>2) More cores use more bandwidth. Bandwidth can be expensive (additional
>memory controllers and RAM chips all burn lots of power). You can think of
>OOO as a technique to get more performance per memory access.

Sorry, but that is NOT true. X performance on Y cores needs precisely
the same bandwidth as XY performance on a single core, all other factors
being the same. You are correct that some attempts at multithreading
serial code increase the bandwidth requirement, but that's an artifact
of the current approaches, not a general rule.
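The equal-bandwidth claim is just arithmetic: if each operation moves a fixed
number of bytes, bandwidth demand tracks aggregate throughput, not the core
count. A minimal sketch (the 8 bytes/op figure and the throughput numbers are
arbitrary assumptions):

```python
BYTES_PER_OP = 8  # assumed fixed data moved per operation

def required_bw(cores, ops_per_sec_per_core):
    """Memory bandwidth (bytes/s) needed to sustain the given throughput."""
    return cores * ops_per_sec_per_core * BYTES_PER_OP

one_fast = required_bw(1, 4e9)     # one 4 Gop/s core
many_slow = required_bw(8, 0.5e9)  # eight 0.5 Gop/s cores, same aggregate
print(one_fast == many_slow)       # identical bandwidth demand
```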

>3) Off chip bandwidth costs pins. Pins are expensive in themselves (and
>limited). They also burn lots of power.

True. But that's irrelevant to whether the chip has lots of slow
cores or one fast one.

>4) More cores need a more complicated system. You have more directory
>structure, more switches, etc.

That's not true. There are many designs which don't, and some have
been very successful.

>5) Often the static power can dominate (spinning hard drives, LCD power - in
>addition to chip static power). If latency goes up, you must pay the static
>power cost longer (optimal strategy is almost always "go fast, then sleep"
>rather than "slow and steady").

True. But that's irrelevant to whether the chip has lots of slow
cores or one fast one.

>You can't just say, "MT workloads demand lots of tiny cores". For a given
>power budget, every workload will need to be analyzed to find the right
>trade-off.

True. But the days of a special computer for every job had gone
before I got into the game - and that's a LONG time back!


Regards,
Nick Maclaren.
From: Andrew Reilly on
On Mon, 17 May 2010 06:46:18 -0500, nedbrek wrote:

> optimal strategy is almost always "go fast, then sleep" rather than
> "slow and steady"

Is that what did in Transmeta? Quite a bit of their literature seemed to
be about being able to tune the runtime and clock rate towards "slow and
steady": spread the smallest number of clocks across the real-time budget
by reducing the rate and voltage accordingly.

Perhaps that works for media playback, but is not representative of the
loads that power/battery users really care about?

[I had a Fujitsu/Crusoe laptop for a while: it was nice, and it would
play DVDs with little power, but it wasn't "fast" for Windows UI feel.]

Cheers,

--
Andrew