From: Robert Myers on
Anne & Lynn Wheeler wrote:
> Robert Myers <rbmyersusa(a)gmail.com> writes:
>> I had thought the idea of having lots of threads was precisely to get
>> the memory requests out. You start a thread, get some memory requests
>> out, and let it stall, because it's going to stall, anyway.
>>
>> Cache size and bandwidth and memory bandwidth are another matter.
>
> in the mid-70s, there was a multithreaded project for the 370/195
> (that never shipped). The 370/195 had a 64-instruction pipeline, but
> no branch prediction or speculative execution ... so common branches
> stalled the pipeline. Highly tuned codes with some kinds of looping
> branches within the pipeline could have peak thruput of 10 MIPS ...
> however, branch stalls in most code tended to hold thruput to 5 MIPS.
>
> the objective of the emulated two-processor (double registers,
> instruction address, etc ... but no additional pipeline or execution
> units) was to compensate for branch stalls (i.e. instructions,
> operations, and resources in the pipeline would carry a one-bit flag
> indicating which instruction stream they were associated with).
> Having a pair of instruction streams running normal code (each
> peaking at 5 MIPS thruput) then had a chance of effectively
> utilizing/saturating the available 195 resources (10 MIPS aggregate).

This logic always made sense to me, but Nick claims it doesn't work. If
it doesn't work, it has to be because of pressure on the cache or
because the thread that stalls is holding a lock that the other thread
needs.
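
Here is the kind of toy arithmetic I have in mind. It is purely my
own sketch: the branch frequency and stall length are invented for
illustration and have nothing to do with the real 195 figures. With
one stream, branch stalls leave issue slots empty; with two tagged
streams, the second stream fills most of them.

/* Toy model of branch-stall hiding with two interleaved instruction
 * streams sharing one pipeline.  All parameters are invented for
 * illustration; they are not the real 370/195 numbers. */
#include <stdio.h>

#define CYCLES       1000000L
#define BRANCH_EVERY 8   /* assume a branch every 8 instructions        */
#define STALL_CYCLES 6   /* assume each branch stalls its stream so long */

struct stream { int until_branch; int stalled_for; };

static long run(int nstreams)
{
    struct stream s[2] = { { BRANCH_EVERY, 0 }, { BRANCH_EVERY, 0 } };
    long total = 0;

    for (long c = 0; c < CYCLES; c++) {
        int slot_used = 0;
        for (int i = 0; i < nstreams; i++) {
            if (s[i].stalled_for > 0) {
                s[i].stalled_for--;     /* stalls tick down every cycle */
            } else if (!slot_used) {
                /* one issue slot per cycle: the first ready stream
                 * gets it (this is the one-bit stream tag idea) */
                slot_used = 1;
                total++;
                if (--s[i].until_branch == 0) {  /* hit a branch: stall */
                    s[i].until_branch = BRANCH_EVERY;
                    s[i].stalled_for = STALL_CYCLES;
                }
            }
        }
    }
    return total;
}

int main(void)
{
    printf("1 stream : %ld instructions in %ld cycles\n", run(1), CYCLES);
    printf("2 streams: %ld instructions in %ld cycles\n", run(2), CYCLES);
    return 0;
}

With those made-up numbers a single stream issues 8 instructions per
14 cycles (about 0.57 per cycle), and two streams together get
noticeably closer to the one-per-cycle limit, which is the 5 MIPS
versus 10 MIPS picture above.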

Robert.
From: nmm1 on
In article <V_%zn.25199$Db6.3878(a)newsfe05.iad>,
Robert Myers <rbmyersusa(a)gmail.com> wrote:
>Anne & Lynn Wheeler wrote:
>> Robert Myers <rbmyersusa(a)gmail.com> writes:
>>> I had thought the idea of having lots of threads was precisely to get
>>> the memory requests out. You start a thread, get some memory requests
>>> out, and let it stall, because it's going to stall, anyway.
>>>
>>> Cache size and bandwidth and memory bandwidth are another matter.
>>
>> in the mid-70s, there was a multithreaded project for the 370/195
>> (that never shipped). The 370/195 had a 64-instruction pipeline, but
>> no branch prediction or speculative execution ... so common branches
>> stalled the pipeline. Highly tuned codes with some kinds of looping
>> branches within the pipeline could have peak thruput of 10 MIPS ...
>> however, branch stalls in most code tended to hold thruput to 5 MIPS.
>>
>> the objective of the emulated two-processor (double registers,
>> instruction address, etc ... but no additional pipeline or execution
>> units) was to compensate for branch stalls (i.e. instructions,
>> operations, and resources in the pipeline would carry a one-bit flag
>> indicating which instruction stream they were associated with).
>> Having a pair of instruction streams running normal code (each
>> peaking at 5 MIPS thruput) then had a chance of effectively
>> utilizing/saturating the available 195 resources (10 MIPS aggregate).
>
>This logic always made sense to me, but Nick claims it doesn't work. If
>it doesn't work, it has to be because of pressure on the cache or
>because the thread that stalls is holding a lock that the other thread
>needs.

Not quite. I have never claimed that it is without effect, merely
that the effect isn't what is claimed! I omitted a paragraph where
I said that there WAS a time when the technique would have worked,
but it wasn't when it was used.

Back in the 1970s, computational units were a scarce resource in
CPU design, and the thing that the SMT approach does make better
use of is computational units. So it would have worked then, as
it would in the 1980s on microprocessors (when, again, computational
units were a scarce resource, because of limited transistor count).

However, by the year 2000, and even in the 1990s, they were NOT a
scarce resource any longer, and the limits were invariably memory
and cache bandwidth, transaction rate and conflict resolution.
How would they help with that?

Well, as the Tera MTA showed, they could - in a machine designed
for that purpose. But in what we now know as a general-purpose
CPU?

To a first approximation, two threads or two cores have the same
memory and cache requirements, so they don't do any better than
multiple cores there. They still make better use of computational
units, but at the expense of some extra logic and less performance
compared to multi-core designs. How much?

Well, when I looked at the papers, their efficiency was good for
2-way threading, but dropped off badly for 4-way and was definitely
poor for 8-way. And that was analysing the simple, clean MIPS
architecture - even done well, x86 would not have been as good.

So a much better, more scalable, design is to forget about threading
and simply go for more cores. Notice that even Intel has never
delivered a CPU with more than 2-way threading, and there are a
lot of people who say the route to performance is to disable even
that.

To put it another way, they are a solution to a problem of the
1970s and 1980s, not to one of the 1990s and later.
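
To put crude numbers on it (every figure below is picked out of the
air purely for illustration): once the memory system rather than the
execution units is the binding limit, adding streams stops helping,
whether they are threads or cores.

/* Back-of-the-envelope model of why extra hardware threads stop
 * paying off once memory bandwidth, not the execution units, is the
 * limit.  Every figure below is invented for illustration. */
#include <stdio.h>

static double min2(double a, double b) { return a < b ? a : b; }

int main(void)
{
    double per_thread_demand = 0.5; /* instr/cycle one thread sustains alone  */
    double issue_limit       = 2.0; /* instr/cycle the execution units allow  */
    double memory_limit      = 1.2; /* instr/cycle the memory system can feed */

    for (int threads = 1; threads <= 8; threads *= 2) {
        double throughput = min2(per_thread_demand * threads,
                                 min2(issue_limit, memory_limit));
        printf("%d thread(s): %.2f instructions/cycle\n", threads, throughput);
    }
    return 0;
}

With those numbers the second thread roughly doubles throughput, but
the fourth and eighth buy nothing, because the memory limit has
already been reached; replacing the threads with cores changes none
of that.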


Regards,
Nick Maclaren.
From: Robert Myers on
nmm1(a)cam.ac.uk wrote:
> In article <V_%zn.25199$Db6.3878(a)newsfe05.iad>,

>
> Not quite. I have never claimed that it is without effect, merely
> that the effect isn't what is claimed! I omitted a paragraph where
> I said that there WAS a time when the technique would have worked,
> but it wasn't when it was used.
>
> Back in the 1970s, computational units were a scarce resource in
> CPU design, and the thing that the SMT approach does make better
> use of is computational units. So it would have worked then, as
> it would in the 1980s on microprocessors (when, again, computational
> units were a scarce resource, because of limited transistor count).
>
> However, by the year 2000, and even in the 1990s, they were NOT a
> scarce resource any longer, and the limits were invariably memory
> and cache bandwidth, transaction rate and conflict resolution.
> How would they help with that?
>
> Well, as the Tera MTA showed, they could - in a machine designed
> for that purpose. But in what we now know as a general-purpose
> CPU?
>
> To a first approximation, two threads or two cores have the same
> memory and cache requirements, so they don't do any better than
> multiple cores there. They still make better use of computational
> units, but at the expense of some extra logic and less performance
> compared to multi-core designs. How much?
>
> Well, when I looked at the papers, their efficiency was good for
> 2-way threading, but dropped off badly for 4-way and was definitely
> poor for 8-way. And that was analysing the simple, clean MIPS
> architecture - even done well, x86 would not have been as good.
>
> So a much better, more scalable, design is to forget about threading
> and simply go for more cores. Notice that even Intel has never
> delivered a CPU with more than 2-way threading, and there are a
> lot of people who say the route to performance is to disable even
> that.
>
> To put it another way, they are a solution to a problem of the
> 1970s and 1980s, not to one of the 1990s and later.
>

I think we've been through the "computational resources are no longer
scarce" discussion wrt hyperthreading in this forum.

But suppose the scarce resources aren't computational resources but
other things, like L1 and L2 cache and watts. If you add more cores,
you need more of both.

I think that, with proper cache management, letting the thread that
*can* advance trash L1 and perhaps even L2 makes more sense than
duplicating expensive cache that will sit idle on a separate core.
I'm in *way* over my head here.

As to the 2-threads vs. many-threads argument, I suspect that I agree
with you, but that's based purely on seeing the point of diminishing
returns with hyperthreading and by the fact that a factor of two seems
just about right for core overloading.

Robert.
From: nmm1 on
In article <781An.11623$0_7.8171(a)newsfe25.iad>,
Robert Myers <rbmyersusa(a)gmail.com> wrote:
>
>I think we've been through the "computational resources are no longer
>scarce" discussion wrt hyperthreading in this forum.

Yes.

>But suppose the scarce resources aren't computational resources but
>other things, like L1 and L2 cache and watts. If you add more cores,
>you need more of both.

Right. But see below.

>I think that, with proper cache management, letting the thread that
>*can* advance trash L1 and perhaps even L2 makes more sense than
>duplicating expensive cache that will sit idle on a separate core.
>I'm in *way* over my head here.

That is certainly true, but we should compare a dual-threaded system
with a dual-core one that shares at least level 2 cache. No gain there.

The question is whether the duplication and synchronisation of level 1
cache costs more than the register set juggling needed to run the two
threads. No, I can't answer that any more than you can, but it looks
as if it is pretty well balanced.

So, on the above basis, it's purely a matter of taste. But now let's
consider performance registers and tunability - threading more-or-less
sacrifices those, which in turn lowers the efficiency of the system,
because the applications end up less well tuned. Well, somewhat.

I don't think that CPU threading is completely insane, but it's not a
solution to the problems it is often claimed to solve.


Regards,
Nick Maclaren.
From: Morten Reistad on
In article <SzbAn.202687$Ye4.66545(a)newsfe11.iad>,
Robert Myers <rbmyersusa(a)gmail.com> wrote:
>MitchAlsup wrote:
>
>>
>> Also note: if you look at the volume of chips that go into servers and
>> other big iron, it represents an aftenoon in the FAB per year compared
>> to the desktop and notebooks,... A profitable afternoon, but not big
>> enough for an Intel nor AMD to alter design team directions.
>>
>
>If you are Google, though, you can make your own rules, if you want to
>badly enough:
>
>http://www.channelregister.co.uk/2010/04/22/google_the_server_chip_designer/
>
><quote>
>
>But an earlier Times story indicated that Agnilux [recently acquired by
>Google] was brewing "some kind of server."
>
></quote>
>
>If anyone has the incentive to build a no-frills, low-power chip that
>can afford to wait, if necessary, it would be Google.
>
>Data centers may not account for much chip volume, but they sure do
>gobble electricity.

These designs seem to be low-hanging fruit. If you are Intel, or
even AMD, and possibly Via, you should be able to take a design a
couple of years old, implement it in a modern process, and use the
leftover space for cache, cache, more cache, and a cache interconnect.

If bits and pieces could be powered on and off under OS/hypervisor
control, it could be a real winner for laptops too: keep a small
cache and a single processor running when the machine isn't doing
anything major, and fire it all up when there is system load.
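
The OS-side hook for that already exists in crude form; Linux, for
instance, lets the OS take individual cores offline and bring them
back through sysfs. A minimal sketch (cpu3 is only an example, it
needs root, and whether an offlined core is actually power-gated is
up to the hardware and firmware):

/* Sketch: parking and waking a core through the Linux CPU-hotplug
 * sysfs interface.  cpu3 is an arbitrary example; run as root. */
#include <stdio.h>

static int set_cpu_online(int cpu, int online)
{
    char path[96];
    FILE *f;

    snprintf(path, sizeof path, "/sys/devices/system/cpu/cpu%d/online", cpu);
    f = fopen(path, "w");
    if (!f) {
        perror(path);
        return -1;
    }
    fprintf(f, "%d\n", online ? 1 : 0);
    return fclose(f);
}

int main(void)
{
    set_cpu_online(3, 0);   /* park the core while the machine is idle */
    set_cpu_online(3, 1);   /* wake it again when load shows up        */
    return 0;
}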

The next issue is to be a little more intelligent about cache
replacement, since it is so vital to the performance of the system.

-- mrr