From: "Andy "Krazy" Glew" on
Brett Davis wrote:
> In article <4B00EB3A.3060700(a)patten-glew.net>,
> "Andy \"Krazy\" Glew" <ag-news(a)patten-glew.net> wrote:
>> Are you bypassing between the clusters? If so, you have a bypass
>> cluster. Or else how are you transferring registers across iterations?
>> It sounds as if you have an incomplete 2 cycle inter-cluster bypass.
>
> After looking at the bypass problem I have decided that there will be none.
> Minimal signaling: on the loop branch you signal a loop detect, and
> after a loop or a few loops signal an attempt to do odd iterations only;
> the second CPU would ack and do the even iterations.
>
> So this is separate CPUs running the same code, no sharing of register state.
> There are some other things you can do also: if one CPU is farther ahead
> and fails a speculated branch, the second CPU has less state to rewind, if
> any, and so the second CPU takes the lead. Faster than one CPU.

I take it that you are designing an ISA from scratch, rather than trying
to apply SpMT to existing binaries (which have registers live across
iterations).
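
For concreteness, that constraint means the odd/even split only works on
loops shaped roughly like this (a hypothetical C sketch, names made up),
where the only value carried between iterations is the induction
variable, which each core can recompute for itself:

  /* Both cores run the same loop code; each skips alternate
   * iterations.  All other cross-iteration state lives in memory. */
  void worker(int core_id, int n, int *dst, const int *src)
  {
      /* core 0 takes the even iterations, core 1 the odd ones */
      for (int i = core_id; i < n; i += 2)
          dst[i] = src[i] * 2;    /* independent per-iteration work */
  }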

Several of us have looked at "all state in memory" SpMT.


> Few common folk can make any use of multi-cores; if I can turn my 8-core
> Bulldozer into a 4-core that's 5% faster, I will, as will most.

Actually, if you can turn off 4 of the cores completely, you will
probably save enough power/heat that you can run the remaining cores
more than 5-10% faster.
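
Rough arithmetic: dynamic power scales roughly as f*V^2, and V has to
rise with f, so power grows more like f^3.  Turning off half the cores
frees roughly half the socket's dynamic budget, which buys the surviving
cores something like a 2^(1/3), i.e. ~25%, frequency bump, ignoring
leakage and the shared uncore.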

Your idea may still be good, but you have to win more than 5%.


> "Loops are easy." ;) Pray tell where these plentiful easy to speedup
> areas are in CPU design. ;) Run strait into a brick wall you have. ;)

I haven't.

The guys I left behind have.

From: Matt on
On 15 Nov., 07:50, "Andy \"Krazy\" Glew" <ag-n...(a)patten-glew.net>
wrote:
> Brett Davis wrote:
> > Bulldozer details + bobcat
>
> BRIEF:
>
> AMD's Bulldozer is an MCMT (MultiCluster MultiThreaded)
> microarchitecture.  That's my baby!
>
> DETAIL:
[...]
> True, there are differences, and I am sure more will become evident as
> more Bulldozer information becomes public.  For example, although I came
> up with MCMT to make Willamette-style threading faster, I have always
> wanted to put SpMT, Speculative Multithreading, on such a substrate.
> SpMT has potential to speed up a single thread of execution, by
> splitting it up into separate threads and running the separate threads
> on different clusters, whereas Willamette-style hyperthreading, and
> Bulldozer-style MCMT (apparently), only speed up workloads that have
> existing independent threads. I still want to build SpMT.  My work at
> Wisconsin showed that SpMT on a Willamette substrate was constrained by
> Willamette's poor threading microarchitecture, so naturally I had to
> first create the best explicit threading microarchitecture I could, and
> then run SpMT on top of it.

Have you seen the "eager execution" patent application (app. no.
20090172370)? I mentioned it in my blog a while back (
http://citavia.blog.de/2009/07/07/more-details-on-bulldozers-multi-threading-and-single-thread-execution-6464533/
). This patent application mentions some ways to use the clusters to
speed up single thread performance. There are some more patent
applications on this topic, but they just repeat most of the methods
listed in the "eager execution" one.

Matt
("Dresdenboy")
From: Terje Mathisen on
Mayan Moudgill wrote:
> Consider the following loop (and assume all loads hit in cache).
>
> while (p != NULL) {
>     n++;
>     p = p->next;
> }
>
> Please unroll to arbitrarily multiply the IPC.

The best you can do here is to have the hardware notice that (since each
node is the same size) the stride is mostly constant, so hw prefetch can
have a chance to get the next node/pointers resident in L1.

You'll still be limited to whatever the minimum L1 load-to-use latency
is, and quite often a lot worse.
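
To put a number on it: with, say, a 4-cycle L1 load-to-use latency, the
loop above retires at best one node every 4 cycles no matter how wide
the machine is, because each p = p->next load feeds the address of the
next one.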

Using sw to solve the same problem is _very_ problematic, at best:

Having something like Itanium advanced (fault-free) load attempts could
let you try to run ahead, but you would _very_ quickly run out of L1
bandwidth, issue slots and power.

The only solution to this sort of problem is some form of clustering,
where multiple nodes are known to be stored back-to-back in a single
block of memory.
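
Something like this, to make it concrete (a hypothetical layout, names
made up):

  /* Pack several list elements into one block so a single pointer
   * chase exposes several nodes' worth of work. */
  #define CLUSTER 8

  struct node_block {
      int count;                  /* slots in use, <= CLUSTER */
      int value[CLUSTER];         /* payload for each node */
      struct node_block *next;    /* one dependent load per CLUSTER nodes */
  };

  int count_nodes(const struct node_block *p)
  {
      int n = 0;
      while (p != NULL) {
          n += p->count;          /* one add covers up to CLUSTER nodes */
          p = p->next;            /* still serial, but 1/CLUSTER as often */
      }
      return n;
  }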

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
From: Mayan Moudgill on
Matt wrote:
> Have you seen the "eager execution" patent application (app. no.
> 20090172370)? I mentioned it in my blog a while back (
> http://citavia.blog.de/2009/07/07/more-details-on-bulldozers-multi-threading-and-single-thread-execution-6464533/
> ). This patent application mentions some ways to use the clusters to
> speed up single thread performance. There are some more patent
> applications on this topic, but they just repeat most of the methods
> listed in the "eager execution" one.
>

Had a look; it seems, at a first reading, to be one more of the
"wouldn't it be great to follow both paths of a branch, and once we
figure out which way the branch went, we'll copy registers from one path
over to the other, and, oh, BTW, it's not clear that it wouldn't have
been less work to just re-execute the second path" ideas.
From: Robert Myers on
On Nov 17, 7:21 pm, Mayan Moudgill <ma...(a)bestweb.net> wrote:
> Matt wrote:
> > Have you seen the "eager execution" patent application (app. no.
> > 20090172370)? I mentioned it in my blog a while back (
> >http://citavia.blog.de/2009/07/07/more-details-on-bulldozers-multi-th...
> > ). This patent application mentions some ways to use the clusters to
> > speed up single thread performance. There are some more patent
> > applications on this topic, but they just repeat most of the methods
> > listed in the "eager execution" one.
>
> Had a look; it seems, at a first reading, to be one more of the
> "wouldn't it be great to follow both paths of a branch, and once we
> figure out which way the branch went, we'll copy registers from one path
> over to the other, and, oh, BTW, it's not clear that it wouldn't have
> been less work to just re-execute the second path" ideas.

I've stopped reading papers about such schemes. Everything falls
apart when you ask the question, "But is it worth it?" in terms of power and
die area.

As to copying registers, though, is that what you'd really do? Why
wouldn't it be just like register renaming?

Robert.