From: Terje Mathisen on
Mayan Moudgill wrote:
> Matt wrote:
>> ). This patent application mentions some ways to use the clusters to
>> speed up single thread performance. There are some more patent
>> applications on this topic, but they just repeat most of the methods
>> listed in the "eager execution" one.
>>
>
> Had a look; it seems, at a first reading, to be one more of the
> "wouldn't it be great to follow both paths of a branch, and once we
> figure out which way the branch went, we'll copy registers from one path
> over to the other, and, oh, BTW, its not clear that it wouldn't have
> been less work to just re-execute the second path" ideas.

I agree.

Even worse, as Andy already mentioned:

With multiple cores and dynamic frequency adjustment depending upon the
number of concurrently active cores, keeping one core totally idle
instead of using it to follow a low-likelihood branch could give the
active core enough thermal headroom to run enough faster to make up for
some branch mispredictions.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
From: Brett Davis on
In article <4B02488A.1050209(a)patten-glew.net>,
"Andy \"Krazy\" Glew" <ag-news(a)patten-glew.net> wrote:

> > So this is separate CPUs running the same code, no sharing of register state.
> > There are some other things you can do also, if one CPU is farther ahead
> > and fails a speculated branch, the second CPU has less state to rewind, if
> > any, and so the second CPU takes the lead. Faster than one CPU.
>
> I take it that you are designing an ISA from scratch, rather than trying
> to apply SpMT to existing binaries (which have registers live across
> iterations).

The P4 proved that you did not need a new instruction set to do everything
that the RISC chips were doing. It also proved that you needed to recompile
to avoid the risk of performance regressions. (Later fixed.)

Best would be a CLIW instruction set with lots of visible registers;
feeding 8 ALUs with only 16 x86 registers would be interesting. ;)
But likely not productive...

> Several of us have looked at "all state in memory" SpMT.

That runs up against the fact that registers are faster.

> > Few common folk can make any use of multi-cores, if I can turn my 8 core
> > Bulldozer into a 4 core thats 5% faster, I will, as will most.
>
> Actually, if you can turn off 4 of the cores completely, you will
> probably save enough power/heat that you can run the remaining cores
> more than 5-10% faster.
>
> Your idea may still be good, but you have to win more than 5%.

On rare occasions it would get 50%; a rigged benchmark will get ~95%.
Intel would get away with this: they have the marketing dollars to
decide which benchmarks are used by publications on and off the web.

Hyperthreading benchmarks are the classic example: speedups you will
never see in the real world. The code is fabricated to have an IPC of
less than one even after OoO, so two copies can run without hitting
any of the bottlenecks, like having only 2 ALUs in the CPU...

Actually, Hyperthreading was not that good: it could not share an ALU,
the CPU had to be idle, so the benchmark had to be some variation
on pointer-chasing code that did little real work. Something you
could optimize away and get a 5 to 10x speedup.
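
To make the pointer-chasing point concrete, the inner loop of such a
benchmark boils down to something like this (the structure and names
are made up, just to show the shape):

#include <stddef.h>

struct node { struct node *next; long payload; };

/* Every load depends on the previous one, so even a wide OoO core
   sits mostly idle waiting on cache misses (IPC well below 1).
   A second hardware thread can fill those idle slots, which is why
   two copies "scale"; flattening the list into an array or adding
   prefetching would buy far more than SMT ever does. */
long chase(const struct node *p)
{
    long sum = 0;
    while (p != NULL) {
        sum += p->payload;   /* trivial work per node            */
        p = p->next;         /* serialized, likely-missing load  */
    }
    return sum;
}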

> > "Loops are easy." ;) Pray tell where these plentiful easy to speedup
> > areas are in CPU design. ;) Run straight into a brick wall you have. ;)
>
> I haven't.
>
> The guys I left behind have.

Touche. ;)

Brett
From: "Andy "Krazy" Glew" on
Mayan Moudgill wrote:
> Matt wrote:
>> Have you seen the "eager execution" patent application (app. no.
>> 20090172370)? I mentioned it in my blog a while back (
>> http://citavia.blog.de/2009/07/07/more-details-on-bulldozers-multi-threading-and-single-thread-execution-6464533/
>>
>> ). This patent application mentions some ways to use the clusters to
>> speed up single thread performance. There are some more patent
>> applications on this topic, but they just repeat most of the methods
>> listed in the "eager execution" one.
>>
>
> Had a look; it seems, at a first reading, to be one more of the
> "wouldn't it be great to follow both paths of a branch, and once we
> figure out which way the branch went, we'll copy registers from one path
> over to the other, and, oh, BTW, its not clear that it wouldn't have
> been less work to just re-execute the second path" ideas.

Eager execution predates MCMT. Tjaden and Flynn, circa 1968, if I
remember correctly, studied what we might now call eager execution.

There might be novelty in implementing eager execution on an MCMT
machine, but the basic idea is old.

My original MCMT eager execution ideas (from Wisconsin) worked in
conjunction with the second level scheduler and the PRF shared between
the clusters. When forking, you copied the renamer state (not really
that many bits). Because the PRF was shared between clusters, you
didn't need to transfer values, although you did need to wait until
values from the parent cluster had been written back, since you might
not have bypassing.
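
To put rough numbers on "not really that many bits", here is a toy C
sketch of the fork (sizes and field names invented): the child cluster
gets a copy of the architectural-to-physical map plus per-entry
written-back flags, while the values themselves stay put in the shared
PRF.

#define NUM_ARCH_REGS 16
#define NUM_PHYS_REGS 128

struct rename_map {
    unsigned char phys[NUM_ARCH_REGS];         /* arch reg -> PRF entry     */
    unsigned char written_back[NUM_ARCH_REGS]; /* value visible in the PRF? */
};

/* Fork a child cluster: copy the map, not the values.  With the PRF
   shared between clusters, the child only has to stall on entries
   whose written_back bit is still clear, since there may be no
   bypass network from the parent cluster. */
void fork_renamer(struct rename_map *child, const struct rename_map *parent)
{
    *child = *parent;    /* on the order of 16 x (7 + 1) bits of state */
}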

Eager execution is only a good idea if forking is close to free - or at
least does not disturb the parent thread. The same is true for SpMT.

---

I hope that AMD's patent applications are on inventions that were made
after I left AMD, or by people other than me - because, of course, if
they claim to have invented things that I either invented at AMD or
brought to AMD, their patents would have issues.

But I'm not going to investigate. I don't need the aggravation.
From: Mayan Moudgill on
Andy "Krazy" Glew wrote:
> Mayan Moudgill wrote:
>
>> Matt wrote:
>>
>>> Have you seen the "eager execution" patent application (app. no.
>>> 20090172370)? I mentioned it in my blog a while back (
>>> http://citavia.blog.de/2009/07/07/more-details-on-bulldozers-multi-threading-and-single-thread-execution-6464533/
>>>
>>>
>> Had a look; it seems, at a first reading, to be one more of the
>> "wouldn't it be great to follow both paths of a branch, and once we
>> figure out which way the branch went, we'll copy registers from one
>> path over to the other, and, oh, BTW, its not clear that it wouldn't
>> have been less work to just re-execute the second path" ideas.
>
> When forking, you copied the renamer state (not really
> that many bits).

You'll also have to duplicate the call-return prediction structure.

You'll have to tweak the address matching logic in the store queue.

You'll have to change the branch resolution clean-up. IMO, this is the
difficult bit.
- you have to free up some renamed registers (or, if you're using a
different implementation, partially purge the writeback queue).
- you have to free up entries in the store queue.

On a non-eager machine, the store queue is logically a FIFO; when a
branch mispredict resolves you (logically) move the valid pointer to the
last correctly predicted store. In the case of eager execution, the
store queue becomes a sparse structure, which in turn means that there
are going to be issues with allocation, load/store matching, and
identifying age for commit.
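
To illustrate (the structure below is invented, only the shape matters):
on the non-eager machine a squash is a pointer move, but once entries can
belong to different eager paths, each one needs a path tag and an explicit
age, and a squash becomes a scan that leaves holes behind.

struct sq_entry {
    unsigned long addr;
    unsigned long data;
    unsigned      age;       /* explicit, since order is no longer FIFO */
    unsigned      path_id;   /* which speculative path allocated this   */
    unsigned char valid;
};

/* Non-eager:  squash == move the tail pointer back to the last good store.
   Eager:      squash == walk the queue and kill one path's entries; the
               resulting holes are what complicates allocation, load/store
               matching and picking the oldest entry to commit. */
void squash_path(struct sq_entry *sq, int n, unsigned dead_path)
{
    for (int i = 0; i < n; i++)
        if (sq[i].valid && sq[i].path_id == dead_path)
            sq[i].valid = 0;
}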

Even freeing of mispredicted registers could be a little more expensive.
- You may need more bits to distinguish between true/false path predictions.
- If you're going to be eager on more than one branch, then mispredict
resolution can potentially kill trees of rename sequences, not just
straight-line paths (sketched below).
- Detecting a free register may become more of a challenge.
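
A sketch of that tree-killing point (the path bookkeeping is invented
for illustration): once paths can fork under other speculative paths,
resolving one branch has to reclaim a whole subtree of rename state,
not just roll back a suffix.

#define MAX_PATHS 8

struct path_info {
    int           parent;   /* -1 for the non-speculative root path */
    unsigned char live;
};

/* Is path p the killed path or one of its descendants? */
static int descends_from(const struct path_info *paths, int p, int killed)
{
    while (p != -1) {
        if (p == killed)
            return 1;
        p = paths[p].parent;
    }
    return 0;
}

/* Resolving a branch against a path kills every path forked under it,
   and with them every rename checkpoint, free-list reservation and
   store-queue entry tagged with those path ids. */
void kill_subtree(struct path_info *paths, int killed)
{
    for (int p = 0; p < MAX_PATHS; p++)
        if (paths[p].live && descends_from(paths, p, killed))
            paths[p].live = 0;
}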
From: nmm1 on
In article <961ed6a9-3bb2-4e33-95cc-754910db5444(a)c3g2000yqd.googlegroups.com>,
Robert Myers <rbmyersusa(a)gmail.com> wrote:
>On Nov 17, 7:21 pm, Mayan Moudgill <ma...(a)bestweb.net> wrote:
>> Matt wrote:
>> > Have you seen the "eager execution" patent application (app. no.
>> > 20090172370)? I mentioned it in my blog a while back (
>> >http://citavia.blog.de/2009/07/07/more-details-on-bulldozers-multi-th...
>> > ). This patent application mentions some ways to use the clusters to
>> > speed up single thread performance. There are some more patent
>> > applications on this topic, but they just repeat most of the methods
>> > listed in the "eager execution" one.
>>
>> Had a look; it seems, at a first reading, to be one more of the
>> "wouldn't it be great to follow both paths of a branch, and once we
>> figure out which way the branch went, we'll copy registers from one path
>> over to the other, and, oh, BTW, its not clear that it wouldn't have
>> been less work to just re-execute the second path" ideas.
>
>I've stopped reading papers about such schemes. Everything falls
>apart when you ask the question, "But is it worth it?" as to power and
>die area.

Most of them fall apart long before that!

Eager execution may or may not be a winner for a few parallel paths,
but it very clearly can help only with 'pipeline' penalties and not
'preloading' ones. And it is the latter that are the killer in
almost all real programs. Sun's 'scout threads' have more potential,
because they can't be debunked using only the back of an envelope,
but I still have reservations.

I still think that there is serious potential in using a cooperative
approach to branch prediction, as I have posted before. It involves
adding a few simple instructions that have no effect on the actions
of the program, but may be used to pass information from the compiler
to the prediction unit. Roughly (a concrete sketch follows the list):

SET_PRED n <registers>
Sets predictor n to an undefined hash of the register values.
USE_PRED n <confidence>
Uses that predictor, with confidence between (say) -7 and 7; negative
values imply the confidence of it not being taken.
PUSH_PREDS <predictors>
POP_PREDS <predictors>
Used at function calls and elsewhere to indicate the change in use.
Simply clearing a predictor is always a permitted action, and most
hardware would keep only a very small stack of them.
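
Purely to make that concrete (the hash, the table size and the way the
confidence is combined are all invented here, not part of the proposal,
and PUSH_PREDS/POP_PREDS are left out), the hardware side might be
modelled in C roughly as follows:

#define NUM_PREDS  8
#define TABLE_SIZE (1 << 12)

static unsigned long pred[NUM_PREDS];      /* hashes set by SET_PRED n */
static unsigned char counters[TABLE_SIZE]; /* 2-bit counters, 0..3     */

/* SET_PRED n <registers>: any repeatable hash of the register values
   will do; typically the registers would identify the object. */
void set_pred(int n, const unsigned long *regs, int nregs)
{
    unsigned long h = 0;
    for (int i = 0; i < nregs; i++)
        h = h * 0x9E3779B97F4A7C15UL + regs[i];
    pred[n] = h;
}

/* USE_PRED n <confidence> at a branch: combine the object-dependent
   hash with the branch address and bias the dynamic counter by the
   compiler's static confidence, -7..7 (negative = probably not taken). */
int use_pred(int n, int confidence, unsigned long branch_pc)
{
    unsigned idx  = (unsigned)((pred[n] ^ branch_pc) % TABLE_SIZE);
    int      vote = (int)counters[idx] - 2;      /* recentre to -2..+1 */
    return vote * 8 + confidence >= 0;           /* predict taken?     */
}

/* After the branch resolves, train the counter as usual. */
void train_pred(int n, unsigned long branch_pc, int taken)
{
    unsigned idx = (unsigned)((pred[n] ^ branch_pc) % TABLE_SIZE);
    if (taken  && counters[idx] < 3) counters[idx]++;
    if (!taken && counters[idx] > 0) counters[idx]--;
}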

And, yes, my approach is specifically aimed at O-O languages, where
the same code is used for different objects, and where the path is
dependent more on the object than the instruction. Would it be a
significant gain? Dunno, but simulating it would be a damn good
research topic for a graduate student. Not hard, no special hardware
needed (so cheap), potentially valuable, ....


Regards,
Nick Maclaren.