From: Andy 'Krazy' Glew on
On 5/27/2010 6:05 AM, nedbrek wrote:
> Hello all,
>
> "Andy 'Krazy' Glew"<ag-news(a)patten-glew.net> wrote in message
> news:4BEED0FE.9000809(a)patten-glew.net...
>> On 5/15/2010 4:44 AM, nedbrek wrote:
>>> "Andy 'Krazy' Glew"<ag-news(a)patten-glew.net> wrote in message
>>> news:4BED6F7D.9040001(a)patten-glew.net...
>>>> On 5/14/2010 5:31 AM, nedbrek wrote:
>>>
>>> I know CMU worked on a uop scheme for VAX. Do you know if Bob Colwell
>>> was exposed to that while doing his graduate work? That would seem
>>> like a pretty big boost from academia... I can imagine this work
>>> getting panned in the journals.
>>
>> I have no recollection of such CMU work, or of Bob talking about it during
>> P6.
>
> Ok. I will have to find a reference to it.
>
>>> Jim Smith did 2-bit branch predictors, Wikipedia says while at CDC.
>>> I'm so used to him being an academic... Yale and Patt take credit
>>> for the gshare scheme, although that looks like independent
>>> invention.
>>
>> Yale *and* Patt? I think you mean Yeh and Patt, for gselect.
>
> It's funny, I always say it wrong too. You'd think seeing it in writing
> would help...
>
>> AFAIK gshare is due to McFarling.
>
> You are, of course, correct. I always get branch predictor terminology
> wrong.
>
>> My own work in trace caches was done 1987-1990, and I brought it to
>> Intel, where it was evaluated for the P6. Admittedly, I was a part-time
>> grad student under Hwu at the University of Illinois at the time,
>> working for some of that time at Gould and Motorola. Peleg and Weiser
>> may have gotten the patent, but I am reasonably sure that I originated
>> the term "trace cache".
>
> It's amazing that such a big step as the P6 (from P5) considered the
> additional risk of a trace cache. If you all had known that P6 would be the
> "ultimate microarchitecture" (in both senses of the word), you might have
> pushed harder to get it in. :)

My big motivations for trace cache, on P6, were

a) simplify decoding - and when we backed off from a 4-4-4-4 template to 3-1-1, that made it easy enough.

(Besides, knowing what I know now, I think an AMD Kn-style decoder, or at least my interpretation of it, is feasible.
Instead of something like 4-4-4-4, producing from 1 to 16 uops, it would always produce 4 uops - each decoder is told
the uop# it is to produce. A toy sketch follows after this list.)

b) support high instruction fetch bandwidth - like 16-wide, which needed the hammock stuff below.

c) allow optimizations - avoid repeated work by saving renames in the trace cache, etc., etc.

E.g., my UIUC work on trace caches was all about (1) reducing renamer complexity by caching renames, and (2) doing
hammocks - convergent code - efficiently. I had the ill-conceived notion that hammocks were made easier if you could
see both the divergence and the convergence, and all paths, in the same long instruction trace from the trace cache.
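
To make (a) concrete, here is a toy C model of the Kn-style fixed-slot decode idea from the parenthetical above - my
interpretation only, with made-up instruction names and uop counts:

/* Toy model of a "Kn-style" fixed-slot decode: 4 decoders per cycle,
   each always emitting exactly one uop, and each told which
   (instruction, uop#) slot it owns.  Instruction names and uop counts
   are invented for illustration. */
#include <stdio.h>

#define WIDTH 4                            /* decoders per cycle */

struct inst { const char *name; int nuops; };

static const struct inst stream[] = {
    { "add", 1 }, { "load-op", 2 }, { "push", 3 },
    { "inc", 1 }, { "call", 4 },    { "mov", 1 },
};

int main(void) {
    int n = sizeof stream / sizeof stream[0];
    int i = 0, u = 0;                      /* (instruction, uop#) cursor */
    for (int cycle = 0; i < n; cycle++) {
        printf("cycle %d:\n", cycle);
        for (int d = 0; d < WIDTH && i < n; d++) {
            /* decoder d gets the raw bytes of stream[i] plus the uop
               index u it must produce - exactly one uop out, always */
            printf("  decoder %d: %s uop %d of %d\n",
                   d, stream[i].name, u + 1, stream[i].nuops);
            if (++u == stream[i].nuops) { i++; u = 0; }
        }
    }
    return 0;
}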

Caching the renames is probably still a good idea. Mitch did something similar.
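
Something like this minimal C sketch is what I mean by caching renames - all the structures here are hypothetical,
just to show the shape: the cached trace carries trace-local destination tags, so a hit costs one block allocation
of physical registers instead of N serial rename steps.

/* Minimal sketch: uops in a cached trace carry precomputed trace-local
   destination tags (t0, t1, ...); sources name either an architectural
   register (looked up once) or an earlier tag.  One block allocation of
   physical registers then stands in for serial renaming. */
#include <stdio.h>

enum kind { ARCH, TLOCAL };
struct src { enum kind k; int idx; };
struct uop { const char *op; struct src a, b; int dest_tag; };

int main(void) {
    /* cached trace: t0 = r1 + r2 ; t1 = t0 + r3 */
    const struct uop trace[] = {
        { "add", { ARCH, 1 },   { ARCH, 2 }, 0 },
        { "add", { TLOCAL, 0 }, { ARCH, 3 }, 1 },
    };
    const int ntags = 2;
    int rat[8] = { 0, 1, 2, 3, 4, 5, 6, 7 };   /* arch -> physical map */
    int next_preg = 10;                        /* toy free list */

    int base = next_preg;          /* one block allocation for the trace */
    next_preg += ntags;            /* free list advances once, not per uop */

    for (int i = 0; i < ntags; i++) {
        const struct uop *u = &trace[i];
        int sa = (u->a.k == TLOCAL) ? base + u->a.idx : rat[u->a.idx];
        int sb = (u->b.k == TLOCAL) ? base + u->b.idx : rat[u->b.idx];
        /* a real design would also update rat[] at trace exit */
        printf("%s p%d, p%d -> p%d\n", u->op, sa, sb, base + u->dest_tag);
    }
    return 0;
}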

Saving optimizations may still be a good idea. But the hammock optimizations, probably not. At UW I realized that I
could create a "trace cache for the branch predictor", which I called an unrolled BTB, that supported high instruction
fetch bandwidth without having to create a trace cache.
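
A rough C sketch of what I mean by an unrolled BTB - the entry layout and sizes are illustrative guesses, not the
real design:

/* Rough sketch of an "unrolled BTB": an entry, indexed by the address
   that starts a fetch run, caches the next few predicted fetch-block
   start addresses.  One lookup then steers several I-cache accesses in
   a cycle - wide fetch from an ordinary instruction cache, with no
   trace cache holding the bytes.  Fields and sizes are guesses. */
#include <stdio.h>
#include <stdint.h>

#define DEPTH 4                    /* predicted fetch blocks per lookup */

struct ubtb_entry {
    uint32_t tag;                  /* fetch-run start address */
    uint32_t block[DEPTH];         /* predicted successor block starts */
};

int main(void) {
    /* toy entry: run at 0x1000 predicted to flow through a taken
       branch at the end of the 0x1040 block */
    const struct ubtb_entry e = {
        0x1000, { 0x1000, 0x1040, 0x2000, 0x2010 }
    };
    uint32_t fetch_pc = 0x1000;

    if (e.tag == fetch_pc) {
        for (int i = 0; i < DEPTH; i++)   /* DEPTH parallel I-cache reads */
            printf("fetch block %d from %#x\n", i, (unsigned)e.block[i]);
    }
    return 0;
}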
From: Andy 'Krazy' Glew on
On 5/27/2010 6:05 AM, nedbrek wrote:
> Hello all,
>
> "Andy 'Krazy' Glew"<ag-news(a)patten-glew.net> wrote in message
> news:4BEED0FE.9000809(a)patten-glew.net...
>> On 5/15/2010 4:44 AM, nedbrek wrote:
>>> "Andy 'Krazy' Glew"<ag-news(a)patten-glew.net> wrote in message
>>> news:4BED6F7D.9040001(a)patten-glew.net...
>>>> On 5/14/2010 5:31 AM, nedbrek wrote:
>>>
>>> I know CMU worked on a uop scheme for VAX. Do you know if Bob Colwell
>>> was exposed to that while doing his graduate work?
>>
>> I have no recollection of such CMU work, or of Bob talking about it during
>> P6.

But of course HPSm had done uops (I think) for VAX. And I had done uops for the Motorola 68000, since Motorola was my
employer around that time.

Plus, of course, we have barely gotten past 4-wide machines.
From: Andy 'Krazy' Glew on
On 5/27/2010 9:14 PM, Andy 'Krazy' Glew wrote:
> On 5/27/2010 6:05 AM, nedbrek wrote:

>> It's amazing that such a big step as the P6 (from P5) considered the
>> additional risk of a trace cache. If you all had known that P6 would be the
>> "ultimate microarchitecture" (in both senses of the word), you might have
>> pushed harder to get it in. :)

I know that I expected, after P6, to turn around and do it again, with the good ideas that did not make it into P6.

But, instead, Willamette and Itanium happened. I helped start MRL. Got divorced (a P6 casualty). Went to Wisconsin,
since I didn't think Willamette or Itanium were such good ideas.


> My big motivations for trace cache, on P6, were
> ...
> b) support high instruction fetch bandwidth - like 16-wide, which
> needed the hammock stuff below.
> ...
> Saving optimizations may still be a good idea. But the hammock
> optimizations, probably not. At UW I realized that I could create a
> "trace cache for the branch predictor", which I called an unrolled BTB,
> that supported high instruction fetch bandwidth without having to create
> a trace cache.
>
> Plus, of course, we have barely gotten past 4-wide machines.

Hmm, this may be obvious to others, but I only realized it after I pushed "enter":

Maybe the "frequency wall" will be the opportunity for wide out-of-order machines, like the 16-wide machine Yale has
wanted to build for so long.

If/after multicore/manycore have run out of steam. (For a PC, my guess is that you either want <= 16 cores or 1000s,
but not numbers in between, like 100s - except for graphics. Servers, datacenters, and the movement of so much "PC"
work to shared (in space and time) systems will justify a wider range of multicore.)

If we can finesse the power problems. But I suspect we can: GPUs have lots of processors, lots of ALUs. It's not lots
of ALUs that hurt, it's the overhead. GPUs amortize the overhead over many ALUs. OOO needs to learn how to do that.

Which brings us back to trace cache, etc. It's silly to rename things over and over again. Either save the result, or
steal a trick from the GPU book: if you are executing a loop (without too much control structure) that cannot be
parallelized, fetch each instruction once and send it to 16 or 128 ALUs for the unrolled loop.
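
As a toy C illustration of the fetch-once idea - the lane count and the stand-in "uop" are mine - the loop-carried
dependence still serializes execution, but the front-end work happens once rather than once per iteration:

/* Toy accounting for the fetch-once scheme: a serial reduction loop
   whose body cannot be parallelized.  Conventionally the front end
   fetches/decodes/renames the body every iteration; here that work is
   done once, and the same uop is handed to LANES ALU slots, one per
   unrolled iteration. */
#include <stdio.h>

#define LANES 16                     /* stand-in for 16 (or 128) ALUs */

int main(void) {
    int frontend_events = 0, alu_ops = 0;
    long acc = 1;

    frontend_events += 1;            /* fetch + decode + rename: once */
    for (int i = 0; i < LANES; i++) {
        acc = acc * 3 + i;           /* same uop, issued to ALU slot i */
        alu_ops += 1;
    }
    printf("front-end events: %d, ALU ops: %d, acc = %ld\n",
           frontend_events, alu_ops, acc);
    return 0;
}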
From: nmm1 on
In article <4BFF4951.2090004(a)patten-glew.net>,
Andy 'Krazy' Glew <ag-news(a)patten-glew.net> wrote:
>>
>> Plus, of course, we have barely gotten past 4-wide machines.
>
>Hmm, this may be obvious to others, but I only realized it after
>I pushed "enter":
>
>Maybe the "frequency wall" will be the opportunity for wide
>out-of-order machines, like the 16-wide machine Yale has wanted to
>build for so long.

I am sure that it would be a great success, academically.


Regards,
Nick Maclaren.
From: nedbrek on
Hello all,

<nmm1(a)cam.ac.uk> wrote in message
news:htm044$99q$1(a)soup.linux.pwf.cam.ac.uk...
> In article <htlo2b$4pr$1(a)news.eternal-september.org>,
> nedbrek <nedbrek(a)yahoo.com> wrote:
(Nick said "The Itanic is most definitely NOT aggressively in-order,")
>>
>>> and its most aggressively out-of-order aspect was dropped with the
>>> Merced.
>>> The fact that its out-of-order properties are entirely different
>>> to the now widespread behind-the-scenes action reordering doesn't
>>> change that.
>>
>>Perhaps there is some terminology/thinking differences between us. To me
>>"in-order" is characterized by RISC thinking (do things statically, in the
>>compiler). Out-of-order is post-RISC (smarter hardware). Even something
>>like branch prediction (which works fine for in-order machines) is an OOO
>>concept (to me).
>
> Boggle. To me, "in-order" and "out-of-order" mean exactly what they
> say. An "in-order" machine executes the instructions in exactly the
> order they occur, and does not overlap them, pre-execute any part of
> them or otherwise reorder the instruction stream. This is generally
> accepted to ignore memory access reordering, because that has been
> near-universal for 50 years, except perhaps on some embedded chips.

Ok, then in what way is Itanium not in-order? The front end scoops up
instructions, expands (NOP insertion, woohoo!), rotates and presents them to
rename and execution. Execute is set up to (almost) always be 1 cycle. DET
performs the predication, and then there's writeback. Everything flows
along, in program order.
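
Here is a toy C model of that in-order flow - stage names loosely from the description above, everything else
simplified: an instruction's stage at cycle c is fixed entirely by its age, with no reordering anywhere.

/* Toy model of the in-order flow described above: each instruction
   enters one cycle after the previous and marches through the same
   stages in program order, one stage per cycle. */
#include <stdio.h>

static const char *stage[] =
    { "fetch", "expand/rotate", "rename", "execute", "DET", "writeback" };
enum { NSTAGES = 6, NINSTS = 4 };

int main(void) {
    for (int c = 0; c < NINSTS + NSTAGES - 1; c++) {
        printf("cycle %d:", c);
        for (int i = 0; i < NINSTS; i++) {
            int s = c - i;           /* in-order: stage is fixed by age */
            if (s >= 0 && s < NSTAGES)
                printf("  I%d=%s", i, stage[s]);
        }
        printf("\n");
    }
    return 0;
}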

Ned