Big OOO, SpMT, and possible designs (Was Re: Free/Open x86 Sim) [Computer Architecture]


From: Robert Myers on 22 May 2010 16:51

On May 22, 2:51 pm, Andy 'Krazy' Glew <ag-n...(a)patten-glew.net> wrote:
> On 5/22/2010 10:38 AM, Robert Myers wrote:
>
> > On May 21, 9:59 pm, Andy 'Krazy' Glew<ag-n...(a)patten-glew.net> wrote:
> >> On 5/21/2010 10:02 AM, Robert Myers wrote:
> >>>> On 5/20/2010 7:02 PM, David Kanter wrote:
> >>>>> Anyway, I think the concept you are trying to get at is what I'd call
> >>>>> a 'cross section'. Essentially if you think of the CPU as a physical
> >>>>> pipeline (or the memory hierarchy as a pipeline), you want the cross
> >>>>> sectional area. So perhaps the right terms are 'memory cross section'
> >>>>> and 'instruction cross section'?
>
> >>>>> DK
>
> >>>> Exactly: a cross section.
>
> >>> The use of "cross section" in this context seems not especially apt,
> >>> as one takes the cross section of a pipe transverse to the pipe.
>
> >> But that's exactly what we are talking about: the width, or diameter, of the pipeline that is processing instructions.
>
> > But the second dimension that makes it an area and not a length is
> > parallel to the axis of the pipe. If not, I'm confused. I think you
> > want to convey : how many at once and how far apart.
>
> > Robert.
>
> I think that we may be trying to push the physical analogy too far here.
>
> If you think of a computer pipeline as a 1 dimensional structure, or perhaps n=2 dimensional (1 dimension = position in
> the pipeline, plus the number of instructions at that stage in the pipeline possibly being considered a second
> dimension), then an n-1 dimensional cutting hyperplane has 1 dimension. In general, in an n-dimensional space, we talk of
> cross-sections having n-1 dimensions.
>
> But, heck, I think of computer pipelines as having 3 or more dimensions. For example:
>
> I often think of pipelines as a sequence of buffers connected in some order, with data flowing between them. The
> buffers are themselves 2-dimensional in planar VLSI. So, cross-section might mean the area of silicon occupied by the
> instructions that are in flight, and executing, in buffers.
>
> Or if you don't like that, how about the more traditional pipeline, with an extra dimension being the number of
> processors in an MP system.
>
> We are pushing the physical or spatial analogy too far here. It's just terminology. If David Kanter finds that the
> term "cross section" helps him understand the point I am trying to convey - and I think it does - great. If you
> don't, too bad. Terminology is arbitrary; but it also matters, because it aids understanding. If there is a term that
> you like, which aids many other folk, I'll use that. Any suggestions?
>
> So far we have
> * "instructions in flight (at the same stage of execution)"
> * cross-section
> * how many at once

Ok, so now I *have* reread the discussion:

"IPC", instructions per clock, is much less fundamental than OOO
instruction window size - the difference between the
oldest and youngest instruction executing at any time.

Here, I think, is the exact measure you want, and you will have to
forgive me for thinking like a fluid mechanicist.

As each instruction appears, it starts releasing dye into the stream.
As the instruction sits there and ages, the stream of dye gets longer
and longer.

If you look at any stage of instruction (your cross-section), the
leading edge of the dye pattern makes a curve, and the integral under
that curve (which is not the same as the instruction window size), is
a meaningful measure of parallelism, as you proposed, but it is *not*
the same as a cross-section.

Robert.
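
The measures being argued over in this post can be told apart on a toy example. What follows is only a sketch under stated assumptions: the trace is invented, an instruction counts as "in flight" from its start cycle up to (not including) its end cycle, and "sum of ages" is just one possible reading of the integral under the dye curve, not necessarily the one intended above. All function names are hypothetical.

```python
# Hypothetical trace: (start_cycle, end_cycle) per instruction, in program order.
# An instruction is treated as in flight while start <= cycle < end.
TRACE = [(0, 3), (0, 10), (1, 4), (2, 5), (6, 9)]

def in_flight(trace, cycle):
    """Program-order indices of the instructions in flight at `cycle`."""
    return [i for i, (start, end) in enumerate(trace) if start <= cycle < end]

def how_many_at_once(trace, cycle):
    """Number of instructions in flight (the 'how many at once' measure)."""
    return len(in_flight(trace, cycle))

def window_size(trace, cycle):
    """Program-order span from oldest to youngest in-flight instruction, inclusive."""
    live = in_flight(trace, cycle)
    return live[-1] - live[0] + 1 if live else 0

def dye_integral(trace, cycle):
    """Sum of ages of the in-flight instructions (assumed reading of the dye integral)."""
    return sum(cycle - trace[i][0] for i in in_flight(trace, cycle))

if __name__ == "__main__":
    for c in (2, 7):
        print(c, how_many_at_once(TRACE, c), window_size(TRACE, c), dye_integral(TRACE, c))
```

At cycle 7 of this made-up trace only two instructions are in flight, yet the window still spans four program-order slots because one old instruction has not finished; the age sum is different again. That is the sense in which the three numbers are not interchangeable.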

From: nedbrek on 27 May 2010 09:05

Hello all,

"Andy 'Krazy' Glew" <ag-news(a)patten-glew.net> wrote in message
news:4BEED0FE.9000809(a)patten-glew.net...
> On 5/15/2010 4:44 AM, nedbrek wrote:
>> "Andy 'Krazy' Glew"<ag-news(a)patten-glew.net> wrote in message
>> news:4BED6F7D.9040001(a)patten-glew.net...
>>> On 5/14/2010 5:31 AM, nedbrek wrote:
>>
>> I know CMU worked on a uop scheme for VAX. Do you know if Bob Colwell was
>> exposed to that while doing his graduate work? That would seem like a
>> pretty big boost from academia... I can imagine this work getting panned in
>> the journals.
>
> I have no recollection of such CMU work, or of Bob talking about it during
> P6.

Ok. I will have to find a reference to it.

>> Jim Smith did 2bit branch predictors, Wikipedia says while at CDC. I'm so
>> used to him being an academic... Yale and Patt take credit for the gshare
>> scheme, although that looks like independent invention.
>
> Yale *and* Patt? I think you mean Yeh and Patt, for gselect.

It's funny, I always say it wrong too. You'd think seeing it in writing
would help...

> AFAIK gshare is due to McFarling.

You are, of course, correct. I always get branch predictor terminology
wrong.
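
For readers who mix up the terminology as well, here is a minimal sketch of the two ideas named above: a 2-bit saturating counter, and a gshare-style table that indexes such counters with the branch address XORed with global branch history. The table size, the initial counter state, and the class names are assumptions made for illustration; this is not a description of any shipped predictor.

```python
class TwoBitCounter:
    """Saturating counter with states 0..3; predicts taken when state >= 2."""

    def __init__(self, state=1):  # assumed start: weakly not-taken
        self.state = state

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


class GShare:
    """Counters indexed by (pc XOR global history), masked to the table size."""

    def __init__(self, index_bits=10):  # table size is an arbitrary choice
        self.mask = (1 << index_bits) - 1
        self.table = [TwoBitCounter() for _ in range(1 << index_bits)]
        self.history = 0

    def _index(self, pc):
        return (pc ^ self.history) & self.mask

    def predict(self, pc):
        return self.table[self._index(pc)].predict()

    def update(self, pc, taken):
        self.table[self._index(pc)].update(taken)
        # Shift the newest outcome into the global history register.
        self.history = ((self.history << 1) | int(taken)) & self.mask


if __name__ == "__main__":
    bp = GShare(index_bits=4)
    hits = []
    for n in range(50):
        taken = (n % 2 == 0)  # strictly alternating branch
        hits.append(bp.predict(0x40) == taken)
        bp.update(0x40, taken)
    print(sum(hits), "of", len(hits), "predicted correctly")
```

A lone 2-bit counter cannot track a strictly alternating branch; with history folded into the index, the two phases of the pattern land on different counters and the predictor settles after a short warm-up.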

> My own work in trace caches was done 1987-1990, and I brought it to Intel,
> where it was evaluated for the P6. Admittedly, I was a part-time grad student
> under Hwu at the University of Illinois at the time, working for some of
> that time at Gould and Motorola. Peleg and Weiser may have gotten the
> patent, but I am reasonably sure that I originated the term "trace cache".

It's amazing that such a big step as the P6 (from P5) considered the
additional risk of a trace cache. If you all had known that P6 would be the
"ultimate microarchitecture" (in both senses of the word), you might have
pushed harder to get it in. :)

Ned

From: nedbrek on 27 May 2010 09:13

Hello all,

"Andy 'Krazy' Glew" <ag-news(a)patten-glew.net> wrote in message
news:4BF4D6E2.7060702(a)patten-glew.net...
> On 5/19/2010 11:23 PM, Andy 'Krazy' Glew wrote:
>
> Many refugees from these companies spread throughout the rest of the industry,
> including Intel and AMD, carrying their attitudes that of course OOO could
> not be pushed further.
>
> At the beginning of Willamette I remember Dave Sager coming back from an
> invitation only meeting - Copper Mountain? - of computer architects who
> all agreed that OOO could not be pushed further. Nobody asked my opinion.
> And, I daresay, nobody at that conference had actually built
> a successful OOO processor; quite possibly, the only OOO experience at
> that conference was with the PPC 670.

We had a chance to work with Dave for about a year. It was a frustrating
experience at times, but I am glad for it. Dave has a lot of interesting
ideas, and the tenacity to see them implemented. He was also willing to
change his tack when presented with data.

Ned

From: nedbrek on 27 May 2010 09:20

Hello all,

<nmm1(a)cam.ac.uk> wrote in message
news:hsm0t7$q4b$1(a)smaug.linux.pwf.cam.ac.uk...
> In article <hslv9c$oav$1(a)news.eternal-september.org>,
> nedbrek <nedbrek(a)yahoo.com> wrote:
>><nmm1(a)cam.ac.uk> wrote in message
>>news:hskb6d$t96$1(a)smaug.linux.pwf.cam.ac.uk...
>>> In article
>>> <26c1c35a-d687-4bc7-82fd-0eef2df0f714(a)c7g2000vbc.googlegroups.com>,
>>> MitchAlsup <MitchAlsup(a)aol.com> wrote:
>>>>
>>>>I am saying that the way forward will necessarily take a step
>>>>backwards to less deep pipes and less wide windows and less overall
>>>>complexity in order to surmount (or better optimize for) the power
>>>>wall.
>>>
>>> It's also better for RAS and parallelism!
>>
>>Aggressive in-order is bad for RAS (Itanium). Parallelism is a broad term,
>>obviously it is not better for extracting parallelism in a single thread! :)
>
> The Itanic is most definitely NOT aggressively in-order,

Surely this is worthy of immortalization in a sig?!

> and its most aggressively out-of-order aspect was dropped with the Merced.
> The fact that its out-of-order properties are entirely different
> to the now widespread behind-the-scenes action reordering doesn't
> change that.

Perhaps there are some terminology/thinking differences between us. To me
"in-order" is characterized by RISC thinking (do things statically, in the
compiler). Out-of-order is post-RISC (smarter hardware). Even something
like branch prediction (which works fine for in-order machines) is an OOO
concept (to me).

In this way, Merced was the ultimate in-order machine: dumbed down branch
prediction, no hardware intelligence at all - everything pushed into the
compiler.

Of course, the Itanium instruction set has many OOO concepts: load misses do
not stall the machine until a dependent use, the RSE, etc.

This led to the quote, "Itanium is an OOO personality trapped in an in-order
body."

> And I can assure you that it is NOT obvious that you get better
> single-thread parallelism by deeper pipes and more complexity, or
> even wider windows. It just looks that way at first glance.

I am curious how you are going to extract single thread performance with a
slow clock and narrow window? Move all the complexity into the compiler?
More complexity than Itanium?

Ned

From: nmm1 on 27 May 2010 10:37

In article <htlo2b$4pr$1(a)news.eternal-september.org>,
nedbrek <nedbrek(a)yahoo.com> wrote:
>
>> and its most aggressively out-of-order aspect was dropped with the Merced.
>> The fact that its out-of-order properties are entirely different
>> to the now widespread behind-the-scenes action reordering doesn't
>> change that.
>
>Perhaps there are some terminology/thinking differences between us. To me
>"in-order" is characterized by RISC thinking (do things statically, in the
>compiler). Out-of-order is post-RISC (smarter hardware). Even something
>like branch prediction (which works fine for in-order machines) is an OOO
>concept (to me).

Boggle. To me, "in-order" and "out-of-order" mean exactly what they
say. An "in-order" machine executes the instructions in exactly the
order they occur, and does not overlap them, pre-execute any part of
them or otherwise reorder the instruction stream. This is generally
accepted to ignore memory access reordering, because that has been
near-universal for 50 years, except perhaps on some embedded chips.
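
The gap between the definitions in play can be made concrete with a toy model. This is only a sketch under loud assumptions: the four-instruction program, its latencies, one issue per cycle, and the function names are all invented here. `strict_in_order` follows the "no overlap at all" definition above; `simulate` with `window=1` issues in program order but lets a long-latency instruction overlap with later independent ones until a dependent use stalls; a larger `window` lets younger ready instructions issue past a stalled older one.

```python
# Hypothetical program: (latency_in_cycles, [indices of instructions it depends on]).
PROGRAM = [
    (10, []),   # 0: long-latency load (a cache miss, say)
    (1, []),    # 1: independent op
    (1, [0]),   # 2: first use of the load result
    (1, []),    # 3: independent op
]

def strict_in_order(instrs):
    """No overlap at all: each instruction finishes before the next starts."""
    return sum(latency for latency, _ in instrs)

def simulate(instrs, window):
    """Issue at most one instruction per cycle: the oldest ready one among the
    `window` oldest unissued instructions. Returns the cycle at which the last
    result becomes available."""
    done_at = {}
    unissued = list(range(len(instrs)))
    cycle = 0
    while unissued:
        for i in unissued[:window]:
            latency, deps = instrs[i]
            if all(d in done_at and done_at[d] <= cycle for d in deps):
                done_at[i] = cycle + latency
                unissued.remove(i)
                break
        cycle += 1
    return max(done_at.values())

if __name__ == "__main__":
    print("strict in-order:", strict_in_order(PROGRAM))
    print("in-order issue, stall on use:", simulate(PROGRAM, window=1))
    print("issue from a 4-entry window:", simulate(PROGRAM, window=4))
```

On this invented program the three models finish in 13, 12 and 11 cycles respectively. The numbers mean nothing beyond showing that "in-order" covers more than one behaviour depending on whose definition is used.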

>> And I can assure you that it is NOT obvious that you get better
>> single-thread parallelism by deeper pipes and more complexity, or
>> even wider windows. It just looks that way at first glance.
>
>I am curious how you are going to extract single thread performance with a
>slow clock and narrow window? Move all the complexity into the compiler?
>More complexity than Itanium?

I didn't say that. I said that it's not obvious that going further
down the deep pipe and massively complicated path will help. I.e.
we may be approaching what is effectively a dead-end.

Regards,
Nick Maclaren.
