From: nmm1 on
In article <ht1k70$lvu$1(a)news.eternal-september.org>,
ned <nedbrek(a)yahoo.com> wrote:
>>>
>>>That is to say, I would have put my money on the hardware guys to find
>>>parallelism in the instruction stream while software guys were still
>>>dithering about language aesthetics. I thought this way all the way
>>>up to the 90nm step for the P4.
>>
>> As you know, I didn't. The performance/clock factor (which is what
>> the architecture delivers) hasn't improved much.
>
>Near the end of my uarch career, I came to realize that much of "the
>game" is keeping perf/clock from collapsing while ramping clock. At
>least, that is about the only thing that has been successful.

Precisely. But we haven't seen any increase in clock rate in nearly
a decade now - isn't it time to accept that a rethink is needed?


Regards,
Nick Maclaren.
From: Andy 'Krazy' Glew on
On 5/20/2010 5:50 AM, Morten Reistad wrote:

> Yes, the compatibility argument is important. But no one can do magic.
> We are now at the end, or pretty close to it, of the rope regarding
> single-processor performance on von Neumann computers. We used
> pipelining, OOO execution, and lots of other tricks to push this
> envelope a hundredfold or more. Now we are up against handling the
> logic expressions in the code.

"Up against handling the logic expressions in the code." ? Hardly. Unless you mean "up against handling the logic
expressions in the code, whose execution is delayed because of memory".

Run this experiment on your favorite simulator. Note that many simulators are not capable of such limit studies.

Set the latency of all arithmetic operations - integer, FP, logical - to 0 cycles. Allow an infinite number of them to
execute per cycle. But keep all of the rest of the system the same - instruction window, # cache misses outstanding per
cycle, etc.

Your speedup is usually not that much - usually not even 2X - except for multimedia and streaming codes.
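To make the experiment concrete, here is a minimal toy sketch of such a limit study in Python - a pure dataflow
schedule over a synthetic trace. All names, latencies, and the trace itself are made up for illustration; this is
not the API of any real simulator.

BASELINE_LAT = {"alu": 1, "fp": 4, "load_hit": 3, "load_miss": 200}

def finish_times(trace, lat):
    # trace entries are (kind, [indices of producing instructions]).
    # Returns each instruction's completion cycle assuming infinite
    # issue width: a pure dataflow limit, no structural hazards.
    done = []
    for kind, deps in trace:
        start = max((done[d] for d in deps), default=0)
        done.append(start + lat[kind])
    return done

# A pointer-chasing style trace: each miss feeds the address
# arithmetic for the next miss, so memory dominates the critical path.
trace = []
for _ in range(100):
    prev = [len(trace) - 1] if trace else []
    trace.append(("load_miss", prev))          # dependent cache miss
    trace.append(("alu", [len(trace) - 1]))    # address calculation

base = finish_times(trace, BASELINE_LAT)[-1]
ideal = finish_times(trace, {**BASELINE_LAT, "alu": 0, "fp": 0})[-1]
print(f"baseline: {base} cycles; zero-latency ALU: {ideal} cycles; "
      f"speedup: {base / ideal:.2f}x")

On this trace the zero-latency-ALU "speedup" is about 1.005X: the arithmetic was never on the critical path.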

You are usually limited by memory. Do the same thing for memory operations, especially cache misses - 0 latency,
infinite bandwidth - and you get a much better result. Branch mispredictions are the awkward case in such an
idealized model: should a misprediction cost 0 cycles (in which case you see great speedups) or N cycles, where N
is the pipeline depth (in which case you see good, but not great, speedups)? If you idealize the pipeline away as
well, everything takes zero cycles and the speedups are great again - and that all-zeroes limit study is worth
running in any case, because it will probably show you that there are artifacts in your simulator.

Do similar limit studies - make certain cache hit and miss latencies zero.
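Continuing the toy sketch above (this reuses trace, BASELINE_LAT, finish_times, and base from that block), the
memory-side idealization is just another latency override:

ideal_mem = finish_times(trace, {**BASELINE_LAT,
                                 "load_hit": 0, "load_miss": 0})[-1]
print(f"zero-latency memory speedup: {base / ideal_mem:.2f}x")
# ~200x on the pointer-chasing trace, where zeroing the ALU latency
# bought almost nothing.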

So long as there is a non-zero cache miss latency, you will see that the instruction window is a bottleneck. Not
just its static size (size of RS, ROB), but also its dynamic size (the distance between branch mispredictions).
Making the window infinitely large does not help unless you either reduce branch mispredictions or have multiple
sequencers.
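A back-of-the-envelope way to see the dynamic-window limit is Little's law: the only misses that can overlap are
the ones resident in the useful window. A hypothetical sketch, with every number below assumed for illustration:

def overlapped_misses(window, branch_interval, miss_rate):
    # A misprediction every branch_interval instructions caps the
    # useful window, no matter how large the static window is.
    useful_window = min(window, branch_interval)
    return useful_window * miss_rate   # misses in flight at once

for w in (128, 512, 4096):
    print(w, overlapped_misses(w, branch_interval=300, miss_rate=0.02))
# 128 -> 2.56, 512 -> 6.0, 4096 -> 6.0: past the branch interval,
# growing the window buys nothing.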

I suspect that there is, or should be, a limit study that can be run to examine the speed-of-light bound for
prefetchers into a memory hierarchy of fixed physical parameters. I.e. you cannot make all memory close to all
processing elements - you are necessarily limited, literally, by the speed of light. Unfortunately, I do not know
how to do it right now - I think I have figured it out more than once, but that would be in notebooks that I no
longer have access to.

I suspect that it can or should be possible to show that prefetching into a single L1/L2/L3/... cache hierarchy with a
single arbitrarily large instruction window is suboptimal. I suspect that it is necessary to have a multi-headed cache
hierarchy - multiple L1s per L2, multiple L2s per L3, etc. - in order to get the best performance.

This applies equally well to explicit parallel programming as to the implicit parallelism of OOO dataflow
execution (ILP and MLP). It applies just as well to prefetchers.

From: Robert Myers on
Andy 'Krazy' Glew wrote:
> On 5/19/2010 11:23 PM, Andy 'Krazy' Glew wrote:
>
>> I must admit that I am puzzled as to why this happened. I thought that
>> P6 showed (and is still showing) that OOO microarchitecture can be
>> successful. I would have expected Intel to bet on the proven winners, by
>> doing it over again. Didn't happen.
>
> One hypothesis, based on observation:
>
> Many non-x86 processor companies failed at about this time:
> DEC, IBM downsized, RISC.
>
> Many refugees from these companies spread throughout the rest of the
> industry, including Intel and AMD, carrying their attitudes that of
> course OOO could not be pushed further.
>
> At the beginning of Willamette I remember Dave Sager coming back from an
> invitation only meeting - Copper Mountain? - of computer architects who
> all agreed that OOO could not be pushed further. Nobody asked my
> opinion. And, I daresay, nobody at that conference had actually
> built a successful OOO processor; quite possibly, the only OOO
> experience at that conference was with the PPC 670.
>

The big selling point of IA-64 was that it didn't have the N^2
complexity of out-of-order. Instead, true to the law of conservation of
complexity, it had lots of complexity elsewhere, and much of it turned
out to be of little benefit.

Patterson had been saying for a while (in print) that cores were already
too big and too complex.

Would an insiders-only meeting have had anything to add?

The economics of the business say you need a one-size-fits-all design,
or maybe two, as previously suggested: one for very low power and
another for everything else.

Most desktop users already have more power than they need or even can
use, and many of the remaining volume consumers benefit more from extra
cores than they would from aggressive microarchitecture.

Who does that leave? A boutique processor for Wall Street quants who
are forever trying to outguess each other by a couple of milliseconds,
maybe.

Robert.
From: David Kanter on
> Hmm, I think I am just realizing that we need different metrics, with different acronyms.  I want to express the number
> of outstanding operations.  IPC is not a measure of ILP.  OOO window size is extreme.  A lower number is the number of
> instructions simultaneously in some stage of execution; more precisely, simultaneously at the same stage of execution.
>
> "SIX"?: simultaneous instructions in execution?  "SIF"?:  ... in flight?   "SMF"?: simultaneous memory operations in
> flight?

What do you mean by 'same stage of execution'?

Anyway, I think the concept you are trying to get at is what I'd call
a 'cross section'. Essentially if you think of the CPU as a physical
pipeline (or the memory hierarchy as a pipeline), you want the cross
sectional area. So perhaps the right terms are 'memory cross section'
and 'instruction cross section'?

DK
From: Andy 'Krazy' Glew on
On 5/20/2010 7:02 PM, David Kanter wrote:
>> Hmm, I think I am just realizing that we need different metrics, with different acronyms. I want to express the number
>> of outstanding operations. IPC is not a measure of ILP. OOO window size is extreme. A lower number is the number of
>> instructions simultaneously in some stage of execution; more precisely, simultaneously at the same stage of execution.
>>
>> "SIX"?: simultaneous instructions in execution? "SIF"?: ... in flight? "SMF"?: simultaneous memory operations in
>> flight?
>
> What do you mean by 'same stage of execution'?
>
> Anyway, I think the concept you are trying to get at is what I'd call
> a 'cross section'. Essentially if you think of the CPU as a physical
> pipeline (or the memory hierarchy as a pipeline), you want the cross
> sectional area. So perhaps the right terms are 'memory cross section'
> and 'instruction cross section'?
>
> DK

Exactly: a cross section.

I was trying to use "same stage of execution" to filter out pipeline effects. E.g. a machine with a 42-deep
pipeline, capable of only one load per cycle, with a load-data-to-load-address latency of 1 cycle, etc., might be
said to have 42 loads in flight at all times, i.e. a cross section of 42. However, most of that parallelism would
be due to instruction fetch effects, not the actual execution parallelism.
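For what it is worth, that 42-in-flight arithmetic is just Little's law: cross section = throughput x latency. A
trivial sketch with the (assumed) numbers from the example above:

issue_rate = 1   # loads issued per cycle (assumption from the example)
depth = 42       # pipeline depth in cycles (from the example)
print("cross section:", issue_rate * depth)   # -> 42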