From: nedbrek on
Hello all,

"Andy 'Krazy' Glew" <ag-news(a)patten-glew.net> wrote in message
news:4BE72955.9000809(a)patten-glew.net...
> On 1/25/2010 3:43 AM, nedbrek wrote:
>> Hello all,
>>
>> "Andy "Krazy" Glew"<ag-news(a)patten-glew.net> wrote in message
>> news:4B5C999C.9060301(a)patten-glew.net...
>>> nedbrek wrote:
>>>> That's where my mind starts to boggle. I would need to see branch
>>>> predictor
>>>> and serialization data showing a window this big would deliver
>>>> significant performance gains. We were looking at Itanium runs of
>>>> Spec2k
>>>> built by Electron (i.e. super optimized). We were assuming very heavy
>>>> implementation (few serializing conditions). We were unable to scale
>>>> this
>>>> far.
>
> By the way, Ned's comment about needing to see branch prediction data
> indicates
> a fundamental misunderstanding of speculative multithreading.

IIRC, my comments were in reference to a generic, big, OOO. I am heavily
influenced by my history (aren't we all)...

When we were looking at Itanium, we were attempting to sell _any_ OOO
machine to a group solidly opposed to OOO. We had to find a minimal set of
complexity to justify the performance and design complexity. Layering on
unknown/unproven techniques would have been suicide (or, at least, more
suicidal than the idea already was). We hadn't seen SpMT, per se, but there
was a similar published idea we had read (~2000-2003; I forget the name: a
big OOO running heavily speculative, possibly wrong, checked by a wide IO).

When we switched to x86, we had similar problems for different reasons. The
x86 guys were (and probably still are) risk averse. They have a set beat
pattern to hit, and cannot afford to miss. Minor, incremental changes were
possible - but even those are hard to sell. An idea had to sell itself, by
itself - in this case, the large window. Layering in more complexity was
impossible.

This creates an interesting argument:
What killed uarch research/development?
1) Pentium 4
2) Itanium
3) Collapse of the ~2000 Internet bubble
4) No killer apps to use up perf
5) Other? (crazy conspiracy theories can go here)

Ned


From: Robert Myers on
On May 13, 8:44 pm, "nedbrek" <nedb...(a)yahoo.com> wrote:
> Hello all,
>
> "Andy 'Krazy' Glew" <ag-n...(a)patten-glew.net> wrote in messagenews:4BE72955.9000809(a)patten-glew.net...
> > On 1/25/2010 3:43 AM, nedbrek wrote:
> >> Hello all,
>
> >> "Andy "Krazy" Glew"<ag-n...(a)patten-glew.net>  wrote in message
> >>news:4B5C999C.9060301(a)patten-glew.net...
> >>> nedbrek wrote:
> >>>> That's where my mind starts to boggle.  I would need to see branch
> >>>> predictor
> >>>> and serialization data showing a window this big would deliver
> >>>> significant performance gains.  We were looking at Itanium runs of
> >>>> Spec2k
> >>>> built by Electron (i.e. super optimized).  We were assuming very heavy
> >>>> implementation (few serializing conditions). We were unable to scale
> >>>> this
> >>>> far.
>
> > By the way, Ned's comment about needing to see branch prediction data
> > indicates
> > a fundamental misunderstanding of speculative multithreading.
>
> IIRC, my comments were in reference to a generic, big, OOO.  I am heavily
> influenced by my history (aren't we all)...
>
> When we were looking at Itanium, we were attempting to sell _any_ OOO
> machine to a group solidly opposed to OOO.  We had to find a minimal set of
> complexity to justify the performance and design complexity.  Layering on
> unknown/unproven techniques would have been suicide (or, at least, more
> suicidal than the idea already was).  We hadn't seen SpMT, per se, but there
> was a similar published idea we had read (~2000-2003; I forget the name: a
> big OOO running heavily speculative, possibly wrong, checked by a wide IO).
>
> When we switched to x86, we had similar problems for different reasons.  The
> x86 guys were (and probably still are) risk averse.  They have a set beat
> pattern to hit, and cannot afford to miss.  Minor, incremental changes were
> possible - but even those are hard to sell.  An idea had to sell itself, by
> itself - in this case, the large window.  Layering in more complexity was
> impossible.
>
> This creates an interesting argument:
> What killed uarch research/development?
> 1) Pentium 4
> 2) Itanium
> 3) Collapse of the ~2000 Internet bubble
> 4) No killer apps to use up perf
> 5) Other? (crazy conspiracy theories can go here)
>
0) Power constraints. Both Pentium 4 and Itanium must have
contributed mightily to Intel's risk-aversion in that department.
Having smart phones and ARM nipping at Intel's one-trick-pony can't be
helping, either. There are no more transistors and/or watts to throw
at anything.

Robert.
From: MitchAlsup on
On May 13, 7:44 pm, "nedbrek" <nedb...(a)yahoo.com> wrote:
> When we switched to x86, we had similar problems for different reasons.  The
> x86 guys were (and probably still are) risk averse.  They have a set beat
> pattern to hit, and cannot afford to miss.  Minor, incremental changes were
> possible - but even those are hard to sell.  

{A pause is necessary here, just to catch my breath.}

Excepting the architectural misstep down the P4 direction and then
the retreat back to the Pentium Pro microarchitecture, has there been
anything other than architectural refinement? More cache, new
bus/interconnect, more prediction, better decoding, tweaks to the
memory and I/O; and yet the basic infrastructure of PP survives to
this day.

This evolution was "hard to sell", even considering the 50M-100M per
year rate at which they sold?

> This creates an interesting argument:
> What killed uarch research/development?
> 1) Pentium 4
> 2) Itanium
> 3) Collapse of the ~2000 Internet bubble
> 4) No killer apps to use up perf
> 5) Other? (crazy conspiracy theories can go here)

Other: We have exploited all the real architecture invented in 1959
(Stretch), 1962 (6600), 1965 (360/91), and 1967 (360/85) to their
natural evolutionary optimal implementations (i.e. dead ends). To this
we added branch prediction (although vestiges existed as early as
1967-8 (7600)), and a myriad of bells and whistles to nickel-and-dime
ourselves to where we are today.

In my opinion, the way forward in the big-computer realm is threads,
yet one cannot exploit threads with current languages (memory models
in particular) or with our current synchronization means (and the
memory traffic they entail), and perhaps not without some departure
from the von Neumann model itself (only one thing happening at a time
on a per-thread basis).

In my opinion, the way forward in the low-power realm is also threads.
Here the great big OoO machine microarchitectures burn more power than
they deliver in performance. Yet evolving back down from the big OoO
machines is not possible while benchmarks remain monothreaded, even
though smaller, simpler CPUs deliver more performance per watt and
more performance per unit die area. Yet, one does not have to evolve
back "all that far" to achieve a much better balance between
performance and performance/watt. However, I have found this a hard
sell. None of the problems mentioned above get any easier; in fact,
they become more acute as you end up threading more things.
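
A toy comparison (the numbers are invented for illustration, not
measured from any design) of why this is such a hard sell as long as
the benchmarks stay monothreaded:

# Toy numbers, purely illustrative: one big OoO core vs one small, simple core.
cores = {
    #           (single-thread perf, watts, mm^2) -- all made up
    "big OoO": (1.00, 4.0, 20.0),
    "small":   (0.55, 1.0,  5.0),
}

for name, (perf, watts, area) in cores.items():
    print("%-8s perf=%.2f  perf/W=%.2f  perf/mm^2=%.3f"
          % (name, perf, perf / watts, perf / area))

# The small core wins on perf/W and perf/mm^2 by about 2x, but a monothreaded
# benchmark only sees the raw perf column, where it loses by nearly half --
# which is why evolving back from the big OoO core is such a hard sell.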

Thus, I conclude that:
6) Running out of space to evolve killed off microarchitectural
innovation.

{And with the general caveat that no company actually does
architectural or microarchitectural research; each does development
based on short- to medium-term goals. Research happens in the large as
various companies show their wares and various competitors attempt to
incorporate or advance their adversaries' developments. Much like
biological evolution.}

Mitch
From: Andy 'Krazy' Glew on
On 5/13/2010 5:44 PM, nedbrek wrote:
> "Andy 'Krazy' Glew"<ag-news(a)patten-glew.net> wrote in message news:4BE72955.9000809(a)patten-glew.net...
>> On 1/25/2010 3:43 AM, nedbrek wrote:

>> By the way, Ned's comment about needing to see branch prediction data
>> indicates a fundamental misunderstanding of speculative multithreading.
>
> IIRC, my comments were in reference to a generic, big, OOO. I am heavily
> influenced by my history (aren't we all)...

Ah. My reasoning in the 1990s had run something like:

* to get more performance from single-threaded programs we need to increase the number of instructions in flight, and
hence the instruction window

* branch mispredictions and other serializations limit the number of instructions that a single sequencer, a single
stream of instructions, can supply

* therefore, to take advantage of a large instruction window for a logically single-threaded program, one must supply
instructions from multiple points in that program, i.e. multiple sequencers => multiple threads within the logically
single-threaded program.

Either SpMT, or some other way of exploiting control independence. SpMT is rather coarse-grained; I suspect that the
next step after SpMT would be something like static dataflow.

QED

This argument doesn't say when diminishing returns hits the OOO window. I like the kilo-instruction window research.
But, eventually, it will hit.
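
To put a rough number on the second bullet (assumed rates, not data): if roughly one instruction in five is a branch
and each branch mispredicts 5% of the time, a single sequencer supplies only on the order of 100 useful instructions
before it runs off the correct path; filling a kilo-instruction window from one sequencer would need per-branch
misprediction rates closer to 0.5%. A minimal sketch of that arithmetic:

def useful_window(branch_fraction, mispredict_rate):
    # Expected instructions fetched from the correct path before the first
    # mispredicted branch (geometric distribution, branches assumed independent).
    return 1.0 / (branch_fraction * mispredict_rate)

print(useful_window(0.20, 0.05))    # -> 100.0 useful instructions
print(useful_window(0.20, 0.005))   # -> 1000.0, what a kilo-instruction window needs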


> When we were looking at Itanium, we were attempting to sell _any_ OOO
> machine to a group solidly opposed to OOO.

I'm getting historical context.

I attempted to sell OOO, small window, big window, and then SpMT to Itanium circa 1998. The original Tejas. But the OOO
got sidetracked, and while they were interested in SpMT, they wanted SpMT as an alternative to OOO. And, much as I like
SpMT, OOO is a more proven technology. They also liked run-ahead.

You tried again circa 2003?


> When we switched to x86, we had similar problems for different reasons. The
> x86 guys were (and probably still are) risk averse. They have a set beat
> pattern to hit, and cannot afford to miss. Minor, incremental changes were
> possible - but even those are hard to sell. An idea had to sell itself, by
> itself - in this case, the large window. Layering in more complexity was
> impossible.
>
> This creates an interesting argument:
> What killed uarch research/development?
> 1) Pentium 4
> 2) Itanium
> 3) Collapse of the ~2000 Internet bubble
> 4) No killer apps to use up perf
> 5) Other? (crazy conspiracy theories can go here)

6) Collapse of all effective competition to Intel and x86. Without other companies doing different things, Intel has
little incentive to innovate.

7) Cost of fabs. High cost => risk aversion.

Although overall I see two major wrong turns (Pentium 4 and Itanium), coupled to a lack of demand (no killer apps),
leading to a situation where the VLSI got dense enough for multicore, and multicore will absorb all mindshare for a
decade or so.

Plus the power issues. Which were exacerbated by Pentium 4's high frequency approach.
From: nedbrek on
Hello all,

"Andy 'Krazy' Glew" <ag-news(a)patten-glew.net> wrote in message
news:4BECF264.8060401(a)patten-glew.net...
> On 5/13/2010 5:44 PM, nedbrek wrote:
>> "Andy 'Krazy' Glew"<ag-news(a)patten-glew.net> wrote in message
>> news:4BE72955.9000809(a)patten-glew.net...
>>> On 1/25/2010 3:43 AM, nedbrek wrote:
>
>>> By the way, Ned's comment about needing to see branch prediction data
>>> indicates a fundamental misunderstanding of speculative multithreading.
>>
>> IIRC, my comments were in reference to a generic, big, OOO. I am heavily
>> influenced by my history (aren't we all)...
>
> Ah. My reasoning in the 1990s had run something like:
>
> * therefore, to take advantage of a large instruction window for a
> logically
> single threaded program, one must supply instructions from
> multiple points in that program, multiple sequencers. => multiple threads
> within the logical single threaded program.
>
> Either SpMT, or some other way of exploiting control independence. SpMT is
> rather coarse grained; I suspect that the next step after
> SpMT would be something like static dataflow.

I grew up under Yale Patt, with "10 IPC on gcc". A lot of people thought it
was possible, without multiple IPs.

>> When we were looking at Itanium, we were attempting to sell _any_ OOO
>> machine to a group solidly opposed to OOO.
>
> I'm getting historical context.
>
> I attempted to sell OOO, small window, big window, and then SpMT to
> Itanium
> circa 1998. The original Tejas. But the OOO got sidetracked,
> and while they were interested in SpMT, they wanted SpMT as an alternative
> to
> OOO. And, much as I like SpMT, OOO is a more proven technology.
> They also liked run-ahead.
>
> You tried again circa 2003?

Yes, I will need to draw up the exact timeline sometime. I started in MRL
in Jan 01, as part of a group (of two, counting me!) to develop a new
Itanium strawman. We had a blank check: whatever it took to make Itanium
the performance leader. IIRC, by 2003 things were actually starting to wind
down, as we were coming to realize that nothing would ever be done. But
yeah, 2002-2003.

>> This creates an interesting argument:
>> What killed uarch research/development?
>> 1) Pentium 4
>> 2) Itanium
>> 3) Collapse of the ~2000 Internet bubble
>> 4) No killer apps to use up perf
>> 5) Other? (crazy conspiracy theories can go here)
>
> 6) Collapse of all effective competition to Intel and x86. Without other
> companies doing different things, Intel has little incentive to innovate.

Intel's biggest competitor is itself (competition among teams is probably
too aggressive). I would rather phrase this as, "Intel could allow the more
innovative (risky) ideas to get tabled in favor of less risky
alternatives." Performance is going up, just in an evolutionary rather
than revolutionary manner.

> 7) Cost of fabs. High cost => risk aversion.

Definitely. That doesn't mean a small team can't be set aside to do
something revolutionary. The 80 core thing was this sort of idea, only done
terribly wrong.

> Although overall I see two major wrong turns (Pentium 4 and Itanium),
> coupled
> to a lack of demand (no killer apps), leading to a situation where the
> VLSI got
> dense enough for multicore, and multicore will absorb all mindshare for a
> decade or so.
>
> Plus the power issues. Which were exacerbated by Pentium 4's high
> frequency
> approach.

Yes, multicore is the new bandwagon. P4 pushed the frequency pendulum too
far, and now we've overreacted.

The ironic thing (which we demonstrated, and which made us hugely
unpopular) is that a massive many-core burns just as much power as (or
more than) a smart OOO on anything but grossly parallel applications.
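
A crude sketch of the kind of arithmetic behind that (the numbers here
are assumptions for illustration, not the data we gathered): by Amdahl's
law, during the serial phases a sea of small cores delivers one small
core's worth of performance while the rest still burn leakage and uncore
power, so the many-core only wins on energy-to-solution when the parallel
fraction is quite high.

# Crude energy comparison (all numbers are assumptions, not measurements):
# one big OoO core vs N small cores on a workload whose parallel fraction
# is f (Amdahl's law), comparing energy to finish a fixed amount of work.

BIG_PERF, BIG_POWER = 1.0, 4.0              # one "smart" OOO core
SMALL_PERF, SMALL_POWER, N = 0.4, 1.0, 16   # a sea of small cores
IDLE_FRAC = 0.3   # idle small cores still burn 30% power (leakage, uncore)

def energy_big(f):
    # The big core runs both serial and parallel phases at BIG_PERF.
    return (1.0 / BIG_PERF) * BIG_POWER

def energy_manycore(f):
    # Serial phase: one small core active, the rest idling.
    t_serial = (1.0 - f) / SMALL_PERF
    p_serial = SMALL_POWER * (1 + IDLE_FRAC * (N - 1))
    # Parallel phase: all N small cores active.
    t_parallel = f / (SMALL_PERF * N)
    p_parallel = SMALL_POWER * N
    return t_serial * p_serial + t_parallel * p_parallel

for f in (0.5, 0.9, 0.99):
    print("f=%.2f  big OOO: %.2f  many-core: %.2f"
          % (f, energy_big(f), energy_manycore(f)))

# With these made-up numbers the many-core only breaks even on energy when
# roughly 85-90% of the work is parallel; below that it burns as much or
# more power than the single smart OOO core to finish the same job.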

Ned