From: nedbrek on
Hello all,

"Robert Myers" <rbmyersusa(a)gmail.com> wrote in message
news:f2f8cde7-2bed-4003-8f40-56fd446df75f(a)j35g2000yqm.googlegroups.com...
On May 13, 8:44 pm, "nedbrek" <nedb...(a)yahoo.com> wrote:
>> This creates an interesting argument:
>> What killed uarch research/development?
>> 1) Pentium 4
>> 2) Itanium
>> 3) Collapse of the ~2000 Internet bubble
>> 4) No killer apps to use up perf
>> 5) Other? (crazy conspiracy theories can go here)
>>
> 0) Power constraints. Both Pentium 4 and Itanium must have
> contributed mightily to Intel's risk-aversion in that department.
> Having smart phones and ARM nipping at Intel's one-trick-pony can't be
> helping, either. There are no more transistors and/or watts to throw
> at anything.

Total power has remained fairly steady (~130 W at the top end), but power
per core is going down. The important thing to remember is that in
2003-2005, smart phones were completely off the radar at Intel.

It should be a lot easier to get (at least) two different designs - one
ultra low power (which sacrifices perf), one with good perf for more power
(but not crazy power).

Intel has finally achieved this with Atom and Core. The question is whether
they will merge, or allow them to drive separately.

Ned


From: nedbrek on
Hello,

"MitchAlsup" <MitchAlsup(a)aol.com> wrote in message
news:548cf7ae-992a-4a32-9a1c-ba868b0dfdef(a)g21g2000yqk.googlegroups.com...
On May 13, 7:44 pm, "nedbrek" <nedb...(a)yahoo.com> wrote:
>> When we switched to x86, we had similar problems for different reasons. The
>> x86 guys were (and probably still are) risk averse. They have a set beat
>> pattern to hit, and cannot afford to miss. Minor, incremental changes were
>> possible - but even those are hard to sell.
>
> Except for the architectural misstep down the P4 direction
> and then a retreat back to Pentium Pro microarchitecture, has there
> been anything other than architectural refinement? More cache, new
> bus/interconnect, more prediction, better decoding, tweak the memory
> and I/Os; and yet the basic infrastructure of PP survives to this
> day.

The P4 was a major failure. The question is, having failed, do we fix what
is wrong, or throw everything away and restore from backups? I believe P4
could be done right (trace cache, high frequency).

> This evolution was "hard to sell"? even considering the 50M/100M per
> year rates of selling them?

There are a lot more (relatively simple) changes that can be done to P6 (P6
< P4, how did we get here). Ask Andy, I'm sure he can give you 100. Even
if 90% don't work out, it could be a lot different.

>> This creates an interesting argument:
>> What killed uarch research/development?
>> 1) Pentium 4
>> 2) Itanium
>> 3) Collapse of the ~2000 Internet bubble
>> 4) No killer apps to use up perf
>> 5) Other? (crazy conspiracy theories can go here)
>
> Other: We have exploited all the real architecture invented in 1959
> (Stretch), 1962 (6600), 1965 (360/91), and 1967 (360/85) to their
> natural evolutionary optimal implementations (i.e. dead ends). To this
> we invented branch prediction (although vestiges existed as early as
> 1967-8 (7600)), and a myriad of bells and whistles to nickel and dime
> ourselves to where we are today.

Maybe you're right. It just feels so wrong.

> In my opinion, the way forward in the big-computer realm is threads,
> yet one cannot exploit threads with current languages (memory models
> in particular), our current synchronization means (and the memory
> traffic it entails), and perhaps some departure from the vonNeumann
> model itself (only one thing is happening at once on a per thread
> basis).

I've always fought against threading. Uarchitects are few in number,
programmers are large in number. It is our job to make them more efficient,
not to demand they conform to us. But that doesn't change reality...

> In my opinion, the way forward in the low-power realm is also threads.
> Here the great big OoO machine microarchitectures burn more power than
> they deliver performance. Yet evolving backwards down from the big OoO
> machines is not possible while benchmarks remain monothreaded, even
> though smaller, simpler CPUs deliver more performance per watt and more
> performance per unit die area. Yet, one does not have to evolve back "all
> that far" to achieve a much better balance between performance and
> performance/watt. However, I have found this a hard sell. None of the
> problems mentioned above get any easier, in fact they become more
> acute as you end up threading more things.

We can take (have taken) a step (or two) back, but won't we just take that
step back? ARM A9 is (small) OOO. There have to be people looking at
larger OOO.

> Thus, I conclude that:
> 6) running out of space to evolve killed off microarchitectural
> innovation.

Are you saying:
1: no improvement is possible
2: the improvements aren't worth the cost (power, or design complexity
[which translates to risk])

> {And with the general caveat that no company actually does
> architectural or microarchitectural research, each does development
> based on short-medium term goals. Research happens in-the-large as
> various companies show their wares and various competitors attempt to
> incorporate or advance their adversary's developments. Much like
> biological evolution.}

Perhaps. But it can be shown that even academics (and you can lump groups
like MRL into "academics") are not doing uarch research anymore (compare the
proceedings of MICRO or ISCA from 2003 and 2009).

Ned


From: Robert Myers on
On May 14, 8:05 am, "nedbrek" <nedb...(a)yahoo.com> wrote:

>
> Yes, multicore is the new bandwagon.  P4 pushed the frequency pendulum too
> far, and now we've overreacted.
>
> The ironic thing (which we demonstrated, and which made us hugely
> unpopular) is that massive many-core burns just as much power as (or more
> than) a smart OOO on anything but grossly parallel applications.
>

Performance is hard to measure and hard to sell.

Someone at Intel (Andy Grove?) said something like, "It's the
Megahertz they buy." Thus, Netburst, with frequency being the first
priority and performance only a second.

Physics intervened, and Intel needed something else to sell more of to
customers. The "more" is now "more cores."

The competing "more" is more battery life.

Since "everything" is now x86, IPC becomes an interesting metric. You
need to get someone (and probably Intel) interested in selling more
IPC.

Robert.

From: Andy 'Krazy' Glew on
On 5/14/2010 5:31 AM, nedbrek wrote:

> We can take (have taken) a step (or two) back, but won't we just take that
> step back? ARM A9 is (small) OOO. There have to be people looking at
> larger OOO.

I think of it as a rising sawtooth wave. Wrt OOO, with a long period; there are shorter period sawtooth waves all over
the place, for easier things like cache size.

It seems highly likely that OOO is coming back in the really small processor for smart phone space. ARM Cortex A9, etc.

If the P6 generation of OOO were the N-th generation of OOO (for N=2, maybe 3), probably, this N+1-st, smart-phone,
generation of OOO will eat up all of the good ideas of the previous generation, and then make some small advancements.
And then the wheel of reincarnation will turn again, e.g. there will be a power wall as we try to get to contact-lens
PCs, and the sawtooth will repeat itself.


> Perhaps. But it can be shown that even academics (and you can lump groups
> like MRL into "academics") are not doing uarch research anymore (compare the
> proceedings of MICRO or ISCA from 2003 and 2009).

As one of the founders of Intel MRL, I blame myself and others. MRL drifted into incremental research - tweaking a
branch predictor here, using a technique that makes OOO more relevant to higher clock frequencies there, etc. It did
not do any fundamental new work. The kiss of death came when MRL's uarch work was led by an academic who tried to
please the powers that be by making MRL relevant to Itanium research.

Except for Akkary's DMT, and to a much lesser extent my SpMT work - both efforts which did not originate at MRL, but
which originated before MRL. I think that DMT/SpMT were "fundamental" in that they attack a fundamental scaling problem
of OOO execution, the one you identified in your earlier post, Ned: you can't make an OOO much bigger if you can't feed
it from more than a single place. It may not be everyone's idea of fundamental, but it has a certain theoretical,
analytical basis: it is necessarily true in extrapolation.

There could certainly be other fundamental work. E.g. making threading much better - although I have ideas, I am not so
sure what the fundamental work is, but I'm willing to listen. Fundamental limits in power and energy efficiency, such
as those pointed to by Feynman - how can stuff like adiabatic and charge recovery be made relevant to Intel? The
opportunity to work with Intel's CRL, Circuits Research Lab, was somewhat lost.

Don't get me wrong: MRL was reasonably successful as an academic research group. When I was at AMD, many of the other
AMD CPU architects commented that most of the most interesting work in ISCA, ASPLOS, etc., was coming from Intel.

But: when have academics ever really done advanced microarchitecture research? Where did OOO come from: well,
actually, Yale Patt and his students kept OOO alive with HPSm, and advanced it. But it certainly was not a popular
academic field; it was a single research group. The hand-me-down story is that OOO was discriminated against for
years, and had to publish at second tier conferences.

The next set of uarch ideas are probably in that sort of half-hidden state in academia, gestating in some crank
student's mind, while the mass of academia and industry follow "common sense" trends to exhaustion.

From: Andy 'Krazy' Glew on
On 5/14/2010 5:05 AM, nedbrek wrote:

>> * therefore, to take advantage of a large instruction window for a logically
>> single threaded program, one must supply instructions from
>> multiple points in that program, multiple sequencers. => multiple threads
>> within the logical single threaded program.
>>
>> Either SpMT, or some other way of exploiting control independence. SpMT is
>> rather coarse grained; I suspect that the next step after
>> SpMT would be something like static dataflow.
>
> I grew up under Yale Patt, with "10 IPC on gcc". A lot of people thought it
> was possible, without multiple IPs.

I keep forgetting that you are one of Yale's students.

Yale is my academic grandfather, since Hwu was my MS advisor.

---

Anyway: I love Yale, but I differ here.

"IPC", instructions per clock, is much less fundamental than OOO instruction window size - the difference between the
oldest and youngest instruction executing at any time. Much less fundamental than MLP, the number of cache misses
*outstanding* at any time.

Clock speed (frequency) and IPC are fungible. At least they were in the Willamette timeframe, which was all about
increasing clock speed, and reducing the IPC.
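
A minimal sketch of the arithmetic behind that fungibility, using made-up design points (the numbers are assumptions for illustration, not measurements of Willamette or any other part): delivered throughput is IPC times clock frequency, so a narrower machine at a higher clock can match a wider machine at a lower clock.

```python
def instructions_per_second(ipc, freq_hz):
    """Throughput = instructions per clock * clocks per second."""
    return ipc * freq_hz

# Hypothetical design points, chosen only to illustrate the trade-off.
wide_slow = instructions_per_second(ipc=2.0, freq_hz=1.0e9)
narrow_fast = instructions_per_second(ipc=1.0, freq_hz=2.0e9)

# Same delivered throughput despite a 2x difference in IPC.
print(wide_slow, narrow_fast)
```

Comparing IPC alone between two such designs says nothing about which one finishes the program first.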

Hmm, I think I am just realizing that we need different metrics, with different acronyms. I want to express the number
of outstanding operations. IPC is not a measure of ILP. OOO window size is extreme. A lower number is the number of
instructions simultaneously in some stage of execution; more precisely, simultaneously at the same stage of execution.

"SIX"?: simultaneous instructions in execution? "SIF"?: ... in flight? "SMF"?: simultaneous memory operations in
flight?
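
A rough sketch of how such an "in flight" count could be computed and how it differs from IPC. Everything here is an assumption for illustration: the trace format (one `(start_cycle, end_cycle)` pair per operation, end exclusive), the function name `in_flight_stats`, and the toy traces. The relation it checks is Little's law: average operations in flight = throughput x average latency.

```python
def in_flight_stats(intervals):
    """intervals: list of (start_cycle, end_cycle) per operation, end exclusive.

    Returns (ipc, average_in_flight, peak_in_flight), measured over the
    span from the first start to the last end.
    """
    first = min(start for start, _ in intervals)
    last = max(end for _, end in intervals)
    cycles = last - first
    occupancy = [0] * cycles
    for start, end in intervals:
        for cycle in range(start, end):
            occupancy[cycle - first] += 1
    ipc = len(intervals) / cycles
    return ipc, sum(occupancy) / cycles, max(occupancy)

# Toy traces: four 100-cycle operations (think cache misses).
serialized = [(0, 100), (100, 200), (200, 300), (300, 400)]
overlapped = [(0, 100), (1, 101), (2, 102), (3, 103)]

for name, trace in (("serialized", serialized), ("overlapped", overlapped)):
    ipc, avg, peak = in_flight_stats(trace)
    print(name, "ipc=%.4f avg_in_flight=%.2f peak=%d" % (ipc, avg, peak))
```

In both toy traces IPC stays far below 1, yet the overlapped trace keeps up to four operations outstanding and finishes in roughly a quarter of the time, which is the sense in which IPC is not a measure of ILP or MLP.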

I once heard a famous OOO professor diss a paper submission that was about a neat trick to make 2-wide superscalar run
faster, saying that we should reject all papers not 16-wide. Since I knew that Willamette was effectively quadrupling
the clock frequency, I felt that even narrow OOO was interesting. Not fundamental, but interesting. While IPC is not
fundamental either.