From: MitchAlsup on
On May 14, 7:31 am, "nedbrek" <nedb...(a)yahoo.com> wrote:
> Hello,
>
> "MitchAlsup" <MitchAl...(a)aol.com> wrote in message
> > In my opinion, the way forward in the low-power realm is also threads.
> > Here the great big OoO machine microarchitectures burn more power
> > than they deliver in performance. Yet evolving backwards down from
> > the Big OoO machines is not possible while benchmarks remain
> > monothreaded, even though smaller, simpler CPUs deliver more
> > performance per watt and more performance per unit die area. Yet
> > one does not have to evolve back "all that
> > far" to achieve a much better balance between performance and
> > performance/watt. However, I have found this a hard sell. None of the
> > problems mentioned above get any easier, in fact they become more
> > acute as you end up threading more things.
> > Thus, I conclude that:
> > 6) running out of space to evolve killed off microarchitectural
> > innovation.
>
> Are you saying:
> 1: no improvement is possible
> 2: the improvements aren't worth the cost (power, or design complexity
> [which translates to risk])

I am saying that the way forward will necessarily take a step
backwards to shallower pipes, narrower windows, and less overall
complexity in order to surmount (or better optimize for) the power
wall.

I am also implying that, as long as we measure the utility of
machines with the current sets of benchmarks, there will be a very
great aversion to redoing a microarchitecture that ends up with lower
IPC, even if the new microarchitecture would solve a large majority of
the power issues of current microarchitectures.

There is risk in taking a few steps backwards, but there is reward in
repositioning a microarchitecture so that it is tuned for the
performance-per-watt end of the spectrum. And I think it is unlikely
that we can evolve from the Great Big OoO machines to the medium OoO
(or partially ordered) machines the battery-powered world wants/needs.
If the x86 people are not going to figure out how to get into the 10mW-
to-50mW power envelope with decent performance, then they are leaving
a very big vacuum in which new architectures and microarchitectures
will find fruitful hunting grounds. I sure would like to be on a team
that did put x86s into this kind of power spectrum, and I think one
could get a good deal of the performance of a modern low power (5W)
laptop into that power range, given that as the overriding goal of the
project.
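
To put rough numbers on that claim, here is a toy C model of dynamic
power under voltage/frequency scaling. The 5 W baseline, the
assumption that supply voltage tracks frequency (so P ~ f^3), and the
neglect of leakage are all my own simplifications, not measurements:

    #include <stdio.h>

    /* Toy model: dynamic power P ~ C * V^2 * f.  Assume V scales
       linearly with f over the DVFS range, so P ~ f^3.  All numbers
       here are illustrative assumptions, not measurements. */
    int main(void)
    {
        double base_perf  = 1.0;  /* normalized performance at full clock */
        double base_power = 5.0;  /* watts: a low-power laptop part       */
        double s;                 /* frequency (and voltage) scale factor */

        for (s = 1.0; s > 0.1; s -= 0.2) {
            double perf  = base_perf * s;           /* perf tracks f */
            double power = base_power * s * s * s;  /* P ~ f^3       */
            printf("f x%.1f: perf %.2f, power %4.0f mW, perf/W %5.1f\n",
                   s, perf, power * 1000.0, perf / power);
        }
        return 0;
    }

In this toy model, a fifth of the clock at a fifth of the voltage
drops the 5 W part to about 40 mW - right in the envelope above - at
25x the perf/watt. Real silicon does worse (leakage, voltage floors,
uncore power), which is exactly why getting there takes a redesign
rather than just turning the DVFS knobs down.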

Mitch
From: nmm1 on
In article <26c1c35a-d687-4bc7-82fd-0eef2df0f714(a)c7g2000vbc.googlegroups.com>,
MitchAlsup <MitchAlsup(a)aol.com> wrote:
>
>I am saying that the way forward will necessarily take a step
>backwards to shallower pipes, narrower windows, and less overall
>complexity in order to surmount (or better optimize for) the power
>wall.

It's also better for RAS and parallelism!

>I am also implying that, as long as we measure the utility of
>machines with the current sets of benchmarks, there will be a very
>great aversion to redoing a microarchitecture that ends up with lower
>IPC, even if the new microarchitecture would solve a large majority
>of the power issues of current microarchitectures.

Unfortunately :-(

>There is risk in taking a few steps backwards, but there is reward in
>repositioning a microarchitecture so that it is tuned for the
>performance-per-watt end of the spectrum. And I think it is unlikely
>that we can evolve from the Great Big OoO machines to the medium OoO
>(or partially ordered) machines the battery-powered world wants/needs.

Agreed.


Regards,
Nick Maclaren.
From: Andy 'Krazy' Glew on
On 5/14/2010 12:12 PM, MitchAlsup wrote:
> If the x86 people are not going to figure out how to get into the 10mW-
> to-50mW power envelope with decent performance, then they are leaving
> a very big vacuum in which new architectures and microarchitectures
> will find fruitful hunting grounds. I sure would like to be on a team
> that did put x86s into this kind of power spectrum, and I think one
> could get a good deal of the performance of a modern low power (5W)
> laptop into that power range, given that as the overriding goal of the
> project.

I'd like to be on a team that put ARM into that power range.

Ooops, it's already there. And Intel isn't. Yet.

But there is a window of opportunity for non-x86 right now.
From: nedbrek on
Hello all,

"Andy 'Krazy' Glew" <ag-news(a)patten-glew.net> wrote in message
news:4BED72E4.2040102(a)patten-glew.net...
> On 5/14/2010 5:05 AM, nedbrek wrote:
>> I grew up under Yale Patt, with "10 IPC on gcc". A lot of people
>> thought it was possible, without multiple IPs.
>
> I keep forgetting that you are one of Yale's students.
>
> Yale is my academic grandfather, since Hwu was my MS advisor.

Student is overly generous. "Fan-boy" would better describe the
relationship :) I had Superscalar Processors (18-545) with John Shen as an
undergraduate, but that was about it. But, I have always been a processor
junkie, and at Intel I was able to hook up with Professor Shen again. It
was great! (but painful)

> Anyway: I love Yale, but I differ here.
>
> "IPC", instructions per clock, is much lss fundamental than OOO
> instruction
> window size - the difference between the oldest and youngest instruction
> executing at any time. Much less fundamental than MLP, the number of
> cache
> misses *outstanding* at any time.
>
> Clock speed (frequency) and IPC are fungible. At least they were in
> the Willamette timeframe, which was all about increasing clock speed
> and reducing IPC.
>
> Hmm, I think I am just realizing that we need different metrics, with
> different acronyms. I want to express the number of outstanding
> operations. IPC is not a measure of ILP. OOO window size is the
> extreme; a lower number is the number of instructions simultaneously
> in some stage of execution; more precisely, simultaneously at the
> same stage of execution.
>
> "SIX"?: simultaneous instructions in execution? "SIF"?: ... in flight?
> "SMF"?: simultaneous memory operations in flight?

You're right, but IPC is easiest to measure! :) For similar sorts of
machines, IPC is a pretty good indicator of ILP (which is somewhat vaguely
related to MLP). For example, bzip always has higher IPC than gcc - largely
because of the ILP and MLP available. Similarly on the sorts of "limit
study" machines which gave 10 IPC for gcc (perfect L1 cache, perfect branch
prediction, etc).
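
For concreteness, here is a minimal sketch in C of the kind of metric
Andy is asking for, computed from a trace of miss lifetimes. The five
(issue, fill) intervals and all the constants are invented; the metric
is average misses outstanding over the cycles with at least one miss
in flight, one common way to quantify MLP:

    #include <stdio.h>

    /* Sketch: compute average misses *outstanding* (MLP) from a trace
       of (issue cycle, fill cycle) intervals.  The trace is made up. */
    struct miss { long start, end; };

    int main(void)
    {
        struct miss trace[] = {
            {  0, 200}, { 10, 210}, { 15, 215},  /* three overlapping */
            {400, 600}, {410, 610},              /* two overlapping   */
        };
        int  n = sizeof trace / sizeof trace[0];
        long miss_cycles = 0;  /* sum of per-miss latencies         */
        long busy = 0;         /* cycles with >= 1 miss outstanding */
        long c, horizon = 0;
        int  i;

        for (i = 0; i < n; i++) {
            miss_cycles += trace[i].end - trace[i].start;
            if (trace[i].end > horizon)
                horizon = trace[i].end;
        }
        for (c = 0; c < horizon; c++)     /* O(cycles * misses): a toy */
            for (i = 0; i < n; i++)
                if (trace[i].start <= c && c < trace[i].end) {
                    busy++;
                    break;
                }
        printf("avg misses outstanding while missing: %.2f\n",
               (double)miss_cycles / busy);  /* 1000 / 425 ~= 2.35 */
        return 0;
    }

The same sweep over instruction dispatch/retire intervals would give
something like Andy's "SIF" number instead.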

Thinking back, it was never really clear how we were supposed to hit 10 IPC.
It was described as the Holy Grail. I think that analogy is more accurate
than we imagined. The idea was: build it and it will come. Just keep making
the machine bigger and wider, and you'd get there. Throw in more widgets
(trace cache, value prediction), throw in other widgets to make it
implementable (clustering). It was more about the journey than any real
plan.

> I once heard a famous OOO professor diss a paper submission that was
> about a neat trick to make a 2-wide superscalar run faster, saying
> that we should reject all papers not about 16-wide machines. Since I
> knew that Willamette was effectively quadrupling the clock frequency,
> I felt that even narrow OOO was interesting. Not fundamental, but
> interesting. IPC is not fundamental either.

Hey! Peer review rants are a totally different subject! Ah, the stories I
could tell!

Ned


From: nedbrek on
Hello all,

"Andy 'Krazy' Glew" <ag-news(a)patten-glew.net> wrote in message
news:4BED6F7D.9040001(a)patten-glew.net...
> On 5/14/2010 5:31 AM, nedbrek wrote:
>
> And then the wheel of reincarnation will turn again, e.g. there will be a
> power
> wall as we try to get to contact-lens PCs, and the sawtooth will repeat
> itself.

Ow, ow, ow. You are hurting my head.

>> Perhaps. But it can be shown that even academics (and you can lump
>> groups like MRL into "academics") are not doing uarch research
>> anymore (compare the proceedings of MICRO or ISCA from 2003 and 2009).
>
> But: when have academics ever really done advanced microarchitecture
> research? Where did OOO come from? Well, actually, Yale Patt and his
> students kept OOO alive with HPSm, and advanced it. But it certainly
> was not a popular academic field; it was a single research group. The
> hand-me-down stories are that OOO was discriminated against for
> years, and had to publish at second-tier conferences.

I know CMU worked on a uop scheme for VAX. Do you know if Bob Colwell was
exposed to that while doing his graduate work? That would seem like a
pretty big boost from academia... I can imagine this work getting panned in
the journals.

Jim Smith did 2-bit branch predictors, Wikipedia says while at CDC. I'm
so used to him being an academic... Yeh and Patt take credit for the
gshare scheme, although that looks like independent invention.
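
For anyone who hasn't seen them, both schemes fit in a few lines of C.
This is a toy, not anyone's shipping design; the table size, history
length, and branch stream are arbitrary choices of mine:

    #include <stdio.h>

    /* Toy versions of the predictors above: Smith's 2-bit saturating
       counters, indexed gshare-style by PC XOR global history. */
    #define TABLE_BITS 12
    #define TABLE_SIZE (1 << TABLE_BITS)

    static unsigned char ctr[TABLE_SIZE];  /* 2-bit counters, start at 0 */
    static unsigned ghist;                 /* global history register    */

    static unsigned index_of(unsigned pc)
    {
        return (pc ^ ghist) & (TABLE_SIZE - 1);   /* the gshare hash */
    }

    static int predict(unsigned pc)
    {
        return ctr[index_of(pc)] >= 2;            /* MSB set => taken */
    }

    static void update(unsigned pc, int taken)
    {
        unsigned i = index_of(pc);
        if (taken  && ctr[i] < 3) ctr[i]++;       /* saturate at 3 */
        if (!taken && ctr[i] > 0) ctr[i]--;       /* saturate at 0 */
        ghist = ((ghist << 1) | (unsigned)taken) & (TABLE_SIZE - 1);
    }

    int main(void)
    {
        unsigned pc = 0x400123;           /* one loop-closing branch */
        int i, hits = 0, total = 1000;

        for (i = 0; i < total; i++) {
            int taken = (i % 10) != 9;    /* taken 9 of every 10 */
            hits += predict(pc) == taken;
            update(pc, taken);
        }
        printf("accuracy: %.1f%%\n", 100.0 * hits / total);
        return 0;
    }

With history in the index, the table learns the period-10 pattern and
accuracy goes well above the ~90% a plain 2-bit counter gets on this
branch; dropping the "^ ghist" turns it back into Smith's scheme.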

Rotenberg gets credit for trace cache. I don't think we can blame the
failure of the P4 trace cache on him... It seemed like pretty popular stuff
in most circles.

Transactional memory was probably best described as a mix of academic and
industry research (MRL).

But you are mostly right. Academics are no more immune to bandwagon fever
than industry. It is worse, because academics are not constrained by
economics or even (depending on their models) physics.

Ned