From: Robert Myers on
On Apr 20, 10:11 pm, "Andy \"Krazy\" Glew" <ag-n...(a)patten-glew.net>
wrote:

>
> Register windows: questionable value.
>
> However, multiple register contexts and variable number of registers accessible - the GPUs seem to be demonstrating good
> value. Per thread.

It seemed intuitively obvious to me that you couldn't really optimize
across procedure boundaries the way that Itanium needed to if you were
forever saving and restoring registers.

Maybe my intuition is wrong.

The cost of the design (lots of registers, lots of complexity) is the
part that I not only can't see but also can't even estimate.

Robert.
From: nmm1 on
In article <bb5bef64-00c3-46be-8639-28efc338a8f6(a)k41g2000yqf.googlegroups.com>,
Robert Myers <rbmyersusa(a)gmail.com> wrote:
>On Apr 23, 3:06 pm, Rick Jones <rick.jon...(a)hp.com> wrote:
>> Anton Ertl <an...(a)mips.complang.tuwien.ac.at> wrote:
>> > A more frequent situation in practice is probably when the compiler
>> > does not know what will happen at run-time; but that's irrelevant,
>> > because it does not happen with SPEC CPU, because that uses profile
>> > feedback.
>>
>> SPECcpu2006 explicitly disallows PBO in base and only allows it in
>> peak. That was a change from SPECcpu2000, which allowed PBO in both.
>>
>That just forces you to design the compiler around the benchmarks.

An apocryphal story is that one compiler (back in the days when Linpack
ruled) checked for the code being Linpack, and replaced it with some
hand-tuned assembler. The rules were changed to forbid that!

>A realistic set of rules for profile-based optimization would ask: how
>much predictability does this code have in practice and how well do
>the compiler and processor exploit it?
>
>Hard to answer such a question in the world of what Nick calls
>benchmarketing.

I didn't invent the term. It goes back to at least 1989.


Regards,
Nick Maclaren.
From: Robert Myers on
On Apr 22, 7:30 am, "nedbrek" <nedb...(a)yahoo.com> wrote:
> Hello,
>
> "Stefan Monnier" <monn...(a)iro.umontreal.ca> wrote in message
>
> news:jwvljcg4z20.fsf-monnier+comp.arch(a)gnu.org...
>
> >> One of the nice things about register rotation is that it almost removes
> >> the need for the compiler to make a decision: optimize this loop by
> >> unrolling it and software pipelining it, or not? Register rotation makes
> >> that optimization decision much simpler.
>
> > But does this warrant support in the architecture?  My understanding is
> > that this can only be applied to loops where software pipelining can be
> > used, and these tend to be fairly short anyway, right? So unrolling them
> > a little and adding some startup/cleanup shouldn't be too costly (as
> > long as you have enough registers).
>
> The irony in Itanium was that the compiler would only use software
> pipelining in floating point code (i.e. short code segments).  I think the
> memcpy in libc used it too.  That accounted for the only times I saw it in
> integer code.

Sifting through the comments, and especially yours, I wonder if a
candidate pair of smoking guns is that the visible register set was
too large and/or that the register stack engine never worked the way
it was supposed to (perhaps--and I sure don't know--because of
problems with Microsoft software).

Having so many visible registers had to have increased the complexity
of so many things, one of which, the ALAT, you mentioned in another
post.

If the RSE didn't really work the way it was supposed to, then there
would have been a fairly big downside to aggressive use of a large
number of registers in any given procedure, thus limiting software
pipelining to short loops.
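
For anyone who wants the register-rotation trade-off from the quoted
text in concrete form, here is a toy sketch in Python. It is not Itanium
code and not how any real compiler does it: the three-stage loop body
(load, multiply, add-and-store), the RotatingRegs class, and both
function names are made up for illustration, and stage predicates are
modelled with plain index checks. The first version needs an explicit
prologue and epilogue around the kernel; the second runs one kernel body
and lets a rotating register base do the renaming.

```python
class RotatingRegs:
    """Hypothetical rotating register file: logical r maps to (r + rrb) % size."""

    def __init__(self, size):
        self.phys = [None] * size
        self.rrb = 0  # rotating register base

    def __getitem__(self, logical):
        return self.phys[(logical + self.rrb) % len(self.phys)]

    def __setitem__(self, logical, value):
        self.phys[(logical + self.rrb) % len(self.phys)] = value

    def rotate(self):
        # After rotation, the value written as r0 is visible as r1, and so on.
        self.rrb = (self.rrb - 1) % len(self.phys)


def pipelined_explicit(a):
    """Software pipelining by hand: prologue, kernel, epilogue."""
    n = len(a)
    if n < 2:
        return [x * 2 + 1 for x in a]  # too short to pipeline
    out = [None] * n
    # Prologue: fill the pipeline.
    x = a[0]
    y = x * 2
    x = a[1]
    # Kernel: one iteration each of store, multiply, and load per trip.
    for i in range(2, n):
        out[i - 2] = y + 1
        y = x * 2
        x = a[i]
    # Epilogue: drain the pipeline.
    out[n - 2] = y + 1
    y = x * 2
    out[n - 1] = y + 1
    return out


def pipelined_rotating(a):
    """Same schedule, but a single kernel body plus register rotation."""
    n = len(a)
    out = [None] * n
    r = RotatingRegs(4)
    for t in range(n + 2):  # n iterations plus 2 extra trips to drain
        if 0 <= t < n:          # stage 0: load
            r[0] = a[t]
        if 0 <= t - 1 < n:      # stage 1: multiply last trip's load
            r[2] = r[1] * 2
        if 0 <= t - 2 < n:      # stage 2: finish and store
            out[t - 2] = r[3] + 1
        r.rotate()
    return out
```

Both functions return the same result as the plain loop
[x * 2 + 1 for x in a]. The point of the comparison is only that the
rotating version has no separate startup or cleanup code to size and
schedule, which is the decision the quoted text says rotation simplifies.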

Robert.

From: Stephen Fuld on
On 4/22/2010 4:25 AM, nedbrek wrote:
> Hello,
>
> "Brett Davis"<ggtgp(a)yahoo.com> wrote in message
> news:ggtgp-314A9F.21503721042010(a)news.isp.giganews.com...
>> In article<hqmoq8$5mt$1(a)news.eternal-september.org>,
>>
>> ALAT:
>> http://en.wikipedia.org/wiki/Advanced_Load_Address_Table
>>
>> So this was a special sidecar cache that held 32 long words?
>
> An ALAT load was a real load. It would also place the load address into a
> table. This table had to be snooped by every store (also, bus stores,
> IIRC). A store would set the invalid bit. Then, the check would check the
> bit and redo the load (or chk.a would branch). Also, if the entry
> disappeared (because of LRU or a table flush) it was treated as invalid.

Yes. I thought that this was intended to counter C's aliasing problems.
The compiler could assume no aliasing and rely on the ALAT to detect
any aliasing. But my recollection may be wrong. :-(
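
A rough way to picture the mechanism described in the quoted text is the
toy model below, in Python. It is only a sketch of the behaviour as
stated in this thread (a real load that records its address, stores that
snoop the table and invalidate matches, a check that redoes the load if
the entry is invalid or gone). The class name, method names, and the
default capacity of 32 are assumptions for illustration, and eviction of
the oldest entry stands in for whatever replacement policy the hardware
actually used.

```python
from collections import OrderedDict


class ToyALAT:
    """Toy model of an advanced-load address table (not real hardware)."""

    def __init__(self, capacity=32):  # capacity is an assumed value
        self.capacity = capacity
        # Maps target register name -> [address, valid]; insertion order = age.
        self.entries = OrderedDict()

    def advanced_load(self, memory, reg, addr):
        """Do a real load and record its address under `reg`."""
        if reg in self.entries:
            del self.entries[reg]
        elif len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)  # evict the oldest entry
        self.entries[reg] = [addr, True]
        return memory[addr]

    def store(self, memory, addr, value):
        """Every store snoops the table and invalidates matching entries."""
        memory[addr] = value
        for entry in self.entries.values():
            if entry[0] == addr:
                entry[1] = False

    def check_load(self, memory, reg, addr, speculative_value):
        """Return (value, reloaded); a missing or invalid entry forces a reload."""
        entry = self.entries.get(reg)
        if entry is not None and entry[0] == addr and entry[1]:
            return speculative_value, False
        return memory[addr], True


memory = {0x100: 1, 0x200: 2}
alat = ToyALAT()
early = alat.advanced_load(memory, "r5", 0x100)         # load hoisted above stores
alat.store(memory, 0x200, 9)                            # different address
no_alias = alat.check_load(memory, "r5", 0x100, early)  # (1, False)
alat.store(memory, 0x100, 7)                            # same address: alias
aliased = alat.check_load(memory, "r5", 0x100, early)   # (7, True)
```

In this picture the compiler hoists the load above stores it cannot
prove are independent, and the check at the original load position
costs nothing extra unless a store actually hit the same address or the
entry was lost.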


--
- Stephen Fuld
(e-mail address disguised to prevent spam)
From: MitchAlsup on
On Apr 21, 1:29 am, an...(a)mips.complang.tuwien.ac.at (Anton Ertl)
wrote:
<big snip>
> Meanwhile, IBM showed with Power6 in 2007 that in-order processors can
> be clocked higher.  

Seymour Cray (along with Jim Thornton) also showed that latch-based
pipelines (as opposed to register-based pipelines) and scoreboarded
OoO microarchitectures could be clocked much faster than reservation
station OoO microarchitectures.

Indeed, leaving aside the latch versus register argument for now: a
scoreboard can be fundamentally cycled in approximately 5-6 gate
delays, whereas I know of no reservation station that can be clocked
faster than 12 gate delays.

Mitch