From: Terje Mathisen "terje.mathisen at on
Brett Davis wrote:
> I assume it had some sort of redeeming benefit, like a load-to-use
> delay of fewer cycles?
> In fairy dreamland it could load into the register by itself, and
> turn the check instruction into a NOP, saving a cycle. ;)
>
> Versus just using a cache prefetch and ordinary load?

The way I looked upon it, it _was_ a cache prefetch, and it did need a
real load to follow it.

The main difference was in the naming of those two operations, and the
fact that the prefetch specified the target register of the load.

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
From: Anton Ertl on
Robert Myers <rbmyersusa(a)gmail.com> writes:
>On Apr 22, 7:30 am, "nedbrek" <nedb...(a)yahoo.com> wrote:
>> The irony in Itanium was that the compiler would only use software
>> pipelining in floating point code (i.e. short code segments). I think the
>> memcpy in libc used it too. That accounted for the only times I saw it in
>> integer code.

Does it have rotation for integer registers?

>Sifting through the comments, and especially yours, I wonder if a
>candidate pair of smoking guns

Smoking guns for what?

>is that the visible register set was
>too large and/or that the register stack engine never worked the way
>it was supposed to (perhaps--and I sure don't know--because of
>problems with Microsoft software).

Problems with Microsoft software should be irrelevant on non-Microsoft
platforms.

IIRC I read about the hardware for transparent register stack engine
operation not working, requiring a fallback to exception-driven
software spilling and refilling. That would not be a big problem on
most workloads. AFAIK SPARC and AMD29k have always used
exception-driven software spilling and refilling.

>If the RSE didn't really work the way it was supposed to, then there
>would have been a fairly big downside to aggressive use of a large
>number of registers in any given procedure, thus limiting software
>pipelining to short loops.

Not really, because software pipelining is beneficial mainly for inner
loops with many iterations; if you have that, then any register
spilling and refilling overhead is amortized over many executed
instructions. Of course, all of this depends on the compiler being
able to predict which loops have many iterations. But this is no
problem for SPEC CPU, which uses profile feedback; and of course, SPEC
CPU performance is what's relevant.
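
To put a rough number on the amortization argument, here is a minimal sketch. The function name and all cycle counts are made up for illustration; they are not measurements from any Itanium implementation.

```python
def overhead_fraction(spill_refill_cycles, iterations, cycles_per_iteration):
    """Fraction of total loop time spent on register spilling and refilling.

    All inputs are hypothetical; the point is only how the ratio scales
    with the iteration count.
    """
    loop_cycles = iterations * cycles_per_iteration
    return spill_refill_cycles / (spill_refill_cycles + loop_cycles)


# Assumed costs: 200 cycles of spill/refill, 4 cycles per loop iteration.
short_loop = overhead_fraction(200, 10, 4)      # few iterations
long_loop = overhead_fraction(200, 10_000, 4)   # many iterations
```

With these assumed numbers the short loop spends most of its time spilling and refilling, while for the long loop the overhead falls below one percent.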

- anton
--
M. Anton Ertl Some things have to be seen to be believed
anton(a)mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html
From: nedbrek on
Hello,

"Brett Davis" <ggtgp(a)yahoo.com> wrote in message
news:ggtgp-314A9F.21503721042010(a)news.isp.giganews.com...
> In article <hqmoq8$5mt$1(a)news.eternal-september.org>,
>
> ALAT:
> http://en.wikipedia.org/wiki/Advanced_Load_Address_Table
>
> So this was a special sidecar cache that held 32 long words?

An ALAT load was a real load. It also placed the load address into a
table. This table had to be snooped by every store (including bus stores,
IIRC). A matching store would set the entry's invalid bit. The check would
then test that bit and redo the load (or chk.a would branch to recovery
code). Also, if the entry disappeared (because of LRU or a table flush) it
was treated as invalid.
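
The mechanism can be sketched as a toy model. This is not how any real implementation was built: the class and method names are invented, the table is keyed by target register, and eviction drops the oldest entry as a stand-in for LRU.

```python
class Alat:
    """Toy advanced load address table (assumed layout, not the real design)."""

    def __init__(self, capacity=32):
        self.capacity = capacity
        self.entries = {}  # target register -> [address, valid]

    def advanced_load(self, memory, reg, addr):
        """Do a real load and record (reg, addr) in the table."""
        self.entries.pop(reg, None)
        if len(self.entries) >= self.capacity:
            # Evict the oldest entry; a missing entry later counts as invalid.
            del self.entries[next(iter(self.entries))]
        self.entries[reg] = [addr, True]
        return memory[addr]

    def store(self, memory, addr, value):
        """Every store snoops the table and invalidates matching entries."""
        memory[addr] = value
        for entry in self.entries.values():
            if entry[0] == addr:
                entry[1] = False

    def check(self, memory, reg, addr, value):
        """Keep the early value if the entry survived, else redo the load."""
        entry = self.entries.get(reg)
        if entry is not None and entry[1] and entry[0] == addr:
            return value, False
        return memory[addr], True


memory = {0x100: 1, 0x200: 2}
alat = Alat()

r5 = alat.advanced_load(memory, "r5", 0x100)  # load boosted above a store
alat.store(memory, 0x200, 99)                 # different address: still valid
r5, reloaded_r5 = alat.check(memory, "r5", 0x100, r5)

r6 = alat.advanced_load(memory, "r6", 0x200)
alat.store(memory, 0x200, 7)                  # same address: entry invalidated
r6, reloaded_r6 = alat.check(memory, "r6", 0x200, r6)
```

In the first case the check keeps the early value; in the second the store hit the same address, so the check redoes the load and picks up the new value.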

> From a hardware design point dealing with all the special cases would
> make that a disaster. (Grep the Itanic manual to see what I mean.)

The biggest problem was building this huge table (potentially one entry for
every register), and snooping it with all your store traffic - power and
area. All to get some tiny performance increase.

> I assume it had some sort of redeeming benefit, like a load-to-use
> delay of fewer cycles?
> In fairy dreamland it could load into the register by itself, and
> turn the check instruction into a NOP, saving a cycle. ;)

The idea was to boost loads above stores. You might also boost code
dependent on the store (thus the chk.a form). All part of the "let's do OOO
in software".

> Versus just using a cache prefetch and ordinary load?

Yes, but this prefetches into the register. Hence, importing all the cache
snoop logic into the core.

Ned


From: nedbrek on
Hello,

"Stefan Monnier" <monnier(a)iro.umontreal.ca> wrote in message
news:jwvljcg4z20.fsf-monnier+comp.arch(a)gnu.org...
>> One of the nice things about register rotation is that it almost removes
>> the need for the compiler to make a decision: optimize this loop by
>> unrolling it and software pipelining it, or not? Register rotation makes
>> that optimization decision much simpler.
>
> But does this warrant support in the architecture? My understanding is
> that this can only be applied to loops where software pipelining can be
> used, and these tend to be fairly short anyway, right? so unrolling them
> a little and adding some startup/cleanup shouldn't be too costly (as
> long as you have enough registers).

The irony in Itanium was that the compiler would only use software
pipelining in floating point code (i.e. short code segments). I think the
memcpy in libc used it too. That accounted for the only times I saw it in
integer code.
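
For anyone who hasn't seen register rotation, a minimal sketch of the idea follows. It is a toy model, not Itanium semantics in detail: the class and function names are invented, the file has only 4 registers, and the stage guards (`if` tests on the iteration count) stand in for rotating predicate registers.

```python
class RotatingRegs:
    """Register file whose logical names shift by one on every rotate()."""

    def __init__(self, size):
        self.phys = [None] * size
        self.rrb = 0  # rotating register base

    def _index(self, logical):
        return (logical + self.rrb) % len(self.phys)

    def read(self, logical):
        return self.phys[self._index(logical)]

    def write(self, logical, value):
        self.phys[self._index(logical)] = value

    def rotate(self):
        # Decrementing the base makes this iteration's r[0] the next
        # iteration's r[1], so values move down the pipeline by renaming.
        self.rrb = (self.rrb - 1) % len(self.phys)


def pipelined_scale(src, factor):
    """Compute [x * factor for x in src] with a 3-stage software pipeline."""
    n = len(src)
    regs = RotatingRegs(4)
    dst = [None] * n
    stages = 3
    for k in range(n + stages - 1):   # one kernel body, no unrolled copies
        if 0 <= k - 2 < n:            # stage 2: store element k-2
            dst[k - 2] = regs.read(3)
        if 0 <= k - 1 < n:            # stage 1: multiply element k-1
            regs.write(2, regs.read(1) * factor)
        if k < n:                     # stage 0: load element k
            regs.write(0, src[k])
        regs.rotate()
    return dst


result = pipelined_scale([1, 2, 3, 4, 5], 10)
```

The kernel body is written once; the guards cover pipeline fill and drain, and the rotation does the renaming that unrolling would otherwise do by hand.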


Ned


From: nedbrek on
Hello,

"Niels Jørgen Kruse" <nospam(a)ab-katrinedal.dk> wrote in message
news:1jhaz39.dut3im1rsu9bkN%nospam(a)ab-katrinedal.dk...
> nedbrek <nedbrek(a)yahoo.com> wrote:
>
>> The ironic thing is that McKinley (Itanium II and the core of all the
>> later ones) was the ideal implementation of the Itanium architecture. You
>> couldn't go any wider (the compiler struggled to fill 2 bundles). The L1
>> latency had to be 1 cycle (the compiler struggled to fill the load
>> delay).
>
> Doing loads in an earlier pipestage than other instructions was not
> considered?

Not entirely sure what you're suggesting...

If the load takes two cycles, then any arrangement of pipestages is going to
take 4 cycles to do "add-load-add".
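
As a back-of-the-envelope check, here is a minimal sketch that treats the sequence as a pure dependency chain, where each op needs the previous op's result. The 1-cycle add latency is an assumption for illustration, and the function name is made up.

```python
def chain_cycles(ops, latency):
    """Cycles for a chain in which every op waits on the previous result."""
    ready = 0
    for op in ops:
        ready += latency[op]  # the next op cannot start before this finishes
    return ready


chain = ["add", "load", "add"]
two_cycle_load = chain_cycles(chain, {"add": 1, "load": 2})
one_cycle_load = chain_cycles(chain, {"add": 1, "load": 1})
```

Moving the load to a different pipestage changes where the bubble shows up, not the length of the chain, which is why only a shorter load latency helps here.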

Itanium didn't have load-op-store addressing modes (as x86 does), which are
what can make a load/execute pipeline worthwhile (as in Atom).

In fact, IIRC, the only Itanium addressing mode was register indirect (with
optional postincrement).

Ned