From: nedbrek on
Hello all,
Is there some unwritten rule that all threads will become Itanium threads
if they run long enough? :)

"Brett Davis" <ggtgp(a)yahoo.com> wrote in message
news:ggtgp-8BCC50.22254218042010(a)news.isp.giganews.com...
>
> Intriguing, could you elaborate. (Bear in mind I would like to know the
> good points in Itanium, despite the mocking of Itanic.)

There are good points, I will try to list them here...

> I look at Section 2.6.1/2.6.2 and I see something similar to the PowerPC
> pre-load hint. (A no-harm data preload hint, could be great, is useless.)
>
> The "acheck" and "use" stuff makes me go: What!?! Are you serious!?!

Hehe, the ALAT was a disaster.

> 2.6.3 Predication, ARM has this. Seems to hurt clock rate?

1) Predication is a pain for _when_ (not if) you go out-of-order. The
Itanium people never could be made to understand this.
2) Way too many registers -> too many bits lost per instruction -> big fat
instructions -> lower instruction density -> wasted icache
3) The parallel compare instructions which just hurt everyone's brains

Better to have a robust set of cmov (is there no conditional store in x86?)
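
To make the cmov idea concrete, here is a minimal sketch in Python. It is not x86 or Itanium semantics; the `cmov` helper and its mask trick are invented for illustration. Both arms are computed up front and the predicate only selects the result, so there is no control flow to predict:

```python
def cmov(pred: bool, if_true: int, if_false: int) -> int:
    """Hypothetical helper modelling a conditional move as a branch-free select."""
    mask = -int(pred)                      # -1 (all ones) when pred is true, else 0
    return (if_true & mask) | (if_false & ~mask)


def clamp_to_zero(x: int) -> int:
    # If-converted form of:  if x < 0: x = 0
    return cmov(x < 0, 0, x)
```

The cost is that both inputs must already be computed, which is cheap for register values and is the reason a conditional store is the harder case.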

> 2.7: Register Windows, SPARC has this, cackle. ;)

This is the golden child, and often overlooked. The Itanium windows were
much better than Sparc, because they are variable sized (lower chance of
spilling at leaf nodes). Arguments can be passed to functions with no loads
and stores. If you look at a function call, the instructions before it are
a bunch of stores, then the function starts, it stores a bunch of regs
(which had all the values you want), then loads all those values. Itanium
skips all that (and load/store bandwidth is a key bottleneck).

The mistake was to expose all 128 (96 stacked) registers to software (back
to bloated instructions). A visible stack of 16 or 32 would have been fine -
even 8 could be made to work with most ABIs.
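
A toy model of the variable-sized window idea, as a sketch only: the class and method names are made up, and it ignores return values, rotation, and spilling. The point it illustrates is that the callee's frame starts at the caller's first output register, so arguments change hands by renaming rather than by stores and loads:

```python
class RegisterStack:
    """Toy model of variable-sized register windows (all names hypothetical)."""

    def __init__(self):
        self.regs = {}        # physical stacked register index -> value
        self.base = 0         # physical index of the current frame's first register
        self.n_in_local = 0   # inputs + locals in the current frame
        self.callers = []     # saved (base, n_in_local) of suspended frames

    def alloc(self, n_in_local: int) -> None:
        """Size the current frame; outputs start right after inputs + locals."""
        self.n_in_local = n_in_local

    def read(self, idx: int):
        return self.regs[self.base + idx]

    def set_out(self, i: int, value) -> None:
        self.regs[self.base + self.n_in_local + i] = value

    def call(self) -> None:
        # The callee's register 0 is the caller's first output register.
        self.callers.append((self.base, self.n_in_local))
        self.base += self.n_in_local
        self.n_in_local = 0

    def ret(self) -> None:
        self.base, self.n_in_local = self.callers.pop()


rs = RegisterStack()
rs.alloc(3)          # caller keeps 3 inputs/locals
rs.set_out(0, 10)    # first argument
rs.set_out(1, 20)    # second argument
rs.call()
rs.alloc(2)          # a small leaf asks for only 2 registers
total = rs.read(0) + rs.read(1)   # arguments are already in place
rs.ret()
```

No memory is touched anywhere in the call sequence, and the leaf only claims the two registers it needs, which is what keeps spills rare at leaf nodes.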

> 2.8: Branch hints could be nice. Will add a form to CLIW.

Branch predictors are very good. The hint will only be used the first time
the branch is seen, and it may warm the predictor for the second time
around. Static prediction (back taken, forward not-taken) is pretty good
too. Watch for instruction density.
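
For reference, the static rule fits in one line. The sketch below is illustrative only (the function names and the trace format are assumptions, not taken from any real predictor); it scores the rule against a made-up trace of one loop branch:

```python
def predict_taken(branch_pc: int, target_pc: int) -> bool:
    """Static rule: backward branches (loops) taken, forward branches not taken."""
    return target_pc < branch_pc


def accuracy(trace) -> float:
    """trace: iterable of (branch_pc, target_pc, actually_taken) tuples."""
    trace = list(trace)
    hits = sum(predict_taken(pc, tgt) == taken for pc, tgt, taken in trace)
    return hits / len(trace)


# Hypothetical loop-closing branch: taken 9 times, then falls through once.
loop_trace = [(0x40, 0x10, True)] * 9 + [(0x40, 0x10, False)]
loop_accuracy = accuracy(loop_trace)   # 9 hits out of 10
```

A rule this cheap needs no hint bits in the instruction, which is the instruction density point.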

> 2.9: Register rotation, someone needs to be locked in a rubber room. ;)

Amen.

> It is my opinion that Itanic is a disaster at any speed. ;)

Not listed: bundle templates
This was an attempt to avoid a non-issue (Itanium architects feared the
"instruction dispatch crossbar" - maze of wires). In the end, McKinley
implemented the full crossbar anyway, and it is far worse than any maze of
wires in any OOO.

The templates ended up holding more instruction decode information (see poor
instruction density), which, ironically, drove down instruction density more
when you started adding NOPs due to lack of the proper template. We
consistently measured 20-33% NOPs on integer code, and up to 50% NOPs in FP
benchmarks.
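
A quick back-of-the-envelope for what those NOP rates cost. The bundle layout used here (128 bits holding three 41-bit slots plus a 5-bit template) comes from the Itanium architecture, not from the measurements above, and the function name is made up:

```python
BUNDLE_BITS = 128        # Itanium bundle: 3 x 41-bit slots + 5-bit template
SLOTS_PER_BUNDLE = 3


def bits_per_useful_instruction(nop_fraction: float) -> float:
    """Fetched bits per non-NOP instruction at a given NOP fraction."""
    useful_slots = SLOTS_PER_BUNDLE * (1.0 - nop_fraction)
    return BUNDLE_BITS / useful_slots


for f in (0.0, 0.20, 1 / 3, 0.50):
    print(f"{f:.0%} NOPs -> {bits_per_useful_instruction(f):.1f} bits per useful instruction")
```

That works out to roughly 43 bits per instruction with no NOPs, 53 at 20%, 64 at 33%, and 85 at 50%.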

The ironic thing is that McKinley (Itanium 2 and the core of all the later
ones) was the ideal implementation of the Itanium architecture. You
couldn't go any wider (the compiler struggled to fill 2 bundles). The L1
latency had to be 1 cycle (the compiler struggled to fill the load delay).
Given those constraints, you were limited in L1 size and clock frequency.
McKinley had pretty optimal dispatch rules and functional unit mix.

There was just nowhere to go, except out-of-order. (At least for more
performance; 1-bundle machines were examined, but Itanium just doesn't have
a low power mode).

Enjoy,
Ned


From: nedbrek on
Hello all,

"Terje Mathisen" <"terje.mathisen at tmsw.no"> wrote in message
news:f0s2a7-f25.ln1(a)ntp.tmsw.no...
>
> More often in combination: It is the synergy between two or more of those
> features that allows sw to emulate what hw can do more easily, in the form
> of speculative execution, no-fault (prefetch) loads etc.

One of the best quotes from my Itanium days:
"Itanium is an OOO personality trapped in an in-order body."

Ned


From: Niels Jørgen Kruse on
nedbrek <nedbrek(a)yahoo.com> wrote:

> The ironic thing is that McKinley (Itanium 2 and the core of all the later
> ones) was the ideal implementation of the Itanium architecture. You
> couldn't go any wider (the compiler struggled to fill 2 bundles). The L1
> latency had to be 1 cycle (the compiler struggled to fill the load delay).

Doing loads in an earlier pipestage than other instructions was not
considered?

--
Mvh./Regards, Niels Jørgen Kruse, Vanløse, Denmark


From: nedbrek on
Hello all,

"Anton Ertl" <anton(a)mips.complang.tuwien.ac.at> wrote in message
news:2010Apr23.175039(a)mips.complang.tuwien.ac.at...
> Robert Myers <rbmyersusa(a)gmail.com> writes:
>>On Apr 22, 7:30 am, "nedbrek" <nedb...(a)yahoo.com> wrote:
>>> The irony in Itanium was that the compiler would only use software
>>> pipelining in floating point code (i.e. short code segments).
>>> I think the memcpy in libc used it too.
>>> That accounted for the only times I saw it in integer code.
>
> Does it have rotation for integer registers?

Yes, unless my memory has completely failed...

> IIRC I read about the hardware for transparent register stack engine
> operation not working, requiring a fallback to exception-driven
> software spilling and refilling. That would not be a big problem on
> most workloads. AFAIK SPARC and AMD29k have always used
> exception-driven software spilling and refilling.

There were multiple modes the RSE could be put into. The most aggressive
would load and store registers asynchronously. This is the mode which is
much lamented and never used (I think it could be made to work, but never
studied it).

I am unaware of any problems with the least aggressive mode (load/store on
demand, no exception needed). That is what we used in our simulators. Of
course, if there were hardware problems, I wouldn't have known...
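
A rough sketch of what on-demand spill/fill amounts to. This is a toy model, not the simulator code or the real RSE: the class name and counters are invented, and it ignores the overlap between a caller's outputs and a callee's inputs. Registers only go to the backing store when a new frame does not fit, and only come back when a return needs them:

```python
class ToyRSE:
    """On-demand spill/fill model; names and sizes are illustrative only."""

    def __init__(self, physical_regs: int = 96):
        self.physical_regs = physical_regs
        self.frames = []     # sizes of live frames, oldest first
        self.spilled = 0     # oldest registers currently in the backing store
        self.stores = 0
        self.loads = 0

    def call(self, frame_size: int) -> None:
        self.frames.append(frame_size)
        overflow = sum(self.frames) - self.spilled - self.physical_regs
        if overflow > 0:     # spill just enough of the oldest registers
            self.spilled += overflow
            self.stores += overflow

    def ret(self) -> None:
        self.frames.pop()
        if not self.frames:
            return
        # The frame being returned to must be fully resident again.
        below_top = sum(self.frames) - self.frames[-1]
        to_fill = max(0, self.spilled - below_top)
        self.spilled -= to_fill
        self.loads += to_fill


rse = ToyRSE(physical_regs=8)   # tiny file so the spill is easy to see
for _ in range(3):
    rse.call(4)                 # third call overflows by 4 registers
rse.ret()
rse.ret()                       # returning to the oldest frame fills it back
# rse.stores == 4 and rse.loads == 4
```

With the default of 96 stacked registers and shallow call chains, the model never generates any traffic at all.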

Ned



From: Stefan Monnier on
> One of the nice things about register rotation is that it almost removes the
> need for the compiler to make a decision: optimize this loop by unrolling it
> and software pipelining it, or not? Register rotation makes that
> optimization decision much simpler.

But does this warrant support in the architecture? My understanding is
that this can only be applied to loops where software pipelining can be
used, and these tend to be fairly short anyway, right? So unrolling them
a little and adding some startup/cleanup shouldn't be too costly (as
long as you have enough registers). If register pressure is a problem,
you can't unroll enough and you need to add register moves (which
basically perform the rotation by hand).
Wouldn't it be preferable (and just as easy/easier) to handle register-move
instructions efficiently?
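
To make the comparison concrete, here is a small sketch of the two options. It is a toy model with invented names, not the real ISA. Rotation changes only a base offset, so a value written to logical register 0 shows up as register 1 in the next iteration; the by-hand version gets the same effect with one move per live value:

```python
class RotatingFile:
    """Toy rotating register file (hypothetical model)."""

    def __init__(self, size: int):
        self.phys = [None] * size
        self.rrb = 0                      # rotating register base

    def _index(self, logical: int) -> int:
        return (logical + self.rrb) % len(self.phys)

    def read(self, logical: int):
        return self.phys[self._index(logical)]

    def write(self, logical: int, value) -> None:
        self.phys[self._index(logical)] = value

    def rotate(self) -> None:
        # Renaming only: old logical r[i] becomes r[i+1]; no data moves.
        self.rrb = (self.rrb - 1) % len(self.phys)


def rotate_by_moves(regs: list) -> int:
    """Same renaming done with explicit moves; returns how many were needed."""
    for i in range(len(regs) - 1, 0, -1):
        regs[i] = regs[i - 1]
    return len(regs) - 1


rf = RotatingFile(4)
rf.write(0, "a")
rf.rotate()
rf.write(0, "b")
rf.rotate()
# Now rf.read(1) is "b" and rf.read(2) is "a".
```

So the question comes down to whether those per-iteration moves can be made nearly free in the pipeline.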


Stefan