From: Anton Ertl on
"Andy \"Krazy\" Glew" <ag-news(a)patten-glew.net> writes:
>Register windows: questionable value.

Yes, if they are not present, we can use inlining to reduce the call
overhead. Except when we cannot, as for calls to dynamically-linked
libraries or polymorphic calls (fortunately, none of that really
happens for the SPEC CPU benchmarks, and that's all we care about,
no?).

But wait, doesn't inlining increase code size? Given that IA-64
detractors claim that code size is a problem already, would leaving
out the register stack be a good idea? Maybe, if it had allowed the
implementations to reach higher clock speeds. But would that be the
case?

All of the features of IA-64 seem to have their value, individually,
and often also in combination.

I guess (based on little evidence) it's the accumulation of these
features that is the problem. My guess is that the implementors had
their hands (or rather heads) full with the architectural features, so
there was no time to also invent or adapt the tricks that led to
fast-clocked IA-32 implementations, at least not in
the first two iterations of the architecture; and afterwards Intel
seems to have given up on it; they do little more than shrink McKinley
to new processes.

Meanwhile, IBM showed with Power6 in 2007 that in-order processors can
be clocked higher. But that was 17 years after the Power architecture
was introduced in 1990, and up to the Power4 in 2001 all of the Power
implementations had been on the slow-clocked side.

So maybe with enough attempts and enough effort at each attempt
Intel/HP could produce an IA-64 implementation that's fast (whether by
being in-order with a very fast clock or out-of-order with just a fast
clock). But I guess it would not increase revenue from the
architecture much (performance did not help Alpha), so the current
course of Intel and HP appears economically sensible.

- anton
--
M. Anton Ertl Some things have to be seen to be believed
anton(a)mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html
From: Terje Mathisen on
Anton Ertl wrote:
> Terje Mathisen <"terje.mathisen at tmsw.no"> writes:
>> Anton Ertl wrote:
>>> 2) The people from the Daisy project at IBM came up with a software
>>> scheme that makes something like ALAT unnecessary (but may need more
>>> load units instead): For checking, just load from the same address
>>> again, and check if the result is the same. I guess that hardware
>>> memory disambiguation could use the same approach, but maybe the
>>> ALAT-like memory disambiguator is cheaper than the additional cache
>>> ports and load units (then it would also be a with for software memory
>>> disambiguation).
>>
>> This only works for a single level of load, otherwise you end up with
>> the ABA problem.
>
> What do you mean by "level of load"?
>
> And what do you mean by the ABA problem? What I understand as the ABA
> problem is not a problem here: If the speculative load loads the right
> value, that value and any computation based on that will be correct
> even if the content of memory location changes several times between
> the speculative load and the checking load.

I'm thinking of a multi-level structure where the critical value is a
pointer:

First you load it and get A, then load an item in the block A points at,
then another process comes and does the following:

Load A, process what it points at and free that block. (At this point
A=NULL).

Next the same or yet another process allocates a new block and gets to
reuse the area A used to point to, but this time it is filled by another
set of data, OK?

Finally you are rescheduled, finish the processing you started and do a
compare against the original value of A to make sure it has all been
safe, before committing your updates.

I.e. a single final compare isn't sufficient if the meaning can change;
you have to verify every single item you have loaded that depended upon
that speculatively loaded item.

The ALAT is similar to LLSC in that it will detect all modifications,
including a rewrite of the same value.

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
From: Robert Myers on
Anton Ertl wrote:
> Robert Myers <rbmyersusa(a)gmail.com> writes:
>> On Apr 22, 7:30 am, "nedbrek" <nedb...(a)yahoo.com> wrote:
>>> The irony in Itanium was that the compiler would only use software
>>> pipelining in floating point code (i.e. short code segments). I think
>>> the memcpy in libc used it too. That accounted for the only times I
>>> saw it in integer code.
>
> Does it have rotation for integer registers?
>
>> Sifting through the comments, and especially yours, I wonder if a
>> candidate pair of smoking guns
>
> Smoking guns for what?
>
What didn't work the way it was supposed to (maybe the RSE) or what
feature cost way too much with too little payback (too many
architectural registers). Or maybe it's what many have implied: what do
you expect from a design by committee?--in which case there are no
smoking guns (the lethal shot that killed the world's most amazing
processor).

>> is that the visible register set was
>> too large and/or that the register stack engine never worked the way
>> it was supposed to (perhaps--and I sure don't know--because of
>> problems with Microsoft software).
>
> Problems with Microsoft software should be irrelevant on non-Microsoft
> platforms.
>
Yes, but if problems with Microsoft forced a change of plans, the
resulting loss in performance would have appeared on all platforms.


> IIRC I read about the hardware for transparent register stack engine
> operation not working, requiring a fallback to exception-driven
> software spilling and refilling. That would not be a big problem on
> most workloads. AFAIK SPARC and AMD29k have always used
> exception-driven software spilling and refilling.
>
And that says what about Itanium, which had a completely different set
of priorities? The fact that register spills could be handled
asynchronously meant that you could use registers with reckless
abandon--unless the RSE never worked the way it should have, in which
case you couldn't. Then you had the cost of all those architectural
registers without a commensurate payback.


>> If the RSE didn't really work the way it was supposed to, then there
>> would have been a fairly big downside to aggressive use of a large
>> number of registers in any given procedure, thus limiting software
>> pipelining to short loops.
>
> Not really, because software pipelining is beneficial mainly for inner
> loops with many iterations; if you have that, then any register
> spilling and refilling overhead is amortized over many executed
> instructions. Of course, all of this depends on the compiler being
> able to predict which loops have many iterations. But this is no
> problem for SPEC CPU, which uses profile feedback; and of course, SPEC
> CPU performance is what's relevant.
>
In other words, if Itanium hadn't attempted to embrace a design
philosophy that is still apparently unwelcome to you, there shouldn't
have been a problem. Are you being serious, or are you just jerking my
chain?

Same for your snarky comments about profile-directed optimization. Okay,
you don't like it. We got that.

Robert.
From: Robert Myers on
Robert Myers wrote:
> Anton Ertl wrote:
>> Robert Myers <rbmyersusa(a)gmail.com> writes:
>>> On Apr 22, 7:30 am, "nedbrek" <nedb...(a)yahoo.com> wrote:
>>>> The irony in Itanium was that the compiler would only use software
>>>> pipelining in floating point code (i.e. short code segments). I think
>>>> the memcpy in libc used it too. That accounted for the only times I
>>>> saw it in integer code.
>>
>> Does it have rotation for integer registers?
>>
>>> Sifting through the comments, and especially yours, I wonder if a
>>> candidate pair of smoking guns
>>
>> Smoking guns for what?
>>
> What didn't work the way it was supposed to (maybe the RSE) or what
> feature cost way too much with too little payback (too many
> architectural registers). Or maybe it's what many have implied: what do
> you expect from a design by committee?--in which case there are no
> smoking guns (the lethal shot that killed the world's most amazing
> processor).
>

It occurred to me, after I sent that off, that maybe Itanium was
*expected* to have a two cycle L1 load delay. If registers are free,
they essentially become another layer of cache. Load (or preload)
everything once. Who cares if it takes two cycles?

Of course, things didn't work out that way.

Robert.
From: Rick Jones on
Anton Ertl <anton(a)mips.complang.tuwien.ac.at> wrote:
> A more frequent situation in practice is probably when the compiler
> does not know what will happen at run-time; but that's irrelevant,
> because it does not happen with SPEC CPU, because that uses profile
> feedback.

SPECcpu2006 explicitly disallows PBO in base and only allows it in
peak. That was a change from SPECcpu2000, which allowed PBO in both.

rick jones
--
The computing industry isn't as much a game of "Follow The Leader" as
it is one of "Ring Around the Rosy" or perhaps "Duck Duck Goose."
- Rick Jones
these opinions are mine, all mine; HP might not want them anyway... :)
feel free to post, OR email to rick.jones2 in hp.com but NOT BOTH...