From: nmm1 on
In article <udi2a7-hn4.ln1(a)ntp.tmsw.no>,
Terje Mathisen <"terje.mathisen at tmsw.no"> wrote:
>
>On the gripping hand, the async register save/restore engine was
>supposed to make the limited depth of the register stack completely
>transparent to programmers, and this is the feature Nick has lambasted
>the most, at least 100+ times over the last decade. :-(

Not quite - I have said that it was a good idea, in theory, but
they hadn't thought it through. The rumour mill was that they
couldn't get it to work, though I have my suspicions that it was
actually software (Microsoft's?) that they couldn't get to work
when it was turned on.

The feature that I have lambasted most is the register handling in
interrupts (including unwind sections). Now, that was and is an
obvious disaster area, for very well-understood software engineering
reasons.

>Full disclosure:
>
>When I read the original asm manual and started looking into ways to
>wrap my code around the architecture, I really liked it!
>
>With the targeted speeds, it seemed obvious that it could indeed deliver
>very high performance for my handcoded asm.

Oh, yes, THAT was very clear. I didn't try, but am certain that I
could have used it in that way. And history relates that, even with
low clock speeds, it does perform very well for such computational
cores.

>What neither I nor Intel/HP seemed to understand at the time was that
>the chip would end up ~5 years late, while still running at the
>originally targeted speed: Too little, too late. :-(

I knew that it would be late, as the timescale was less than for a
new version of an existing architecture, but did not realise how
late. I knew that the software would be very late, seriously buggy,
or seriously inefficient, though.


Regards,
Nick Maclaren.
From: nedbrek on
Hello all,

"Anton Ertl" <anton(a)mips.complang.tuwien.ac.at> wrote in message
news:2010Apr23.173627(a)mips.complang.tuwien.ac.at...
> "nedbrek" <nedbrek(a)yahoo.com> writes:
>>Sure, they exposed a hardware structure to software. Of course, software
>>has to handle all cases in general, where hardware only has to handle the
>>case at hand. That means software is going to be conservative (not using
>>it all the time, and adding extra flushes).
>
> Yes, in IA-64 the compiler does the prediction of how often a given
> store aliases with a given load, so the hardware does not need a
> separate predictor for that. And yes, if a load aliases with one of
> the later stores several times in a row, and then does not alias
> several times in a row, that's a situation where the dynamic hardware
> solution will be better than the static compiler solution; but how
> often does that happen?
>
> A more frequent situation in practice is probably when the compiler
> does not know what will happen at run-time; but that's irrelevant,
> because it does not happen with SPEC CPU, because that uses profile
> feedback. The conservative approach for using the ALAT seems to me to
> use it if it offers a latency advantage when there is no alias. In
> contrast, you imply that it is used less often if the compiler does
> not know enough; why?

It's been a while. It's probably always safe to use ld.a/ld.chk...

> What extra flushes do you mean?

Again, I forget the details. I believe there were sometimes problems when
crossing function boundaries. The ALAT stores the physical register id. If
a subroutine does a ld.a to the same physical register, it may invalidate
both (to be safe). Or, it may have been necessary to flush the table on
call/return.
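
To make the ld.a / check discussion concrete, here is a toy software model of the ALAT. It is a sketch under stated assumptions, not a description of any real implementation: the class name, the dictionary keyed by register id, and the flush() method are all made up for illustration, and the check step stands in for what the post calls "ld.chk" (as far as I recall, the manuals name the check forms ld.c and chk.a).

```python
class ALAT:
    """Toy model of an Advanced Load Address Table (hypothetical, not cycle-accurate)."""

    def __init__(self):
        # register id -> address recorded by the most recent advanced load
        self.entries = {}

    def advanced_load(self, reg, addr, memory):
        # Models ld.a: perform the load early and remember (register, address).
        # A second advanced load to the same register replaces the old entry.
        self.entries[reg] = addr
        return memory[addr]

    def store(self, addr, value, memory):
        # Every store writes memory and removes entries with a matching address.
        memory[addr] = value
        for reg in [r for r, a in self.entries.items() if a == addr]:
            del self.entries[reg]

    def check(self, reg):
        # Models the check: True means the early-loaded value is still valid.
        return reg in self.entries

    def flush(self):
        # Conservative invalidation of everything (assumed here for the
        # call/return case mentioned above; forces recovery on later checks).
        self.entries.clear()


def load_hoisted_above_store(memory, load_addr, store_addr, store_value):
    """Hoist a load above a possibly aliasing store, then check and recover."""
    alat = ALAT()
    value = alat.advanced_load("r32", load_addr, memory)  # hoisted load
    alat.store(store_addr, store_value, memory)           # may or may not alias
    recovered = not alat.check("r32")
    if recovered:
        value = memory[load_addr]                         # recovery: redo the load
    return value, recovered
```

Calling load_hoisted_above_store with different load and store addresses keeps the early value and needs no recovery; with equal addresses the check fails and the load is redone. A flush() between the advanced load and the check always forces the recovery path, which is the cost of the "extra flushes" being discussed.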

Ned


From: Terje Mathisen on
nmm1(a)cam.ac.uk wrote:
> In article<udi2a7-hn4.ln1(a)ntp.tmsw.no>,
> Terje Mathisen<"terje.mathisen at tmsw.no"> wrote:
>> With the targeted speeds, it seemed obvious that it could indeed deliver
>> very high performance for my handcoded asm.
>
> Oh, yes, THAT was very clear. I didn't try, but am certain that I
> could have used it in that way. And history relates that, even with
> low clock speeds, it does perform very well for such computational
> cores.

I actually got very worthwhile performance from what used to be branchy
x86 code, but only by also changing algorithms, not a simple
pattern-matching "replace hard-to-predict if/then/else with predicates".

History has shown that compilers, in particular for C(++), don't have
enough information to do the same.

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
From: Terje Mathisen on
Anton Ertl wrote:
> "Andy \"Krazy\" Glew"<ag-news(a)patten-glew.net> writes:
>> Register windows: questionable value.
>
> Yes, if they are not present, we can use inlining to reduce the call
> overhead. Except when we cannot, as for calls to dynamically-linked
> libraries or polymorphic calls (fortunately, none of that really
> happens for the SPEC CPU benchmarks, and that's all we care about,
> no?).
>
> But wait, doesn't inlining increase code size? Given that IA-64
> detractors claim that code size is a problem already, would leaving
> away the register stack be a good idea? Maybe, if it had allowed the
> implementations to reach higher clock speeds. But would that be the
> case?
>
> All of the features of IA-64 seem to have their value, individually,
> and often also in combination.

More often in combination: It is the synergy between two or more of
those features that allows sw to emulate what hw can do more easily, in
the form of speculative execution, no-fault (prefetch) loads etc.

> I guess (based on little evidence) it's the accumulation of these
> features that is the problem. My guess is that the implementors had
> their hands (or rather heads) full with the architectural features, so
> there was no time to invent or adapt the tricks that led to
> fast-clocked IA-32 implementations at the same time, at least not in
> the first two iterations of the architecture; and afterwards Intel
> seems to have given up on it; they do little more than shrink McKinley
> to new processes.
>
> Meanwhile, IBM showed with Power6 in 2007 that in-order processors can
> be clocked higher. But that was 17 years after the Power architecture
> was introduced in 1990, and up to the Power4 in 2001 all of the Power
> implementations had been on the slow-clocked side.
>
> So maybe with enough attempts and enough effort at each attempt
> Intel/HP could produce an IA-64 implementation that's fast (whether by
> being in-order with a very fast clock or out-of-order with just a fast
> clock). But I guess it would not increase revenue from the
> architecture much (performance did not help Alpha), so the current
> course of Intel and HP appears economically sensible.

Sensible, yeah absolutely.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
From: nmm1 on
In article <5qr2a7-825.ln1(a)ntp.tmsw.no>,
Terje Mathisen <"terje.mathisen at tmsw.no"> wrote:
>
>>> With the targeted speeds, it seemed obvious that it could indeed deliver
>>> very high performance for my handcoded asm.
>>
>> Oh, yes, THAT was very clear. I didn't try, but am certain that I
>> could have used it in that way. And history relates that, even with
>> low clock speeds, it does perform very well for such computational
>> cores.
>
>I actually got very worthwhile performance from what used to be branchy
>x86 code, but only by also changing algorithms, not a simple
>pattern-matching "replace hard-to-predict if/then/else with predicates".

I tried pencil-and-paper experiments on actual code, a long time ago,
and came to the conclusion that predicates were useful as a way of
extending the instruction set, but not for handling genuinely branchy
code. I.e. they could remove only (some of) the branches introduced
by the compiler+architecture, and not ones that were in the actual
program logic.

The Fortran DIM intrinsic, and kludging up some of the ways that
IEEE 754 is specified, are obvious examples of what they are good
for. Operating on graph data structures is an obvious example of
what they can't handle.
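
As a rough illustration of the DIM case, the sketch below writes DIM(x, y), which is x - y when x > y and zero otherwise, once with a branch and once in a predicated style. The function names are invented for this example, and Python only models the idea: the comparison produces a predicate, the subtraction runs unconditionally, and the predicate selects the result.

```python
def dim_branchy(x, y):
    """DIM(x, y) written with an explicit branch."""
    if x > y:
        return x - y
    return 0


def dim_predicated(x, y):
    """DIM(x, y) in a predicated style: no control flow depends on the data."""
    p = x > y            # the compare sets a predicate
    diff = x - y         # computed unconditionally
    return (0, diff)[p]  # the predicate selects the result (False -> 0, True -> diff)
```

Both versions give the same answers. The branch goes away because each arm is a single cheap operation, so doing the work regardless costs almost nothing. A graph walk is different: which node to visit next depends on data that has just been loaded, so there is no fixed pair of arms to compute and select between.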

>History has shown that compilers, in particular for C(++), don't have
>enough information to do the same.

Precisely. And that is why I said that the compiler project wouldn't
fly - that was well-known long before the architecture was designed.


Regards,
Nick Maclaren.