From: Anton Ertl on
Terje Mathisen <"terje.mathisen at tmsw.no"> writes:
>Anton Ertl wrote:
>> 2) The people from the Daisy project at IBM came up with a software
>> scheme that makes something like ALAT unnecessary (but may need more
>> load units instead): For checking, just load from the same address
>> again, and check if the result is the same. I guess that hardware
>> memory disambiguation could use the same approach, but maybe the
>> ALAT-like memory disambiguator is cheaper than the additional cache
>> ports and load units (then it would also be a win for software memory
>> disambiguation).
>
>This only works for a single level of load, otherwise you end up with
>the ABA problem.

What do you mean by "level of load"?

And what do you mean by the ABA problem? What I understand as the ABA
problem is not a problem here: if the speculative load loads the right
value, that value and any computation based on it will be correct
even if the content of the memory location changes several times
between the speculative load and the checking load.

- anton
--
M. Anton Ertl Some things have to be seen to be believed
anton(a)mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html
From: Anton Ertl on
"nedbrek" <nedbrek(a)yahoo.com> writes:
>2) chk.a is too expensive. You suffer a branch mispredict penalty, plus you
>probably miss in the L1I (recovery code is rarely used, therefore rarely
>cached).

If the recovery code is rarely used, why is the branch mispredicted?
And why do you suffer a miss in the I-cache? In the usual case the
branch prediction will be correct (no mispredict penalty), and the
recovery code will not be performed (no I-cache miss).
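
For reference, the pattern under discussion looks roughly like this (an
IA-64 sketch; register numbers and labels are illustrative). On the common
path chk.a is a correctly predicted not-taken branch, and the out-of-line
recovery code is only fetched when the ALAT entry was actually evicted:

```
        ld8.a   r8 = [r32]        // advanced load; allocates an ALAT entry
        // ... other work scheduled above the store ...
        st8     [r33] = r9        // an aliasing store evicts the ALAT entry
        chk.a.clr r8, recover     // entry gone? branch to recovery code
back:
        // ... common path continues using r8 ...

recover:                          // out-of-line, rarely fetched
        ld8     r8 = [r32]        // redo the load (and any dependent work)
        br      back
```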

>3) Using whole program analysis, compilers got a lot better at static alias
>detection.

Yes, SPEC CPU is everything that counts. The applications that use
dynamic linking, that have to build in finite time, and that are not
written in the subset of C that's supported by the whole-program
analyser (which is not used anyway because the rebuilds take too
long), i.e., most of the real-world applications, are irrelevant.

- anton
--
M. Anton Ertl Some things have to be seen to be believed
anton(a)mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html
From: nmm1 on
In article <2010Apr23.153819(a)mips.complang.tuwien.ac.at>,
Anton Ertl <anton(a)mips.complang.tuwien.ac.at> wrote:
>"nedbrek" <nedbrek(a)yahoo.com> writes:
>
>>3) Using whole program analysis, compilers got a lot better at static alias
>>detection.
>
>Yes, SPEC CPU is everything that counts. The applications that use
>dynamic linking, that have to build in finite time, and that are not
>written in the subset of C that's supported by the whole-program
>analyser (which is not used anyway because the rebuilds take too
>long), i.e., most of the real-world applications, are irrelevant.

Or where they make heavy use of a library that is not distributed as
source!


Regards,
Nick Maclaren.
From: Terje Mathisen <"terje.mathisen at tmsw.no"> on
Robert Myers wrote:
> Brett Davis wrote:
>
>> But ultimately is not register windowing just a horrid complex slow
>> way to get more register bits, in a fix width instruction set?
>
> Not in the case of Itanium, which has tons of registers.
>
> The purpose, as I understand it, is to permit more seamless operation
> across procedure calls.

I think both were supposed to be important:

Using rotating regs, along with predicated/masked execution, made it
very natural to write almost naive loops, with zero (visible) unrolling,
that still managed to match both L2 load delays and fp latencies, and
got rid of all the normal startup/cleanup code paths.

This did save quite a bit of instruction space.
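
A rough sketch of what such a loop looks like (IA-64 style; registers,
latencies, and the ar.ec stage count are illustrative). The rotating
predicates p16-p18 stage the pipeline in and out automatically, so no
separate prologue/epilogue code is emitted:

```
        mov     ar.lc = 99          // 100 iterations
        mov     ar.ec = 3           // 3 pipeline stages to fill and drain
        mov     pr.rot = 0x10000    // p16 = 1, p17-p63 = 0
loop:
(p16)   ldfd    f32 = [r32], 8      // stage 1: load a[i]
(p17)   fma.d   f36 = f33, f8, f9   // stage 2: a[i]*c + d (f33 = f32 one rotation ago)
(p18)   stfd    [r33] = f37, 8      // stage 3: store (f37 = f36 one rotation ago)
        br.ctop.sptk loop           // rotate regs and predicates, decrement lc/ec
```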

On the other hand, rotation did indeed make it very cheap to do a limited
number of relatively shallow (not too many parameters) function calls,
avoiding much of the need for inlining, which can also be responsible for
code bloat.

On the gripping hand, the async register save/restore engine was
supposed to make the limited depth of the register stack completely
transparent to programmers, and this is the feature Nick has lambasted
the most, at least 100+ times over the last decade. :-(

Full disclosure:

When I read the original asm manual and started looking into ways to
wrap my code around the architecture, I really liked it!

With the targeted speeds, it seemed obvious that it could indeed deliver
very high performance for my handcoded asm.

What neither I nor Intel/HP seemed to understand at the time was that
the chip would end up ~5 years late, while still running at the
originally targeted speed: Too little, too late. :-(

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
From: Anton Ertl on
"nedbrek" <nedbrek(a)yahoo.com> writes:
>Hello all,
>
>"Anton Ertl" <anton(a)mips.complang.tuwien.ac.at> wrote in message
>> 1) OOO CPUs that try to reorder loads above stores tend to have
>> something like an ALAT themselves (I found some slides on the Core
>> microarchitecture that call it "memory disambiguation"); actually,
>> it's usually even more involved, because it contains an alias
>> predictor in addition to the checker.
>
>Sure, they exposed a hardware structure to software. Of course, software
>has to handle all cases in general, where hardware only has to handle the
>case at hand. That means software is going to be conservative (not using it
>all the time, and adding extra flushes).

Yes, in IA-64 the compiler does the prediction of how often a given
store aliases with a given load, so the hardware does not need a
separate predictor for that. And yes, if a load aliases with one of
the later stores several times in a row, and then does not alias
several times in a row, that's a situation where the dynamic hardware
solution will be better than the static compiler solution; but how
often does that happen?

A more frequent situation in practice is probably when the compiler
does not know what will happen at run-time; but that's irrelevant,
because it does not happen with SPEC CPU, because that uses profile
feedback. The conservative approach for using the ALAT seems to me to
be to use it if it offers a latency advantage when there is no alias. In
contrast, you imply that it is used less often if the compiler does
not know enough; why?

What extra flushes do you mean?

- anton
--
M. Anton Ertl Some things have to be seen to be believed
anton(a)mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html