From: Robert Myers on
On Apr 23, 3:06 pm, Rick Jones <rick.jon...(a)hp.com> wrote:
> Anton Ertl <an...(a)mips.complang.tuwien.ac.at> wrote:
> > A more frequent situation in practice is probably when the compiler
> > does not know what will happen at run-time; but that's irrelevant,
> > because it does not happen with SPEC CPU, because that uses profile
> > feedback.
>
> SPECcpu2006 explicitly disallows PBO in base and only allows it in
> peak. That was a change from SPECcpu2000, which allowed PBO in both.
>
That just forces you to design the compiler around the benchmarks.

A realistic set of rules for profile-based optimization would ask: how
much predictability does this code have in practice and how well do
the compiler and processor exploit it?

Hard to answer such a question in the world of what Nick calls
benchmarketing.

Robert.
From: Terje Mathisen on
Anton Ertl wrote:
> 2) The people from the Daisy project at IBM came up with a software
> scheme that makes something like ALAT unnecessary (but may need more
> load units instead): For checking, just load from the same address
> again, and check if the result is the same. I guess that hardware
> memory disambiguation could use the same approach, but maybe the
> ALAT-like memory disambiguator is cheaper than the additional cache
> ports and load units (then it would also be a win for software memory
> disambiguation).

This only works for a single level of load, otherwise you end up with
the ABA problem.

I.e. you'll need to do the check on every single subsequent/dependent
load as well.
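A toy sketch of both points (hypothetical code, not from the Daisy project; memory modeled as a plain dict): the first function is the re-load-and-compare check Anton describes, the second shows why one check is not enough for a dependent chain.

```python
# Hypothetical sketch of the Daisy-style software check: a load hoisted
# above a possibly-aliasing store is validated by re-loading the same
# address and comparing values.

def checked_speculative_load(mem, addr, spec_value):
    """Return the speculated value if the re-load confirms it,
    otherwise the fresh value (recovery path)."""
    fresh = mem[addr]                  # re-load from the same address
    return spec_value if fresh == spec_value else fresh

# The ABA hazard for a dependent chain p = mem[a]; x = mem[p]:
# if mem[p] changed while mem[a] kept the same pointer value, the
# check on 'a' passes but 'x' is stale -- so every dependent load
# needs its own check.
def chain_is_valid(mem, a, spec_p, spec_x):
    if mem[a] != spec_p:               # check level 1
        return False
    return mem[spec_p] == spec_x       # must also check level 2
```
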

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
From: "Andy "Krazy" Glew" on
On 4/21/2010 8:21 AM, Stefan Monnier wrote:
>> One of the nice things about register rotation is that it almost removes the
>> need for the compiler to make a decision: optimize this loop by unrolling it
>> and software pipelining it, or not? Register rotation makes that
>> optimization decision much simpler.
>
> But does this warrant support in the architecture? My understanding is
> that this can only be applied to loops where software pipelining can be
> used, and these tend to be fairly short anyway, right? so unrolling them
> a little and adding some startup/cleanup shouldn't be too costly (as
> long as you have enough registers). If register pressure is a problem,
> you can't unroll enough and you need to add register moves (which
> basically perform the rotation by hand).
> Wouldn't it be preferable (and just as easy/easier) to handle register-move
> instructions efficiently?


While I have worked on, and advocated, handling reg-reg move instructions efficiently, this introduces a whole new level
of complexity.

Specifically, MOVE elimination, changing

lreg2 := MOVE lreg1
lreg3 := ADD lreg2 + 1

into something like

preg2 := MOVE preg1 // eliminated, or ...
preg3 := ADD preg1 + 1

requires that you do some form of reference counting or garbage collection for registers - to track that both lreg1 and
lreg2 map to preg1.

While doable, it's a chunk of complexity.
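A minimal sketch of that bookkeeping (hypothetical renamer, not any shipping design): after the MOVE is eliminated, lreg1 and lreg2 both name preg1, so preg1 can only be freed when the last mapping to it dies.

```python
# Hypothetical renamer: MOVE elimination maps the destination logical
# register onto the source's physical register instead of allocating a
# new one, so physical registers need reference counts.

class Renamer:
    def __init__(self, num_pregs):
        self.free = list(range(num_pregs))  # free physical registers
        self.map = {}                       # lreg -> preg
        self.refs = {}                      # preg -> reference count

    def _release(self, lreg):
        preg = self.map.get(lreg)
        if preg is not None:
            self.refs[preg] -= 1
            if self.refs[preg] == 0:        # last name died: free it
                self.free.append(preg)

    def write(self, lreg):                  # ordinary destination rename
        self._release(lreg)
        preg = self.free.pop()
        self.map[lreg] = preg
        self.refs[preg] = 1
        return preg

    def eliminate_move(self, dst, src):     # dst := MOVE src, eliminated
        self._release(dst)
        preg = self.map[src]
        self.map[dst] = preg                # dst shares src's preg ...
        self.refs[preg] += 1                # ... so bump its refcount
        return preg
```

With naive last-writer freeing (no refcounts), overwriting lreg2 would free the shared preg and corrupt lreg1, which still maps to it.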

Observe that the nice thing about register rotation is that it is a permutation. No reference counts.

And it operates on a lot of registers all at the same time. No arbitrary limit of "at most 2 MOVes may be handled in a cycle".

From: "Andy "Krazy" Glew" on
On 4/22/2010 8:16 AM, Robert Myers wrote:

> Having so many visible registers had to have increased the complexity
> of so many things, one of which, the ALAT, you mentioned in another
> post.


Hardware complexity, hell:

The ALAT made single threaded code non-deterministic. You could get different bugs depending on the load average on the
machine.

That's stupid.
From: nedbrek on
Hello all,

"Anton Ertl" <anton(a)mips.complang.tuwien.ac.at> wrote in message
news:2010Apr23.153819(a)mips.complang.tuwien.ac.at...
> "nedbrek" <nedbrek(a)yahoo.com> writes:
>>2) chk.a is too expensive. You suffer a branch mispredict penalty, plus
>>you
>>probably miss in the L1I (recovery code is rarely used, therefore rarely
>>cached).
>
> If the recovery code is rarely used, why is the branch mispredicted?
> And why do you suffer a miss in the I-cache? In the usual case the
> branch prediction will be correct (no mispredict penalty), and the
> recovery code will not be performed (no I-cache miss).

The code is going to look like:
ld.a r1 = a
add = r1
sub = r1
....
chk.a r1, fixup
....
<a long way away>
fixup:
ld r1 = a
add = r1
sub = r1
jmp back

The chk is the branch (if r1 has been invalidated, jump to a section which
will redo the dependent instructions). If the load is always invalid, the
branch can be predicted correctly - but then you always do the work twice ->
lower performance.

If the load is infrequently invalidated, you probably can't predict it ->
branch mispredict and I-cache miss.
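A toy model of the ld.a / chk.a dance as described above (hypothetical and much simplified, not the real ALAT): the advanced load allocates an entry for its address, any store to that address kills the entry, and chk.a branches to fixup only when the entry is gone.

```python
# Toy ALAT model: ld.a tracks the address of an advanced load; an
# aliasing store invalidates the entry; chk.a decides whether the
# fixup (re-do the work) branch is taken.

class ALAT:
    def __init__(self):
        self.entries = set()        # addresses with live advanced loads

    def ld_a(self, mem, addr):      # advanced load: load early, track addr
        self.entries.add(addr)
        return mem[addr]

    def store(self, mem, addr, val):
        mem[addr] = val
        self.entries.discard(addr)  # store invalidates matching entry

    def chk_a(self, addr):          # True -> branch to fixup
        return addr not in self.entries
```

In the scenario above, chk_a almost always comes back False, so the branch itself is predictable; the pain is in the rare-but-nonzero invalidation case.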

>>3) Using whole program analysis, compilers got a lot better at static
>>alias
>>detection.
>
> Yes, SPEC CPU is everything that counts. The applications that use
> dynamic linking and that have to build in finite time, and are not
> written in the subset of C that's supported by the whole-program
> analyser (which is not used anyway because the rebuilds take too
> long), i.e., most of the real-world applications, they are irrelevant.

Sure, this is Itanium. SPEC was all we had for a long time (most of the
architecture committee used 9-queens), and we were glad to have it.

Ned