From: Anton Ertl on
"nedbrek" <nedbrek(a)yahoo.com> writes:
[ALAT]
>The biggest problem was building this huge table (potentially one entry for
>every register), and snooping it with all your store traffic - power and
>area. All to get some tiny performance increase.

Hmm, my impression is that memory dependencies are quite a big
obstacle to getting good performance from scheduling. I have no
numbers at hand to support that, though. OTOH, do you have numbers
that support your claim?

>The idea was to boost loads above stores. You might also boost code
>dependent on the store (thus the chk.a form). All part of the "let's do OOO
>in software".

Not really:

1) OOO CPUs that try to reorder loads above stores tend to have
something like an ALAT themselves (I found some slides on the Core
microarchitecture that call it "memory disambiguation"); actually,
it's usually even more involved, because it contains an alias
predictor in addition to the checker.

2) The people from the Daisy project at IBM came up with a software
scheme that makes something like ALAT unnecessary (but may need more
load units instead): For checking, just load from the same address
again, and check if the result is the same. I guess that hardware
memory disambiguation could use the same approach, but maybe the
ALAT-like memory disambiguator is cheaper than the additional cache
ports and load units (then it would also be a win for software memory
disambiguation).
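
To make the scheme concrete, here is a minimal sketch in C (the
function and the names are mine, not from the Daisy papers):

  /* Speculatively hoist a load (and work depending on it) above a
     possibly-aliasing store, then check Daisy-style by re-loading
     and comparing. */
  long sum_with_hoisted_load(long *p, long *q)
  {
      long spec = *p;        /* the load, moved above the store */
      long dep  = spec + 1;  /* dependent work, also speculative */
      *q = 42;               /* the store the load was moved above */
      if (*p != spec)        /* the check: load again and compare */
          dep = *p + 1;      /* aliased: redo the dependent work */
      return dep;
  }

The second load is what costs the extra load unit and cache port;
there is no table to build or snoop.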

- anton
--
M. Anton Ertl Some things have to be seen to be believed
anton(a)mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html
From: Anton Ertl on
"nedbrek" <nedbrek(a)yahoo.com> writes:
>Hello,
>
>""Niels J�rgen Kruse"" <nospam(a)ab-katrinedal.dk> wrote in message
>news:1jhaz39.dut3im1rsu9bkN%nospam(a)ab-katrinedal.dk...
>> nedbrek <nedbrek(a)yahoo.com> wrote:
>>
>>> The ironic thing is that McKinley (Itanium II and the core of all
>>> the later ones) was the ideal implementation of the Itanium
>>> architecture. You couldn't go any wider (the compiler struggled to
>>> fill 2 bundles). The L1 latency had to be 1 cycle (the compiler
>>> struggled to fill the load delay).
>>
>> Doing loads in an earlier pipestage than other instructions was not
>> considered?
>
>Not entirely sure what you're suggesting...
>
>If the load takes two cycles, then any arrangement of pipestages is going to
>take 4 cycles to do "add-load-add".
>
>Itanium didn't have the load-op-store addressing mode (e.g. x86) which can
>make a load/execute pipeline make sense (e.g. Atom).

There is at least one paper around that claims that moving execute
down one stage to make load-use latency shorter (at the cost of higher
ALU-Load latencies and higher ALU-Branch latencies) is a win even in
load/store architectures. And at least one of the MIPS
implementations (R8000? R10000?) actually had that arrangement.

For IA-64, there is the additional problem that it does not have a
reg+const addressing mode, so I guess it will see more ALU-load
dependencies than most other architectures; this can change the
balance.
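
For concreteness, here is a small illustration in C (the instruction
sequences in the comments are approximate, from memory):

  /* A plain field access, and roughly what each ISA emits for it. */
  struct node { long pad; long val; };

  long get_val(struct node *n)
  {
      return n->val;
      /* reg+const ISA (e.g. MIPS):  ld   r2, 8(r4)
         IA-64 (register-indirect):  adds r14 = 8, r32
                                     ld8  r8 = [r14]
         The explicit address add puts an ALU op on the critical path
         in front of the load. */
  }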

- anton
--
M. Anton Ertl Some things have to be seen to be believed
anton(a)mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html
From: nedbrek on
Hello all,

"Stephen Fuld" <SFuld(a)alumni.cmu.edu.invalid> wrote in message
news:hqpquv$u06$1(a)news.eternal-september.org...
> On 4/22/2010 4:25 AM, nedbrek wrote:
>> An ALAT load was a real load. It would also place the load address
>> into a table. This table had to be snooped by every store (also, bus
>> stores, IIRC). A store would set the invalid bit. Then, the check
>> would check the bit and redo the load (or chk.a would branch). Also,
>> if the entry disappeared (because of LRU or a table flush) it was
>> treated as invalid.
>
> Yes. I thought that this was intended to counter C's aliasing problems.
> The compiler could assume no aliasing and rely on the ALAT to detect any
> aliasing. But my recollection may be wrong. :-(

Yes, exactly. The ALAT failed for a number of reasons:
1) A load hit costs 1 cycle. The check load (ld.c) also takes 1 cycle, and
must be executed before any dependent instructions (assuming you're not
going the chk.a route). So, in the simple case there isn't much to gain (a
sketch of the pattern follows this list). If you're missing in the L1, the
in-order pipe is going to shut down anyway.

2) chk.a is too expensive. You suffer a branch mispredict penalty, plus you
probably miss in the L1I (recovery code is rarely used, therefore rarely
cached). That destroys any advantage to executing a few dependent
instructions.

3) Using whole-program analysis, compilers got a lot better at static
alias detection.
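
For reference, here is the kind of code involved, with the IA-64
sequence sketched in a comment (syntax approximate, from memory):

  /* p and q may alias, so the compiler cannot statically hoist the
     load above the store; an advanced load lets it try anyway. */
  long alat_candidate(long *p, long *q)
  {
      *q = 42;        /* store; may hit the same address as p */
      return *p + 1;  /* the load the compiler wants to hoist */
  }
  /* With an advanced load (pseudo-IA-64):
         ld8.a  r3 = [p]    ; advanced load, allocates an ALAT entry
         st8    [q] = 42    ; stores snoop the ALAT, may invalidate
         ld8.c  r3 = [p]    ; the check: 1 cycle on hit, redo on miss
         adds   r8 = 1, r3
     The chk.a variant instead branches to recovery code, which is
     where failure (2) above bites. */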

Ned


From: nedbrek on
Hello all,

"Robert Myers" <rbmyersusa(a)gmail.com> wrote in message
news:f507c9d1-6f5b-4d72-963d-df25178b1fcc(a)g11g2000yqe.googlegroups.com...
>On Apr 22, 7:30 am, "nedbrek" <nedb...(a)yahoo.com> wrote:
>> The irony in Itanium was that the compiler would only use software
>> pipelining in floating point code (i.e. short code segments). I think the
>> memcpy in libc used it too. That accounted for the only times I saw it in
>> integer code.
>
> Sifting through the comments, and especially yours, I wonder if a
> candidate pair of smoking guns is that the visible register set was
> too large and/or that the register stack engine never worked the way
> it was supposed to (perhaps--and I sure don't know--because of
> problems with Microsoft software).

The visible register set was definitely too large, at least on the integer
side (I'm not a big FP guy). I believe that async RSE could be made to
work, although I didn't look into it too much. You get a lot of the benefit
even with the simplest engine.

> Having so many visible registers had to have increased the complexity
> of so many things, one of which, the ALAT, you mentioned in another
> post.

Yes, the register file is big and power hungry. Also, it makes the renamer
huge. Also, it chews up a lot of instruction bits (7 bits per operand, and
a 4-operand FMA -> 28 bits of register specifiers -> 32 bits per op is not
enough -> 41-bit ops -> bundles -> templates -> suffering [thank you
Yoda!])

> If the RSE didn't really work the way it was supposed to, then there
> would have been a fairly big downside to aggressive use of a large
> number of registers in any given procedure, thus limiting software
> pipelining to short loops.

I don't think anyone in the compiler group cared about the functioning of
the RSE. They just couldn't use the regs. The ILP wasn't there, or they
couldn't be sure the code wouldn't be buggy.

Ned


From: nedbrek on
Hello all,

"Anton Ertl" <anton(a)mips.complang.tuwien.ac.at> wrote in message
news:2010Apr22.203003(a)mips.complang.tuwien.ac.at...
> "nedbrek" <nedbrek(a)yahoo.com> writes:
> [ALAT]
>>The biggest problem was building this huge table (potentially one entry
>>for every register), and snooping it with all your store traffic - power
>>and area. All to get some tiny performance increase.
>
> Hmm, my impression is that memory dependencies are quite a big
> obstacle to getting good performance from scheduling. I have no
> numbers at hand to support that, though. OTOH, do you have numbers
> that support your claim?

It's been a while, but the Merom guys got a nice chunk (maybe 5%) from the
ld-st disambiguator. 21264 had one too. The ALAT was worth maybe 1%. See
my response to Stephen on why it failed.

>>The idea was to boost loads above stores. You might also boost code
>>dependent on the store (thus the chk.a form). All part of the "let's
>>do OOO in software".
>
> Not really:
>
> 1) OOO CPUs that try to reorder loads above stores tend to have
> something like an ALAT themselves (I found some slides on the Core
> microarchitecture that call it "memory disambiguation"); actually,
> it's usually even more involved, because it contains an alias
> predictor in addition to the checker.

Sure, they exposed a hardware structure to software. Of course, software
has to handle all cases in general, whereas hardware only has to handle
the case at hand. That means software is going to be conservative (not
using it all the time, and adding extra flushes).

Ned