From: nedbrek on
Hello all,

"Robert Myers" <rbmyersusa(a)> wrote in message
> Anton Ertl wrote:
>> Robert Myers <rbmyersusa(a)> writes:
>>> Sifting through the comments, and especially yours, I wonder if a
>>> candidate pair of smoking guns
>> Smoking guns for what?
> What didn't work the way it was supposed to (maybe the RSE) or what
> feature cost way too much with too little payback (too many architectural
> registers). Or maybe it's what many have implied: what do you expect from
> a design by committee?--in which case there are no smoking guns (the
> lethal shot that killed the world's most amazing processor).

The main "smoking gun" is that the instruction set doesn't matter. Installed
base matters a lot, and performance matters. Also, delivering product on
time and in quantity...

Of course, too many registers didn't help. I mentioned this elsewhere, but
I can add:
7 register bits * 4 ops + 6 predicate bits = 34 bit instruction (worst case)
=> no 32 bit instructions
=> bundling
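The bit count above can be checked with a few lines of Python. This is only a sketch: the constant and function names are made up for illustration, and reading "4 ops" as four register operands (a destination plus three sources, as in a fused multiply-add) is my assumption about what was meant.

```python
# Field widths taken from the arithmetic in the post; names are hypothetical.
REGISTER_BITS = 7       # 2**7 = 128 architectural registers
PREDICATE_BITS = 6      # 2**6 = 64 predicate registers
MAX_REG_OPERANDS = 4    # assumed reading of "4 ops": dst plus three sources


def operand_bits(reg_operands=MAX_REG_OPERANDS):
    """Bits spent on register and predicate specifiers alone (no opcode)."""
    return REGISTER_BITS * reg_operands + PREDICATE_BITS


def fits_in(word_bits, opcode_bits=0, reg_operands=MAX_REG_OPERANDS):
    """True if the specifiers plus an opcode field fit in word_bits."""
    return operand_bits(reg_operands) + opcode_bits <= word_bits


print(operand_bits())   # 34
print(fits_in(32))      # False: no room even before adding an opcode
print(fits_in(41))      # True: a 41-bit slot leaves 7 bits for everything else
```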
Bundling in itself isn't too bad, you need somewhere to stash dependency
info.

But Itanium tried to record independence - turns out, determining
dependence is much more important (see Smith's dependency chain processing
research).

Also, some dork tried to "fix" the (nonexistent) "dispatch problem", and
ended up messing things up even worse. This led to:

extra decode info in the bundle template
=> not enough templates
=> lots of 41 bit NOPs
=> poor icache and front-end utilization
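To put a rough number on the NOP cost: the post only mentions 41-bit NOPs, but the published IA-64 format is a 128-bit bundle holding a 5-bit template plus three 41-bit slots. The sketch below uses that layout to estimate what fraction of fetched bits carry useful work. The function name and the slot-fill mix in the example are invented for illustration, not measurements.

```python
BUNDLE_BITS = 128
TEMPLATE_BITS = 5
SLOT_BITS = 41
SLOTS_PER_BUNDLE = 3

# Sanity check on the layout: 5 + 3 * 41 = 128.
assert TEMPLATE_BITS + SLOTS_PER_BUNDLE * SLOT_BITS == BUNDLE_BITS


def useful_fraction(useful_slots_per_bundle):
    """Fraction of fetched bits that encode non-NOP instructions.

    useful_slots_per_bundle: one entry per bundle, each the number of
    slots (0..3) in that bundle that are not NOPs.
    """
    if not useful_slots_per_bundle:
        raise ValueError("need at least one bundle")
    if any(not 0 <= n <= SLOTS_PER_BUNDLE for n in useful_slots_per_bundle):
        raise ValueError("each bundle has between 0 and 3 useful slots")
    fetched = len(useful_slots_per_bundle) * BUNDLE_BITS
    useful = sum(useful_slots_per_bundle) * SLOT_BITS
    return useful / fetched


# Hypothetical mix: one full bundle, two with one NOP, one with two NOPs.
print(f"{useful_fraction([3, 2, 1, 2]):.1%}")
```

Even a perfectly packed stream tops out at 123/128 because of the template bits; every NOP slot takes another 41 bits out of the icache and fetch bandwidth.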


From: MitchAlsup on
On Apr 23, 2:06 am, "Andy \"Krazy\" Glew" <ag-n...(a)>
> While I have worked on, and advocated, handling reg-reg move instructions efficiently, this introduces a whole new level
> of complexity.
> Specifically, MOVE elimination, changing
>          lreg2 := MOVE lreg1
>          lreg3 := ADD lreg2 + 1

Ireg2 := MOV Ireg1
Ireg2 := OP Ireg2,<const or reg or mem>

Was a hardware optimization in K9 easily detected during trace
building. Two x86 instructions <const or reg> became a single
operation in the trace cache. The memory form became a two op form in
the trace cache. Recognizing Andy's version (Ireg3) leads to the
ability to do the Move elimination found later in this post.
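A toy model of that fusion, to make the pattern concrete. This is a sketch under my own assumptions, not the K9 logic: the Op tuple, the bracket convention for memory operands, and the function names are all hypothetical, and the memory form is simply left as two ops rather than rewritten.

```python
from typing import NamedTuple


class Op(NamedTuple):
    dst: str
    opcode: str
    srcs: tuple   # operands as strings: register names, constants, or "[mem]"


def is_mem(operand):
    # Hypothetical convention: memory operands are written in brackets.
    return operand.startswith("[")


def fuse_mov_op(trace):
    """Fuse 'r := MOV s; r := OP r, x' into the 3-operand 'r := OP s, x'."""
    out = []
    i = 0
    while i < len(trace):
        cur = trace[i]
        nxt = trace[i + 1] if i + 1 < len(trace) else None
        fusable = (
            nxt is not None
            and cur.opcode == "MOV"
            and nxt.opcode != "MOV"
            and not is_mem(cur.srcs[0])
            and nxt.dst == cur.dst
            and len(nxt.srcs) > 0
            and nxt.srcs[0] == cur.dst
            and not any(is_mem(s) for s in nxt.srcs)   # memory form stays two ops
        )
        if fusable:
            moved_from = cur.srcs[0]
            # Replace every read of the MOV destination with the MOV source.
            srcs = tuple(moved_from if s == cur.dst else s for s in nxt.srcs)
            out.append(Op(nxt.dst, nxt.opcode, srcs))
            i += 2
        else:
            out.append(cur)
            i += 1
    return out


trace = [Op("r2", "MOV", ("r1",)), Op("r2", "ADD", ("r2", "1"))]
print(fuse_mov_op(trace))   # one op: r2 := ADD r1, 1
```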

This and branch fusing accounted for most of the idiom recognition that was
done in that microarchitecture.

A side effect of the idiom recognizer was that: (Move elimination)

temp := MOVE Ireg1
Ireg1 := MOVE Ireg2
Ireg2 := MOVE temp

Would only cause 2 operations in the trace cache. All you had to track
down was that another operation destroyed 'temp' by the end of the
trace boundary.
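A sketch of that check, again in Python and again hypothetical: the Op tuple, the "MOV_OLD" marker (meaning "read the source as it was before the pair"), and the function names are mine. Operands are plain register names or constants here; memory operands are not modelled.

```python
from typing import NamedTuple


class Op(NamedTuple):
    dst: str
    opcode: str
    srcs: tuple


def temp_is_dead(trace, start, temp):
    """True if temp is overwritten before it is read in trace[start:]."""
    for op in trace[start:]:
        if temp in op.srcs:
            return False    # read first: temp is still live
        if op.dst == temp:
            return True     # destroyed before any read
    return False            # survives to the trace boundary: assume live


def collapse_swap(trace):
    """Turn a 3-MOV swap through a dead temp into 2 operations."""
    out = []
    i = 0
    while i < len(trace):
        window = trace[i:i + 3]
        if len(window) == 3 and all(op.opcode == "MOV" for op in window):
            first, second, third = window
            temp, a = first.dst, first.srcs[0]
            b = second.srcs[0]
            is_swap = (
                second.dst == a
                and third.dst == b
                and third.srcs[0] == temp
                and len({temp, a, b}) == 3
            )
            if is_swap and temp_is_dead(trace, i + 3, temp):
                # Both ops read the pre-swap values, so temp is not needed.
                out.append(Op(a, "MOV_OLD", (b,)))
                out.append(Op(b, "MOV_OLD", (a,)))
                i += 3
                continue
        out.append(trace[i])
        i += 1
    return out


trace = [
    Op("temp", "MOV", ("r1",)),
    Op("r1", "MOV", ("r2",)),
    Op("r2", "MOV", ("temp",)),
    Op("temp", "ADD", ("r3", "1")),   # destroys temp inside the trace
]
print(len(collapse_swap(trace)))   # 3: two swap ops plus the ADD
```

If nothing overwrites temp before the trace ends, the sketch leaves all three moves alone, since a later trace might still read it.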

It was after this kind of realization that I became convinced that
3-register architectural instruction formats waste bits that might be
better spent on other encoding stuff. 3-register microarchitectural
operation formats remain de rigueur.

I also recognized that I will never (97% level) get a chance to use
those bits in more profitable endeavors.

From: Brett Davis on
In article <hqmoq8$5mt$1(a)>,
"nedbrek" <nedbrek(a)> wrote:
> "Brett Davis" <ggtgp(a)> wrote in message
> > The "acheck" and "use" stuff makes me go: What!?! Are you serious!?!
> Hehe, the ALAT was a disaster.


So this was a special sidecar cache that held 32 long words?

From a hardware design point of view, dealing with all the special cases would
make that a disaster. (Grep the Itanic manual to see what I mean.)

I assume it had some sort of redeeming benefit, like a load-to-use
delay of fewer cycles?
In fairy dreamland it could load into the register by itself, and
turn the check instruction into a NOP, saving a cycle. ;)

Versus just using a cache prefetch and ordinary load?

From: Robert Myers on
nedbrek wrote:

> Bundling in itself isn't too bad, you need somewhere to stash dependency
> info.
> But, Itanium tried to record independence - turns out, determining
> dependence is much more important (see Smith's dependency chain processing
> research).

The paper I found

An Instruction Set and Microarchitecture for
Instruction Level Distributed Processing
Ho-Seop Kim and James E. Smith
Department of Electrical and Computer Engineering
University of Wisconsin-Madison

advertises the ability to run at a high clock rate and also proposes
binary translation. This paper was, of course, before the Pentium 4
clock rate debacle, before Transmeta folded, and before power
consumption became an obsession.

That is not to say that the idea has no merit. On the face
of it, keeping dependent chains together has the obvious advantage of
increasing locality, so that computation can be efficiently parceled out
over threads in a core, over separate cores on a chip, or even
conceivably over multiple sockets.

From: Niels Jørgen Kruse on
Anton Ertl <anton(a)> wrote:

> There is at least one paper around that claims that moving execute
> down one stage to make load-use latency shorter (at the cost of higher
> ALU-Load latencies and higher ALU-Branch latencies) is a win even in
> load/store architectures. And at least one of the MIPS
> implementations (R8000? R10000?) actually had that arrangement.

IBM in-order PPC designs do this. Cell PPE skewed 3 cycles and POWER6
skewed 2 cycles for simple integer instructions.

> For IA-64, there is the additional problem that it does not have a
> reg+const addressing mode, so I guess it will see more ALU-load
> dependencies than most other architectures; this can change the
> balance.

Good point.
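To illustrate the addressing-mode point: without reg+const addressing, a load from base+offset needs a separate add, and that add feeds the load's address. The Python sketch below just counts those ALU-to-load dependencies for a made-up access sequence. The tuple format, the function names, and the "tmp" register are hypothetical, and real compilers can hide some of these adds (for example with post-increment), which this ignores.

```python
def lower_load(dst, base, offset, has_reg_const_mode):
    """Lower 'dst = load [base + offset]' to a list of (opcode, dst, src, imm)."""
    if offset == 0 or has_reg_const_mode:
        return [("LOAD", dst, base, offset)]
    # No reg+const mode: compute the address first, then load through it.
    return [("ADD", "tmp", base, offset), ("LOAD", dst, "tmp", 0)]


def alu_load_dependencies(ops):
    """Count loads whose address register was written by the preceding ADD."""
    count = 0
    for prev, cur in zip(ops, ops[1:]):
        if prev[0] == "ADD" and cur[0] == "LOAD" and cur[2] == prev[1]:
            count += 1
    return count


# Hypothetical sequence: three struct-field loads off the same base pointer.
accesses = [("r4", "r1", 0), ("r5", "r1", 8), ("r6", "r1", 16)]

for mode in (True, False):
    ops = []
    for dst, base, offset in accesses:
        ops.extend(lower_load(dst, base, offset, has_reg_const_mode=mode))
    print(mode, len(ops), alu_load_dependencies(ops))
```

With the execute stage skewed later, each of those extra ALU-to-load pairs pays the longer ALU-Load latency, which is why the lack of reg+const addressing can shift the balance.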

Mvh./Regards, Niels Jørgen Kruse, Vanløse, Denmark