From: Andy 'Krazy' Glew on
On 5/28/2010 1:56 AM, Phil Carmody wrote:
> "nedbrek"<nedbrek(a)yahoo.com> writes:
>> Hello all,
>>
>> "Skybuck Flying"<IntoTheFuture(a)hotmail.com> wrote in message
>> news:40e23$4bfe029f$54190f09$16210(a)cache2.tilbu1.nb.home.nl...
>>> 1. Why does Delphi replace the shl 2 with two adds like that ?!?
>>
>> On the first P4, add was 1/2 cycle (2 dependent adds in 1 clock). On other
>> machines, adders are often more available than shifters (being cheaper).
>> Sometimes, shifts cost more than one cycle. Probably, the compiler has a
>> bias towards P4.
>
> There are few things cheaper than a shifter. A barrel shifter's
> more expensive, but for stuff like LEA logic, you don't need a
> barrel shifter. LEA was the usual replacement for things like
> shl 2, I don't know what Delphi was thinking. Of course, on the
> First P4s, the LEA logic used the shifter in the ALU, it magically
> became less useful again, but that's intel for you.


But notice: a 1-bit right shifter is cheap. Wires only cross 1 bit position, which they have to do anyway for any form
of add, redundant or non-redundant, that I am aware of. A 1-bit left shifter is a bit more expensive, since wires have
to go backward one bit position. A 2-bit shifter is more expensive, if you want to do it in one cycle without pumping
data twice over those 1-bit position crossing wires that you already have. Multi-bit shifters are even worse.

The Willamette mindset was that cost was all about wires. Or, at least, that was the lesson I took from Willamette.
Adds are cheap because wires only have to cross one bit position, if you are using redundant form. Or, if you want to
do a carry propagate add, you can break up an add into multiple clock cycles, e.g. calculating the low 16 bits in 1
clock with a CPA, the high the next with a CPA, propagating the carry from low to high.

(Anecdote: I suggested that Mike Haertel work with Sager to use redundant add in the Willamette datapath. Instead, in
a little while he had designed what might have been at that time the world's fasttest 16-bit carry-propagate add. I
think the only redundant arithmetic ideas that remained in Willamette were what we now call sum-addressed-memory. I'll
always regret that Intel did not file a patent or publish that when it was invented in 1991, rather than letting Sun
beat us.)

Plus, you pretty much have to have adds.

Multibit shifts are more expensive, as explained above. Sure, the logic is more trivial than add - shifts are just
muxes (conceptually), whereas even redundant adds need XORs or the moral equivalent. But the wiring for a redundant
add, the wires that cross datapath bit-lanes, is less.

Comparisons are more expensive than add. Most comparisons need a carry propagate. Equals/not-equals don't need a carry
propagate, they just need the moral equivalent of a multi-bit and, but even that is more expensive than a redundant add.

(I tried to develop an abstract model of this Willamette era wiring cost model, but I admit I got stumped. It's
straightforward enough to say that wires, length thereof, crossing the datapath is most important. But it's harder to
have a model where equality, with an N-bit wide AND, is cheaper than LT, with an N-bit wide carry-propagate, since both
have wires that are N-bits wide. It's like "some N-bit wide operations are better than others". Clearly the number of
logic stages is the tie breaker. But it is less elegant to mix the two up. It's almost as if the ideal model is cost =
tuple(wire length,logic depth), rather than some function cost = W*wirelength + L*logicdepth. Historically, most folks
designed with L >> W, but in Willamette W >> L. But its hard to balance two factors at the same time, so tuple(wire
length,logic depth) is a good starting point for Willamette era designs.)