From: Andy 'Krazy' Glew on
On 6/17/2010 7:34 AM, MitchAlsup wrote:
> On Jun 17, 12:33 am, Brett Davis<gg...(a)yahoo.com> wrote:
>> What am I missing.
> Secondarily, once you microarchitect a fully out-of-order processor,
> it really does not matter what the native instruction set is. Re-read
> that previous sentence until you understand. The native instruction
> set no longer matters once the microarchitecture has gone fully OoO!

Well, there are small effects, e.g. in code size. x86 is no paragon of virtue, compactness, or regularity in its
instruction bytes, but you can imagine that a "RISCy" 2-register instruction set might have instructions that look like:

reg1 += reg2

and might fit most instructions in 16 bits, with two 5-bit register fields leaving 6 bits for opcodes.

Perhaps with hardware to recognize the instruction sequence

reg1 := reg0; reg1 += reg2

and emit the 3-input operation

reg1 := reg0 + reg2

Given that reg1 += reg2 is much more common than the general form reg1 := reg0 + reg2, there may be a net savings.

I'm not aware of any x86 processor that does this, but the technique is well known, even from before I joined Intel in
1991. At various times it has been called "move elimination".
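A sketch of the fusion idea above, in Python (the opcode names, tuple encoding, and `fuse` helper are all invented for illustration; real hardware would do this in the decode/rename stage, not in software):

```python
# Fuse the two-instruction idiom
#     ("mov", r1, r0) ; ("add", r1, r2)      # reg1 := reg0; reg1 += reg2
# into a single 3-input micro-op
#     ("add3", r1, r0, r2)                   # reg1 := reg0 + reg2

def fuse(instrs):
    """Fuse mov+add pairs (same destination) into 3-operand adds;
    pass every other instruction through unchanged."""
    out = []
    i = 0
    while i < len(instrs):
        if (i + 1 < len(instrs)
                and instrs[i][0] == "mov"
                and instrs[i + 1][0] == "add"
                and instrs[i][1] == instrs[i + 1][1]):  # same destination reg
            mov, add = instrs[i], instrs[i + 1]
            out.append(("add3", add[1], mov[2], add[2]))
            i += 2
        else:
            out.append(instrs[i])
            i += 1
    return out

print(fuse([("mov", "r1", "r0"), ("add", "r1", "r2")]))
# one fused op: [('add3', 'r1', 'r0', 'r2')]
```

Since reg1 += reg2 dominates, most instructions stay 16 bits and only the rarer copy-then-operate idiom pays for the extra decode pattern.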

---

We also have not talked about the possibility of a load-op pipeline, yes, even on an out-of-order CPU. (Anecdote:
Tomasulo took me aside at a conference, and asked me why P6 did not have a load-op pipeline. I knew that we had studied
it; and it has been studied and restudied every 2nd or 3rd processor generation. It usually has insufficient advantage.
I conjecture(d) that the P5 had encouraged non-load-op instructions. Tomasulo pointed out that x86 had many reg,reg
operations, and that this would waste the load- part of a load-op pipeline. My overall feeling, however, is that on a
load-op pipeline you have to handle the possibility of the load missing, so you either have to have a buffer between
load and op, or you have to provide a late read of the register operand of the op. In either case, you have to handle
the possibility of the load-op being decoupled. Or, you could just replay the entire load-op on a cache miss. In any
case, there is wastage - it's the usual centralized versus decentralized buffer issue.)

===

But Brett was asking "Why RISC a:=b+c?", not "Why a+=b or a+=load(mem)?"

And Mitch has provided the answer: x86 has complicated decoding. Market volume amortizes that cost, but it is still a cost.

All other things being equal, I would rather build a RISC, perhaps a 16-bit a+=b RISC as described above.

But all other things are not equal. Out-of-order is a big leveller. Although, if you wanted to have lots of simple
cores, you might want to give up x86.



From: nedbrek on
Hello all,

"Stephen Sprunk" <stephen(a)sprunk.org> wrote in message
news:hvev00$f6m$1(a)news.eternal-september.org...
> On 17 Jun 2010 10:24, Anton Ertl wrote:
>> Stephen Sprunk <stephen(a)sprunk.org> writes:
>>> On 17 Jun 2010 00:33, Brett Davis wrote:
>
> The point I was trying to make is that x86 has no 3-operand add
> instruction like the one he used in his example, nor does RISC allow a
> memory address as the destination of an add instruction as he did in his
> example. I corrected both to show a fairer comparison.

lea r1 = &[r2 + r3]

from (the general form):
lea r1 = &[r2 << {0,1,2,3} + r3 + imm]

I don't have any proof that a compiler will actually emit it... :)
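For what it's worth, the arithmetic of the general form is easy to pin down. A small Python model (the `lea` helper is hypothetical; it models only the address computation, not the instruction encoding):

```python
def lea(r2, r3, shift=0, imm=0):
    """Model of the general form: r1 = (r2 << {0,1,2,3}) + r3 + imm."""
    assert shift in (0, 1, 2, 3)
    return (r2 << shift) + r3 + imm

# Degenerate form used as a 3-operand add: r1 = r2 + r3
print(lea(5, 7))                    # 12
# Scaled-index form, e.g. addressing an array of 8-byte elements:
print(lea(5, 7, shift=3, imm=16))   # 5*8 + 7 + 16 = 63
```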

Ned


From: nedbrek on
Hello all,

"Brett Davis" <ggtgp(a)yahoo.com> wrote in message
news:ggtgp-1B263C.00331217062010(a)news.isp.giganews.com...
> RISC load-store versus x86 Add from memory.
> t = a->x + a->y;

This is a really bad choice (as others have shown)...

I don't know about RISC vs. CISC, but if you want to compare "complexity in
the compiler" vs. "smarts in the hardware" - use Itanium:

x86:
div AX /= BX (like 2 or 3 bytes)

Itanium (from the application writers guide, section 13.3.3.1)(min latency,
13 instructions, 7 PIGs):
frcpa.s0 f8.p6 = f6,f7 ;;
(p6) fma.s1 f9 = f6,f8,f0
(p6) fnma.s1 f10 = f7,f8,f1 ;;
(p6) fma.s1 f9 = f10,f9,f9
(p6) fma.s1 // lots of regs in here
(p6) fma.s1
(p6) fma.s1
(p6) fma.s1
(p6) fma.s1
(p6) fma.s1
(p6) fma.s1
(p6) fnma.d1.s1
(p6) fma.d.s0

My boss and I walked through this sequence one day. You need all these
multiplies to eliminate the approximation error in frcpa. Sadly, I've
forgotten the exact details (we also walked through the timing).
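The fma/fnma ladder is a Newton-Raphson refinement of the seed that frcpa produces. A plain-float sketch of the recurrence (ignoring Itanium's 82-bit register format and the final correctly-rounded step; the seed accuracy below is my guess, not from the manual):

```python
def refine_reciprocal(d, x0, steps=3):
    """Newton-Raphson for x ~= 1/d: each step squares the relative
    error, which is why the sequence needs so many dependent
    multiplies to reach full precision."""
    x = x0
    for _ in range(steps):
        e = 1.0 - d * x   # the fnma: residual error of the estimate
        x = x + x * e     # the fma: corrected estimate
    return x

# Seed accurate to a couple of digits, like a small table lookup:
print(refine_reciprocal(3.0, 0.33))  # converges toward 1/3
```

Roughly, each step doubles the number of correct bits, so a handful of steps takes an ~8-bit seed past double precision, which matches the depth of the dependent fma chain above.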

We compared the 1 GHz McKinley to a 2 GHz Willamette (which were both
shipping, and in similar process). The latencies were (roughly) equal
(McKinley can chew through a lot of FP instructions!).

Of course, the Itanium code is a lot bigger, and uses a whole lot more power
(charging up all those register port reads and writes, and all the
predication and bypass). I guess it didn't matter, because McKinley had
solved the delta-power problem that plagued P4 - they burned max power all
the time!

All this because the Itanium architects refused to have long latency
instructions sully their beautiful architecture.

Ned


From: Anton Ertl on
Stephen Sprunk <stephen(a)sprunk.org> writes:
>On 17 Jun 2010 10:24, Anton Ertl wrote:
>> Stephen Sprunk <stephen(a)sprunk.org> writes:
>>> On 17 Jun 2010 00:33, Brett Davis wrote:
>>>> RISC load-store versus x86 Add from memory.
>>>> t = a->x + a->y;
>>>>
>>>> RISC
>>>> load x,a[0]
>>>> load y,a[1]
>>>> add t = x,y
>>>
>>> load r1, a[0]
>>> load r2, a[1]
>>> add r3, r1, r2
>>> store t, r3
>>>
>>>> x86
>>>> load x,a[0]
>>>> add t = x,a[1]
>>>
>>> load r1, a[0]
>>> add r1, a[1]
>>> store t, r1
>>
>> If t is a local variable, decent C compilers will usually allocate it
>> into a register, and no store is needed.
>
>True, but if you're going to talk about compiler optimizations,

No, I am talking about register allocation.

Here's an example of what the compiler of a student of mine in this
year's compiler course produces for the following program (in an
Algol-family programming language):

struct x y end;

method f(a)
var t:=a.x-a.y; /* no + in this programming language:-) */
return t;
end;

Here's the output:

f:
mov 0(%rsi), %rdx
sub 8(%rsi), %rdx
mov %rdx, %rax
ret

It's easy to see that a resides in %rsi and t resides in %rdx. Does
this compiler optimize? No. E.g., it did not perform the copy
propagation or register coalescing that would have allowed it to
optimize the final (return) mov away.

>then
>odds are the code is unlikely to resemble what you wrote in a HLL in the
>first place except for the most trivial of programs.

Have you ever looked at the output of an optimizing compiler? Most of
the time the translation is pretty straightforward.

>The point I was trying to make is that x86 has no 3-operand add
>instruction like the one he used in his example,

True, but it's not needed. That could be written just as well as:

load t=a[0]
add t+=a[1]

>nor does RISC allow a
>memory address as the destination of an add instruction as he did in his
>example.

He didn't. The destination is in a register.

>> What you may be thinking of is that the microarchitectures of current
>> high-performance CISC and RISC CPUs are relatively similar, and quite
>> different from the microarchitectures of CISC and RISC CPU when RISCs
>> were introduced.
>
>Alternately, one can look at a modern x86 chip as a core that runs a
>model-specific RISC ISA hidden behind a decoder that translates x86 CISC
>instructions into that ISA.

Let's see.

Intel: The original P6 uops have 118 bits (they may have grown since
then; the P6 is the basis of the Core i line) according to
Microprocessor Report 9(2). A bit longer than a RISC instruction.

AMD: The K7/K8/K10 microarchitecture contains macro-ops (including
read-modify-write instructions), that are later split into micro-ops
(which still include a micro-instruction that does a read and a write
to the same address). Quite unriscy features.

>That may offend purists, but IMHO it's
>accurate enough for those of us who don't actually design CPUs.

Yes, given that the interface is the ISA, you can invent any fairy
tale you like about what's going on behind that interface, and hardly
anybody will care (apart from the few of us who try to make sense of
the performance counters, and even we prefer to count architectural
events like committed instructions over microarchitectural events such
as started uops).

- anton
--
M. Anton Ertl Some things have to be seen to be believed
anton(a)mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html
From: Stephen Sprunk on
On 18 Jun 2010 07:08, Anton Ertl wrote:
> Stephen Sprunk <stephen(a)sprunk.org> writes:
>> On 17 Jun 2010 10:24, Anton Ertl wrote:
>>> Stephen Sprunk <stephen(a)sprunk.org> writes:
>>>> On 17 Jun 2010 00:33, Brett Davis wrote:
>>>>> RISC load-store versus x86 Add from memory.
>>>>> t = a->x + a->y;
>>>>>
>>>>> RISC
>>>>> load x,a[0]
>>>>> load y,a[1]
>>>>> add t = x,y
>>>>
>>>> load r1, a[0]
>>>> load r2, a[1]
>>>> add r3, r1, r2
>>>> store t, r3
>>>>
>>>>> x86
>>>>> load x,a[0]
>>>>> add t = x,a[1]
>>>>
>>>> load r1, a[0]
>>>> add r1, a[1]
>>>> store t, r1
>>>
>>> If t is a local variable, decent C compilers will usually allocate it
>>> into a register, and no store is needed.
>>
>> True, but if you're going to talk about compiler optimizations,
>
> No, I am talking about register allocation.

That is a (very basic) optimization.

Have you looked at what GCC does at -O0, i.e. with all optimization
disabled? It translates each statement into self-contained assembly
which loads, operates on, and then stores the relevant variables--even
if two successive statements operate on the same variables. For instance:

x=a+b;
y=a+b;

gets translated into something like this:

load r1, a
load r2, b
add r1, r2
store x, r1
load r1, a
load r2, b
add r1, r2
store y, r1

> It's easy to see that a resides in %rsi and t resides in %rdx. Does
> this compiler optimize? No. E.g., it did not perform the copy
> propagation or register coalescing that would have allowed it to
> optimize the final (return) mov away.

Not performing optimization A isn't proof that you're not performing
optimization B.

>> then odds are the code is unlikely to resemble what you wrote in a
>> HLL in the first place except for the most trivial of programs.
>
> Have you ever looked at the output of an optimizing compiler? Most of
> the time the translation is pretty straightforward.

I have plenty of times. Perhaps my assembly skills aren't as good as
yours, but for the most part, I see little resemblance between the C
code and the assembly (for non-trivial functions) when I crank up GCC to
maximum optimization. Loop unrolling, strength reduction, dead code
elimination, common sub-expression elimination, load hoisting, inlining,
etc. can all cause significant changes.

>> The point I was trying to make is that x86 has no 3-operand add
>> instruction like the one he used in his example,
>
> True, but it's not needed. That could be written just as well as:
>
> load t=a[0]
> add t+=a[1]

If you're going to load a[0] and a[1], you need to store t. It's called
symmetry.

>> nor does RISC allow a memory address as the destination of an add
>> instruction as he did in his example.
>
> He didn't. The destination is in a register.

If you're going to claim that t lives in a register, then you might as
well claim that a[0] and a[1] do as well and eliminate those loads.
However, the OP didn't do that, and in any case that just means the
loads (and stores) are probably somewhere else in the program, so
eliminating them from the snippet does not paint a true picture of
what's going on.

>>> What you may be thinking of is that the microarchitectures of current
>>> high-performance CISC and RISC CPUs are relatively similar, and quite
>>> different from the microarchitectures of CISC and RISC CPU when RISCs
>>> were introduced.
>>
>> Alternately, one can look at a modern x86 chip as a core that runs a
>> model-specific RISC ISA hidden behind a decoder that translates x86 CISC
>> instructions into that ISA.
>
> Let's see.
>
> Intel: The original P6 uops have 118 bits (they may have grown since
> then; the P6 is the basis of the Core i line) according to
> Microprocessor report 9(2). A bit longer than a RISC instruction.

And how big are instructions in a traditional RISC core after decoding?
Is that even relevant, since the point of RISC is reduced _complexity_
rather than _size_? (RISC programs are usually bigger than CISC ones,
both in total and average instruction size, and modern RISCs have larger
instruction sets as well.)

> AMD: The K7/K8/K10 microarchitecture contains macro-ops (including
> read-modify-write instructions), that are later split into micro-ops
> (which still include a micro-instruction that does a read and a write
> to the same address). Quite unriscy features.

How is that possible, since AMD's own docs say that a plain write
requires a store-address uop and a store-data uop?

S

--
Stephen Sprunk "God does not play dice." --Albert Einstein
CCIE #3723 "God is an inveterate gambler, and He throws the
K5SSS dice at every possible opportunity." --Stephen Hawking