RISC load-store versus x86 Add from memory. [Computer Architecture]


From: Andy 'Krazy' Glew on 18 Jun 2010 11:06

On 6/18/2010 6:17 AM, Stephen Sprunk wrote:
> On 18 Jun 2010 07:08, Anton Ertl wrote:
>> Stephen Sprunk<stephen(a)sprunk.org> writes:
>>> On 17 Jun 2010 10:24, Anton Ertl wrote:
>>>> Stephen Sprunk<stephen(a)sprunk.org> writes:
>>>>> On 17 Jun 2010 00:33, Brett Davis wrote:

>>>> What you may be thinking of is that the microarchitectures of current
>>>> high-performance CISC and RISC CPUs are relatively similar, and quite
>>>> different from the microarchitectures of CISC and RISC CPU when RISCs
>>>> were introduced.
>>>
>>> Alternately, one can look at a modern x86 chip as a core that runs a
>>> model-specific RISC ISA hidden behind a decoder that translates x86 CISC
>>> instructions into that ISA.
>>
>> Let's see.
>>
>> Intel: The original P6 uops have 118 bits (they may have grown since
>> then; the P6 is the basis of the Core i line) according to
>> Microprocessor report 9(2). A bit longer than a RISC instruction.
>
> And how big are instructions in a traditional RISC core after decoding?
> Is that even relevant, since the point of RISC is reduced _complexity_
> rather than _size_? (RISC programs are usually bigger than CISC ones,
> both in total and average instruction size, and modern RISCs have larger
> instruction sets as well.)

Nit-picking:

RISC, as originally defined, stood for "Reduced Instruction Set Computer". (D. A. Patterson and D. R. Ditzel, "The case
for the Reduced Instruction Set Computer," ACM SIGARCH Computer Architecture News, vol. 8, no. 6, pp. 25-33, Oct. 1980.)

This attitude was reflected by "instruction counting". Things like saying "I don't need a register to register move
instruction, since I can synthesize that by doing ADD rdest = rsrc + 0". Or "I don't need a register clear instruction,
since I can synthesize that by doing SUB rdest = rdest - rdest". These things are true - but they also ultimately lead
to more complicated machines, since, as I posted just last night, you may want to optimize register move or register
clear separately - and now to do so you must decode more of the instruction.
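To make the decode cost concrete, here is a minimal sketch of the extra pattern matching needed once move and clear are only available as ADD/SUB idioms. The `Insn` tuple and `classify` function are hypothetical names for illustration, not any real decoder's structure.

```python
from typing import NamedTuple, Union

class Insn(NamedTuple):
    op: str                # "ADD" or "SUB" in this toy ISA
    dest: str              # destination register name
    src1: str              # first source register name
    src2: Union[str, int]  # second source: register name or integer immediate

def classify(insn: Insn) -> str:
    """Return which special case the decoder has to spot, if any."""
    # ADD rdest = rsrc + 0 is really a register-to-register move.
    if insn.op == "ADD" and insn.src2 == 0:
        return "move"
    # SUB rdest = r - r is really a register clear.
    if insn.op == "SUB" and insn.src1 == insn.src2:
        return "clear"
    return "alu"
```

With a dedicated MOV or CLR opcode the same decision is a single opcode compare; with the idioms the decoder has to look at the operand fields as well, which is the "decode more of the instruction" cost.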

By the way, the "instruction counting" attitude led to many of the irregularities of MMX. While one might think that you
should have all combinations of signed and unsigned, 8, 16, and 32 bit saturation
dest := no_saturation( src1 + src2 ) for 8x8, 4x16, 2x32
dest := unsigned_saturation( src1 + src2 ) for 8x8, 4x16, 2x32
dest := unsigned_saturation( unsigned(src1) + signed(src2) ) for 8x8, 4x16, 2x32
dest := signed_saturation( src1 + src2 ) for 8x8, 4x16, 2x32
etc., since "instruction count" was one of the metrics that MMX was scored on, IMHO gratuitous irregularities were
added. (Anecdote: at the dinner celebrating MMX being committed to product, I passed out packets of Swiss cheese. I
should have gilded them.)
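For readers who have not met saturating arithmetic, a small sketch of the regular table being described: one packed add, parameterised by lane width and saturation mode. This models the idea only and is not the definition of any actual MMX instruction; `packed_add` and its parameters are made up for illustration, and the mixed unsigned+signed variant is left out.

```python
def packed_add(a: int, b: int, lane_bits: int = 8, lanes: int = 8,
               mode: str = "wrap") -> int:
    """Add two packed integers lane by lane.

    mode is "wrap" (no saturation), "unsigned" or "signed".
    """
    mask = (1 << lane_bits) - 1
    sign = 1 << (lane_bits - 1)
    out = 0
    for i in range(lanes):
        x = (a >> (i * lane_bits)) & mask
        y = (b >> (i * lane_bits)) & mask
        if mode == "signed":
            # Reinterpret each lane as two's complement, then clamp.
            x = x - (1 << lane_bits) if x & sign else x
            y = y - (1 << lane_bits) if y & sign else y
            s = max(-sign, min(sign - 1, x + y))
        elif mode == "unsigned":
            s = min(mask, x + y)
        else:
            s = x + y  # wraps when masked below
        out |= (s & mask) << (i * lane_bits)
    return out
```

A fully regular ISA would expose every (lane width, mode) combination of this function; the complaint above is that the shipped set has holes in that table.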

Many of us at the time thought that it should stand for something else. Was it Wirth or Tanenbaum who said "Regular
Instruction Set Computer"? I pitched "SEISM" (Small Efficient Instruction Set Machine) and "RAMM" (Reduced Addressing
Mode Machine), since I thought that the key thing was having reduced operand addressing modes rather than addressing
modes like Mem[Mem[reg+offset1]+offset2]. I had no objection to multi-cycle instructions like FMUL.

At the time many of the RISC advocates were adamant that a true RISC would not have any instruction that could not be
executed in a single ALU cycle back to back, or pipelined like load and store. This attitude led to refusing to allow
multi-cycle instructions like FMUL unless you could afford a full-width 32x32 or 64x64 multiplier. If you could only
afford a 64x8 slice, these folks said that you had to have a RISC instruction that did that, rather than having a 64x64
multiply instruction that you either implemented via a hardware state machine or microcode. As for FDIV ...
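As a rough illustration of the state-machine alternative, here is a sketch that builds a 64x64 multiply (low 64 bits only) out of eight 64x8 partial products, one per step. It models the arithmetic, not any real multiplier; `mul64_by_slices` is a hypothetical name.

```python
MASK64 = (1 << 64) - 1

def mul64_by_slices(a: int, b: int) -> int:
    """Low 64 bits of a * b, using only 64-bit x 8-bit multiplies."""
    acc = 0
    for step in range(8):
        byte = (b >> (8 * step)) & 0xFF     # next 8-bit slice of the multiplier
        partial = (a & MASK64) * byte       # the only multiply the "hardware" has
        acc = (acc + (partial << (8 * step))) & MASK64
    return acc
```

The purist position was that each loop iteration should be its own architectural instruction; the alternative is one multiply instruction whose eight steps are sequenced by a hardware state machine or microcode.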

There are many papers with lists of RISC properties, like "fixed width 32 or 64 bit instructions." "No special
registers". Etc.

x86 ucode does not match all of these definitions of RISC.

From: MitchAlsup on 18 Jun 2010 12:36

On Jun 18, 12:17 am, Andy 'Krazy' Glew <ag-n...(a)patten-glew.net>
wrote:
> Perhaps with hardware to recognize the instruction sequence
>
> reg1 := reg0; reg1 += reg2
>
> and emit the 3-input operation
>
> reg1 := reg0 + reg2
>
> Given that reg1 += reg2 is much more common than the general form reg1 := reg0 + reg2, there may be a net savings.
>
> I'm not aware of any x86 processor that does this, but this technique is well known even from before I joined Intel in
> 1991. At various times it has been called "move elimination".

K9 would have, also did branch fusing, and constant folding (mostly
for stack push/pop arithmetic.)
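
A toy sketch of the fusion being discussed, written as a peephole pass over a list of operations. This is an assumption-laden illustration of the idea, not how K9 or any other design implemented it; the tuple encoding and `fuse_moves` are invented here.

```python
def fuse_moves(uops):
    """Fuse 'reg1 := reg0; reg1 += reg2' into 'reg1 := reg0 + reg2'.

    Each uop is ("mov", dest, src) or ("add", dest, src1, src2).
    """
    out = []
    i = 0
    while i < len(uops):
        cur = uops[i]
        nxt = uops[i + 1] if i + 1 < len(uops) else None
        if (cur[0] == "mov" and nxt is not None and nxt[0] == "add"
                and nxt[1] == cur[1] and nxt[2] == cur[1]):
            dest, moved_from = cur[1], cur[2]
            # If the second source is also the freshly moved register,
            # it must read the move's source too.
            src2 = moved_from if nxt[3] == dest else nxt[3]
            out.append(("add", dest, moved_from, src2))
            i += 2
        else:
            out.append(cur)
            i += 1
    return out
```

For example, `fuse_moves([("mov", "r1", "r0"), ("add", "r1", "r1", "r2")])` gives `[("add", "r1", "r0", "r2")]`. Real hardware would do this on adjacent instructions in the decoder, with the extra comparators that implies.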

>
> We also have not talked about the possibility of a load-op pipeline, yes, even on an out-of-order CPU.

While I generally like this style of pipeline for x86 InOrder
machines, it generally brings "nothing to the party" on the OutofOrder
machines.

> But, Brett was asking "Why RISC a:=b+c?", not "Why a+=b or a+=load(mem)?"
>
> And Mitch has provided the answer. x86 has complicated decoding. Market volumes amortize, but it is still a cost.

The other part of the answer is that after a certain level of
pipeline complexity, the difference between x86 and RISC is "just
another" pipe stage--and at some point it fails to matter in any
architecture figure of merit metric. Thus, manufacturing or sales
volume wins.

> All other things being equal, I would rather build a RISC, perhaps a 16-bit a+=b RISC as described above.
>
> But all other things are not equal. Out-of-order is a big leveller. Although, if you wanted to have lots of simple
> cores, you might want to give up x86.

All things being equal, I would like the "Instruction set Wars" to die
so the designers can get on with improving the underlying
architectures.

Mitch

From: nmm1 on 18 Jun 2010 13:15

In article <4C1B8B7C.6080509(a)patten-glew.net>,
Andy 'Krazy' Glew <ag-news(a)patten-glew.net> wrote:
>Nit-picking:
>
>RISC, as originally defined, stood for "Reduced Instruction Set Computer".
>(D.A. Patterson and D. R. Ditzel, "The case for the Reduced Instruction
>Set Computer," ...
>
>At the time many of the RISC advocates were adamant that a true RISC
>would not have any instruction that could not be executed in a single
>ALU cycle back to back, or pipelined like load and store. ...
>
>There are many papers with lists of RISC properties, like "fixed width
>32 or 64 bit instructions." "No special registers". Etc.
>
>x86 ucode does not match all of these definitions of RISC.

As far as I recall, very few of the "RISC architectures" that hit the
marketplace did. Even the ones that did so most dogmatically didn't
do so for floating-point - probably because the religious zealots
didn't actually know anything about it and had to employ heretics
to design that aspect :-)

Regards,
Nick Maclaren.

From: Robert Myers on 18 Jun 2010 13:15

MitchAlsup wrote:

>
> All things being equal, I would like the "Instruction set Wars" to die
> so the designers can get on with improving the underlying
> architectures.
>

Everyone say Amen. The ISA seems to be the level at which the
architecture is understood by a huge number of people who come into any
kind of direct contact with microprocessors. That is to say, it's the
only thing they know enough about to talk about, so they continue to
talk about it, even though it is no longer much of any issue, except as
a market volume issue (x86 vs. Power vs. ia64 vs. ARM vs. MIPS, etc.).

Since, for x86, at least, ISA design is a matter of deciding which
micro-op sequences are to be predefined, I presume there is still room
for discussing trade-offs between decoder complexity and predefinition
of micro-op sequences.

Robert.

From: R. Matthew Emerson on 18 Jun 2010 15:28

"nedbrek" <nedbrek(a)yahoo.com> writes:

> But Mitch has it right. Architecture does not matter.

You guys keep saying this, and maybe for the large majority of people it is
even true.

But I still say that ISA makes a difference. As an example, our Common
Lisp implementation targeted only PowerPC for a long time. Porting it
to x86-64 was a lot of work for a couple of reasons:

* An x86 assembler and disassembler are complicated and fiddly.

* Going from 32 registers on PowerPC to 16 on x86-64 was like moving
from a king-size bed down to a double bed.

Later, when we ported the lisp to 32-bit x86, the register problem
became even more acute. We spent a lot of time trying to fit everything
into 8 registers. If x86-64 is a double bed, then 32-bit x86 is a bunk
in a submarine.

I gave a lightning talk about this at the recent International Lisp
Conference. The slides and one-pager from the proceedings are at
http://www.clozure.com/~rme/.

We've recently been porting to ARM. Progress has been surprisingly
rapid, and that is in large part due to the ease of working with the ARM
ISA. An assembler and disassembler are straightforward to write, and
there are enough registers that we're not always looking around
desperately for a place to put things.
