From: MitchAlsup on
On Jun 18, 2:28 pm, r...(a)clozure.com (R. Matthew Emerson) wrote:
> "nedbrek" <nedb...(a)yahoo.com> writes:
> > But Mitch has it right.  Architecture does not matter.

I think it might be better to say, Instruction sets don't matter, the
rest of what we call architecture does matter, now and again.

> You guys keep saying this, and maybe for the large majority of people it
> is even true.
>
> But I still say that ISA makes a difference.  As an example, our Common
> Lisp implementation targeted only PowerPC for a long time.  

Here we have the classic mismatch of architecture and application. I
might note that those machines that had the instruction set
infrastructure to support <the various> LISPs did not end up surviving
into the present (save, <ahem> SPARC). These architectures were also
pretty good at Prolog, and at emulating other instruction sets.

Me, I write LISP in C. That is, for those applications (and there are
some) that are best written in the LISP style (without a
self-interpreting nature), I write them in C. The 88K assembler code
scheduler is one in particular.

Mitch

{Note: I am in no way deriding your product or its needs.}
From: nmm1 on
In article <0922168e-6d6f-4480-85ec-fa5996c336a7(a)z10g2000yqb.googlegroups.com>,
MitchAlsup <MitchAlsup(a)aol.com> wrote:
>On Jun 18, 2:28 pm, r...(a)clozure.com (R. Matthew Emerson) wrote:
>> "nedbrek" <nedb...(a)yahoo.com> writes:
>> > But Mitch has it right.  Architecture does not matter.
>
>I think it might be better to say, Instruction sets don't matter, the
>rest of what we call architecture does matter, now and again.

Er, no. Sorry. I agree that the days when the instruction set made
a big difference to the performance are long gone - but that's only
20 years gone, not 40.

However, the same does NOT apply to RAS and usability. Any defects
cause trouble to compilers and debuggers, and one result is higher
software costs, and lower RAS and usability. Also, most weird
properties, dogmas etc. have a tendency to show through.

A classic example here is the way that integer overflow used to be
(and occasionally still is) trapped - but overflow in multiplication
rarely was. Why? Well, the basic instructions rarely did ....
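
To make the asymmetry concrete, here is a minimal sketch in Python that
emulates 32-bit two's-complement arithmetic. The names (add_overflows,
mul_overflows, BITS) are made up for illustration and model no particular
instruction set. Overflow in addition falls out of the signs of the operands
and the result, which is what a typical overflow flag reports. Overflow in
multiplication only shows up if the double-width product is available, and a
multiply that keeps only the low half has already thrown that away.

```python
BITS = 32                      # assumed word size for this sketch
MASK = (1 << BITS) - 1
INT_MAX = (1 << (BITS - 1)) - 1
INT_MIN = -(1 << (BITS - 1))


def to_signed(x):
    """Interpret the low BITS bits of x as a two's-complement integer."""
    x &= MASK
    return x - (1 << BITS) if x & (1 << (BITS - 1)) else x


def add_overflows(a, b):
    """Sign rule: the operands agree in sign but the wrapped result does not."""
    r = to_signed(a + b)
    return (a < 0) == (b < 0) and (r < 0) != (a < 0)


def mul_overflows(a, b):
    """Needs the full product: overflow if it does not fit back into BITS bits."""
    full = a * b               # Python integers keep the double-width result
    return to_signed(full) != full
```

For example, add_overflows(INT_MAX, 1) and mul_overflows(65536, 65536) both
report an overflow, but only the first can be decided from the single-width
result alone.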


Regards,
Nick Maclaren.
From: Brett Davis on
In article <2010Jun17.172422(a)mips.complang.tuwien.ac.at>,
anton(a)mips.complang.tuwien.ac.at (Anton Ertl) wrote:
> Stephen Sprunk <stephen(a)sprunk.org> writes:
> >On 17 Jun 2010 00:33, Brett Davis wrote:
> >> RISC load-store versus x86 Add from memory.
> >> t = a->x + a->y;
> >>
> >> RISC
> >load r1, a[0]
> >load r2, a[1]
> >add r3, r1, r2
> >
> >> x86
> >load r1, a[0]
> >add r1, a[1]
>
> >> RISC shows its superiority by being 50% more instructions and 50% slower...
>
> It's just as easy to find an example where IA-32 and AMD64 have 100%
> more instructions:
>
> x = y+z;
>
> where x, y, and z are locals that live in registers, and y and z are
> alive after this statement. On RISC:
>
> add x<-y+z;
>
> On IA-32/AMD64:
>
> mov x<-y
> add x<-x+z

"Move elimination" has been mentioned in this thread, and I confirmed that
Intel is merging the load micro-op into the add micro-op.
From "Intel® 64 and IA-32 Architectures Optimization Reference Manual"
page 2-9, section 2.1.2.6:
http://www.intel.com/Assets/PDF/manual/248966.pdf
http://www.intel.com/products/processor/manuals/
(AMD may have done this first; each AMD integer unit has an address unit.)

"
2.1.2.6 Micro-fusion
Micro-fusion fuses multiple u-ops from the same instruction into a single complex
u-op. The complex u-op is dispatched in the out-of-order execution core. Micro-fusion
provides the following performance advantages:
- Improves instruction bandwidth delivered from decode to retirement.
- Reduces power consumption as the complex u-op represents more work in a
  smaller format (in terms of bit density), reducing overall "bit-toggling" in the
  machine for a given amount of work and virtually increasing the amount of
  storage in the out-of-order execution engine.
Many instructions provide register flavors and memory flavors. The flavor involving a
memory operand will decodes into a longer flow of u-ops than the register version.
Micro-fusion enables software to use memory to register operations to express the
actual program behavior without worrying about a loss of decode bandwidth.
"

See also "2.1.2.4 Instruction Decode"

You can do the same with an OoO RISC chip, but it's harder; I believe you would need
an extra write port. I do not know of a RISC chip that does fusion with reads,
but I do know that PowerPC does some micro-fusion on other opcodes.

We are back to my original question, is Add from Memory RISCier than RISC
for a hugely OoO design?

(The real win is less than 50%, far less, you have to be starved for issue slots.)
The power savings is real, and important.
From: jacko on
On Jun 19, 8:06 am, Brett Davis <gg...(a)yahoo.com> wrote:
> In article <2010Jun17.172...(a)mips.complang.tuwien.ac.at>,
>  an...(a)mips.complang.tuwien.ac.at (Anton Ertl) wrote:

> We are back to my original question, is Add from Memory RISCier than RISC
> for a hugely OoO design?
>
> (The real win is less than 50%, far less, you have to be starved for issue slots.)
> The power savings is real, and important.

Yes. I wonder whether, in my NiBZ design, adding extra cycles to the
instruction would significantly reduce area/power. The shadow registers
take up some space, and the 3 in 1 decoder takes up time, slowing the
path to memory and reducing Fmax. Maybe a single extra cycle could fix
both of these, freeing up space for something else of use.

Cheers Jacko.
From: nedbrek on
Hello all,

"Brett Davis" <ggtgp(a)yahoo.com> wrote in message
news:ggtgp-6935B6.02064019062010(a)news.isp.giganews.com...
>
> 2.1.2.6 Micro-fusion
>
> You can do the same with an OoO RISC chip, but it's harder; I believe you
> would need an extra write port. I do not know of a RISC chip that does
> fusion with reads, but I do know that PowerPC does some micro-fusion on
> other opcodes.
>
> We are back to my original question, is Add from Memory RISCier than RISC
> for a hugely OoO design?
>
> (The real win is less than 50%, far less, you have to be starved for issue
> slots.) The power savings is real, and important.

I believe the sequence you are describing is:
add r1 += [r2]

The advantage CISC has is that the uop sequence looks like:
ld tmp = [r2]
add r1 += tmp

Since tmp is not an architected register, it does not have to be preserved
for an interrupt, or seen past the use in add (it is known dead). Thus, it
can exist strictly in the bypass network (it is not allocated a rename
register, it is not visible to later instructions [does not participate in
renaming], and has no architected effects at retirement).
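
A toy model of that difference, assuming a renamer that hands out one
physical register per architected destination and nothing at all for a
temporary that lives only in the bypass network. The rename function, the
(dest, sources, architected) tuple layout, and the pool size of 8 are all
hypothetical; a real renamer is far more involved.

```python
def rename(uops, num_phys=8):
    """Rename a list of (dest, sources, architected) uops.

    Returns how many physical registers were taken from the free list.
    """
    rat = {}                      # architected register -> physical register
    free = list(range(num_phys))  # free list of physical registers
    for dest, _sources, architected in uops:
        if architected:
            rat[dest] = free.pop(0)  # result needs a rename register
        # else: the temporary is forwarded on the bypass, nothing is allocated
    return num_phys - len(free)


# add r1 += [r2], cracked into a load to a hidden temporary plus an add
CISC_UOPS = [("tmp", ("r2",), False), ("r1", ("r1", "tmp"), True)]

# ld r3 = [r2]; add r1 += r3, where r3 is architecturally visible
RISC_UOPS = [("r3", ("r2",), True), ("r1", ("r1", "r3"), True)]
```

Under these assumptions rename(CISC_UOPS) consumes one physical register and
rename(RISC_UOPS) consumes two.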

The RISC sequence will always be (ld r3 = [r2]; add r1 += r3). r3 is live
out, and must be architecturally visible. You can smash ops together,
giving you r3,r1 = load-op [r2] + r1

You can't say just "need an extra write port" unless you have a simple 5
stage pipeline. In a modern machine, this means extra decode bits (in the
scheduler and ROB), extra RAT ports, extra complexity come retirement time
(do you allow every instruction to update two entries in the retirement
register table?)
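
As a rough way to see where those extra ports come from, the sketch below
counts architected destinations per op. The function names and the
(dests, sources) tuple layout are made up for illustration, and equating
"destinations per op" with RAT or retirement-table write ports per slot is a
simplifying assumption, not a description of any real machine.

```python
def max_dests_per_op(ops):
    """Largest number of architected destinations written by any single op."""
    return max(len(dests) for dests, _sources in ops)


def total_dests(ops):
    """Total architected register writes across the whole sequence."""
    return sum(len(dests) for dests, _sources in ops)


# ld r3 = [r2]; add r1 += r3 as two separate ops
SEPARATE = [(("r3",), ("r2",)), (("r1",), ("r1", "r3"))]

# r3,r1 = load-op [r2] + r1 as one fused op
FUSED = [(("r3", "r1"), ("r2", "r1"))]
```

Both forms write two architected registers in total, but the fused form packs
both writes into a single op, so every slot that can hold such an op has to be
able to update two table entries.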

Ned