From: Terje Mathisen "terje.mathisen at on
Brett Davis wrote:
> Intel stated publicly that predication is a net loss because you
> execute more instructions, and branch predictors are ~99.5%
> accurate. (Before Itanic, of course. ;)
>
> Test
> Conditional Add
> Next opcode
>
> Versus:
>
> Test and Branch to Next
> Add (skipped over, never seen by pipeline)
> Next opcode
>
> Of course with compressed data decoding I can break that branch
> predictor hard, and suffer ~15 cycle pipe flushes per fail.
> Ergo CMOVE.
>
> So is CMOVE still implemented internally as a branch?
> (I know this is crazy sounding, but that is what both did...)

Not a real branch, but it did hold up the pipeline for a short while afair?

Also, AFAIR, CMOV took constant time whether it moved the data or not,
i.e. it effectively reads both the source and the target, selects
according to the relevant flags, and writes back the result.

Unlike on a predicated architecture, you cannot execute two simultaneous
CMOV operations with opposite flag conditions on the same target register.
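
To make that concrete, here's a minimal C sketch of the two code shapes
(made-up names; whether a compiler really emits a CMOV for the ternary
depends on target and flags, but the shape of the work is the point):

#include <stdint.h>

/* Branchy form: the add only happens when the test passes, so the
   hardware has to predict the branch. */
uint32_t add_if_branchy(uint32_t acc, uint32_t delta, int cond)
{
    if (cond)
        acc += delta;
    return acc;
}

/* CMOV-style form: both candidate values exist and the flag selects
   one of them.  The work is constant whether or not the condition
   holds, which is what makes it immune to mispredictions. */
uint32_t add_if_select(uint32_t acc, uint32_t delta, int cond)
{
    uint32_t taken   = acc + delta;   /* value if the condition holds */
    uint32_t skipped = acc;           /* value if it does not         */
    return cond ? taken : skipped;    /* often compiles to a CMOV     */
}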

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
From: Kai Harrekilde-Petersen on
Owen Shepherd <owen.shepherd(a)e43.eu> writes:

> wrote:
>
>> Owen Shepherd wrote:
>>> Certainly AVR32 and Thumb show between them that many of the
>>> features can be kept; it's just a matter of value - note that
>>> neither Thumb nor AVR32 is a highly orthogonal architecture like
>>> more traditional designs, as they're much more aimed towards
>>> compilers.
>>
>> Huh?
>>
>> So are you saying that Thumb/AVR32 are non-orthogonal, because they are
>> aimed at compiled code, or that 'more traditional designs' have that aim?
>>
>> The latter does make sense, but not to my non-native reading of your
>> english. :-(
>>
>> Terje
>>
>
> More traditional RISC designs tend to be highly orthogonal; Thumb(2) and
> AVR32 have a bunch of oddness about them because they are designed for
> compilers to target (rather than with the expectation of humans writing
> assembly by hand).
>
> A couple of simple examples from Thumb2:
> 1. Registers r0-r7 are preferred to r8-r12*, because most instructions
> only use 3 bits to encode each register operand (Thumb2 added a bunch
> of longer opcodes to make the upper registers more accessible, but
> they're 32-bit instructions)
> 2. ARM has an array of addressing modes for STM/LDM: increment before,
> increment after, decrement before, decrement after. Thumb only has
> STM decrement before (STMDB) and LDM increment after (LDMIA). This
> is not coincidentally the way the stack operates
>
> Owen
>
> * Remember that r13=SP, r14=LR, r15=PC, so they're somewhat less useful
> from many perspectives

Basically, they've traded orthogonality for code density. In
small/cost-sensitive embedded designs, where the code footprint accounts
for a significant part of the IC area and thereby cost, this could be
just the right solution.
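
As a rough illustration of why those two are the forms that matter for
a full-descending stack, here's a toy C model (made-up helper names,
not ARM code): push decrements before each store, pop loads and then
increments.

#include <stddef.h>
#include <stdint.h>

/* Toy ARM-style full-descending stack: the stack pointer addresses the
   last stored word and grows toward lower addresses. */

/* Push = STMDB-like: decrement before each store.  The highest-indexed
   register ends up at the highest address, so walk the list backwards. */
void push_words(uint32_t **sp, const uint32_t *regs, size_t n)
{
    while (n--) {
        (*sp)--;          /* decrement before ... */
        **sp = regs[n];   /* ... store            */
    }
}

/* Pop = LDMIA-like: load, then increment after each load. */
void pop_words(uint32_t **sp, uint32_t *regs, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        regs[i] = **sp;   /* load ...            */
        (*sp)++;          /* ... increment after */
    }
}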


Kai
--
Kai Harrekilde-Petersen <khp(at)harrekilde(dot)dk>
From: Brett Davis on
In article <i3rcrv$44e$1(a)news.eternal-september.org>,
"nedbrek" <nedbrek(a)yahoo.com> wrote:

> Hello all,
>
> "Brett Davis" <ggtgp(a)yahoo.com> wrote in message
> news:ggtgp-776ECA.23240209082010(a)news.isp.giganews.com...
> >
> > So is CMOVE still implemented internally as a branch?
> > (I know this is crazy sounding, but that is what both did...)
>
> The biggest problem with CMOV is the renamer (so, it is easy to handle for
> an in-order machine).
>
> Given the sequence
> ld r4 = [r0]
> add r1 += r3
> cmov r1 = zf ? r1 : r4
> sub r6 -= r1
>
> When you rename the subtract, you need to connect it to either the
> instruction producing r1 (the add) or the producer of r4 (based on the
> flags, which have [potentially] a third producer).

I was assuming that CMOV was just another ALU op like ADC, which is
an add plus a flag used as a bit. No renamer issues possible.
Just shorthand for a full binary select: r1 = r2 ? r3 : r4

So CMOV does not issue to the integer unit; it just acts as a rename.
You end up with either a nop, or with r1 and r4 both pointing at the
same physical register.
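
Roughly this, as a toy C sketch of the map update (illustration only,
not how any real renamer is built; and the catch, of course, is that
the flag has to be known at rename time):

#include <stdbool.h>
#include <stdint.h>

#define NUM_ARCH_REGS 16

/* Toy rename map: one physical-register tag per architectural register. */
typedef struct {
    uint16_t phys[NUM_ARCH_REGS];
} rename_map;

/* cmov rd = flag ? rd : rs  (e.g. cmov r1 = zf ? r1 : r4)
   If the flag keeps the old value, the op is effectively a nop;
   otherwise rd is simply repointed at rs's physical register, so both
   architectural names share one physical value and no ALU work is done. */
void rename_cmov(rename_map *map, unsigned rd, unsigned rs, bool flag)
{
    if (!flag)
        map->phys[rd] = map->phys[rs];  /* rd and rs now alias one tag */
    /* flag set: keep rd's existing mapping, i.e. a nop */
}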

This is only a win if all three integer pipes are full, because it
will still take a cycle to turn the flag into a rename.
(It will save power, though: no register reads, no ALU work, no
register write.)

Correct?

Brett
From: Terje Mathisen "terje.mathisen at on
Kai Harrekilde-Petersen wrote:
> Owen Shepherd<owen.shepherd(a)e43.eu> writes:
>> A couple of simple examples from Thumb2:
>> 1. Registers r0-r7 are preferred to r8-r12*, because most instructions
>> only use 3 bits to encode each register operand (Thumb2 added a bunch
>> of longer opcodes to make the upper registers more accessible, but
>> they're 32-bit instructions)

x86 prefers AL/AX/EAX for many instructions, since they have special,
shorter encodings.

It also prefers, in 64-bit mode, the 8 old registers vs the 8 new, since
those new regs require an extra prefix byte.
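
To put byte counts on that, here are the standard x86-64 encodings of
the same add-immediate in the three cases (a quick C table; an
assembler may of course pick other forms for other immediates):

#include <stdio.h>

/* Encoded lengths of "add <reg>, 1000": the EAX-specific short form,
   the generic ModRM form, and the form needing a REX prefix to reach
   one of the eight new registers. */
int main(void)
{
    static const struct { const char *insn; const char *bytes; int len; } enc[] = {
        { "add eax, 1000", "05 E8 03 00 00",       5 },  /* short EAX form */
        { "add ecx, 1000", "81 C1 E8 03 00 00",    6 },  /* generic ModRM  */
        { "add r8d, 1000", "41 81 C0 E8 03 00 00", 7 },  /* REX.B + ModRM  */
    };
    for (int i = 0; i < 3; i++)
        printf("%-15s %-22s %d bytes\n", enc[i].insn, enc[i].bytes, enc[i].len);
    return 0;
}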

>> 2. ARM has an array of addressing modes for STM/LDM: increment before,
>> increment after, decrement before, decrement after. Thumb only has
>> STM decrement before (STMDB) and LDM increment after (LDMIA). This
>> is not coincidentally the way the stack operates

x86 has a real stack...
>>
>> Owen
>>
>> * Remember that r13=SP, r14=LR, r15=PC, so they're somewhat less useful
>> from many perspectives
>
> Basically, they've traded orthogonality for code density. In
> small/cost-sensitive embedded designs, where the code footprint accounts
> for a significant part of the IC area and thereby cost, this could be
> just the right solution.

I sort of accept all that; what I don't get is the fact that for more or
less my entire IT career, I've been told that all the special x86
instructions with fixed and/or implied register operands made it very
hard/impossible to generate really good compiled code, and that this
problem was solved by having more registers and an orthogonal instruction
set, i.e. RISC. :-)

(Personally I've never really understood what was so hard about x86;
except for register pressure, mapping algorithms onto the
register/instruction set has felt quite natural.)

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
From: nedbrek on
Hello,

"Brett Davis" <ggtgp(a)yahoo.com> wrote in message
news:ggtgp-00E59C.22370510082010(a)news.isp.giganews.com...
> In article <i3rcrv$44e$1(a)news.eternal-september.org>,
> "nedbrek" <nedbrek(a)yahoo.com> wrote:
>
>> "Brett Davis" <ggtgp(a)yahoo.com> wrote in message
>> news:ggtgp-776ECA.23240209082010(a)news.isp.giganews.com...
>> >
>>
>> The biggest problem with CMOV is the renamer (so, it is easy to handle
>> for
>> an in-order machine).
>>
>> When you rename the subtract, you need to connect it to either the
>> instruction producing r1 (the add) or the producer of r4 (based on the
>> flags, which have [potentially] a third producer).
>
> I was assuming that CMOV was just another ALU op like ADC, which is
> an add plus a flag used as a bit. No renamer issues possible.
> Just shorthand for a full binary select: r1 = r2 ? r3 : r4
>
> So CMOV does not issue to the integer unit; it just acts as a rename.
> You end up with either a nop, or with r1 and r4 both pointing at the
> same physical register.
>
> This is only a win if all three integer pipes are full, because it
> will still take a cycle to turn the flag into a rename.
> (It will save power, though: no register reads, no ALU work, no
> register write.)
>
> Correct?

You're talking about executing the op in the renamer? It's possible, but
more complicated than executing a move - you need to bring the flags value
from the execution engine back to the renamer. Most execute-in-rename
proposals only execute movs and alu ops with immediates (e.g. add r1 +=
imm). That way, you can just store an offset value in the map (no need to
bring values in from execute).

Certainly possible, but a lot more effort.
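
For contrast, the simpler scheme looks roughly like this as a toy C
sketch (hypothetical structure, illustration only): each map entry
carries a physical tag plus an immediate offset, so movs and
add-immediates are absorbed at rename time without any value having to
come back from the execution engine.

#include <stdint.h>

#define NUM_ARCH_REGS 16

/* Each architectural register maps to a physical tag plus a small
   accumulated immediate offset. */
typedef struct {
    uint16_t phys;    /* physical register holding the base value */
    int32_t  offset;  /* immediate folded in at rename time       */
} map_entry;

typedef struct {
    map_entry e[NUM_ARCH_REGS];
} rename_map;

/* mov rd = rs : copy the whole entry (tag and pending offset). */
void rename_mov(rename_map *m, unsigned rd, unsigned rs)
{
    m->e[rd] = m->e[rs];
}

/* add rd += imm : fold the immediate into the map entry. */
void rename_add_imm(rename_map *m, unsigned rd, int32_t imm)
{
    m->e[rd].offset += imm;
}

/* A CMOV cannot be absorbed this way: choosing between two tags needs
   the flag value, which lives in the execution engine, not the map. */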

Of course, I don't know of any shipping mainstream processors with
execute-in-rename... (not counting special handling for SP)

Ned