From: Rod Pemberton on

"Wolfgang Kern" <nowhere(a)never.at> wrote in message
news:ff358o$o9$1(a)newsreader2.utanet.at...
> You may wonder how often CMOVcc and SETcc occur in programs targeted
> to +486 CPUs.
> As these two can save many branch instructions, I started to rewrite all
> my older code and gained ~20% speed on average without increasing its size.
>

I'm curious as to how you do this with SETcc. My understanding is that
SETcc:
1) is non-pairable
2) only operates on 8-bit operands
3) requires 'xor reg,reg' or 'sub reg,reg' on the full register prior to
the instruction to prevent a partial register stall
4) requires a 'movzx' or 'movsx' to the full register after the
instruction to prevent a partial register stall
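
I.e., something like this (just a rough NASM-style sketch of the pattern I
mean, not anyone's actual code):

________
XOR eax,eax ;clear the full register first to avoid the stall
CMP ecx,edx
SETg al ;al = 1 if ecx > edx (signed), else 0
;or, without the clear, widen afterwards:
CMP ecx,edx
SETg al
MOVZX eax,al ;zero-extend the result to the full register
________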

From what I can tell, for unsigned results it's faster to use full register
combinations of sbb,cmp,xor (see the sketch after this list) due to:
1) pairing
2) partial pairing
3) slow movzx/sx otherwise required
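
E.g., for an unsigned below test, roughly (again, just a sketch of what I
mean):

________
CMP eax,ebx ;CF = 1 if eax is below ebx (unsigned)
SBB ecx,ecx ;ecx = -1 if CF was set, else 0
NEG ecx ;ecx = 1 or 0, if a 0/1 result is wanted
________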

That's what I got from the Intel optimization manuals, anyway... What'd I
miss?


Rod Pemberton

From: Wolfgang Kern on

Rod Pemberton wrote:

>> You may wonder how often CMOVcc and SETcc occur in programs targeted
>> to +486 CPUs.
>> As these two can save many branch instructions, I started to rewrite all
>> my older code and gained ~20% speed on average without increasing its size.

> I'm curious as to how you do this with SETcc. My understanding is that
> SETcc:
> 1) is non-pairable
> 2) only operates on 8-bit operands
> 3) requires 'xor reg,reg' or 'sub reg,reg' on the full register prior to
> the instruction to prevent a partial register stall
> 4) requires a 'movzx' or 'movsx' to the full register after the
> instruction to prevent a partial register stall

The idea is not primarily to work out values with SETcc (though that is
also possible); I mainly use it as temporary storage for condition status,
instead of PUSHF/POPF and/or in addition to LAHF/SAHF.

I usually try to work in registers, and we have eight GP byte-regs,
but when registers become scarce I may even use:

________(this isn't actual code, just the shape of it)________
PUSH +0 ;creates four zeroed 'local' bytes on the stack
...
SETcc byte[esp+x] ;park a condition for later; x can be 0..3 here
...
CMP byte[esp+x],0 ;re-test the stored condition later on
SETcc ..
CMOVnz ..
;also working:
TEST dword[esp],0x01010001 ;to check several stored states at once
CMOVz ..
;sometimes helpful for table offsets:
MOV ecx,0x00480080
OR eax,eax ;sets ZF/SF from eax
SETz CH ;adjusts ecx to 480180h (Z) or 480080h (NZ)
CMOVs eax,[ecx] ;load from the selected table if eax was negative
...
LEA esp,[esp+4] ;instead of ADD ESP,4, to keep the flags alive
________________

> From what I can tell, for unsigned results it's faster to use full
> register combinations of sbb,cmp,xor due to:
> 1) pairing
> 2) partial pairing
> 3) slow movzx/sx otherwise required

Yes, except for 3): MOVSX/MOVZX and shifts aren't slow at all on AMDs.

> That's what I got from the Intel optimization manuals, anyway...
> What'd I miss?

The penalties for partial register stalls and unaligned byte access
are easily gained back by saving on branch instructions and code size.
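
A typical trade looks about like this (again not actual code, just a
minimal sketch; the gain of course depends on how predictable the branch
is):

________
;with a branch:
CMP eax,ebx
JBE short keep_it
MOV eax,ebx ;eax = min(eax,ebx), unsigned
keep_it:
;branch-free, no mispredict possible:
CMP eax,ebx
CMOVa eax,ebx ;eax = min(eax,ebx), unsigned
________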

Sure, avoiding SETcc may improve speed, but at the cost of code size and
the need for more registers/locals, which in turn increases timing.
__
wolfgang