From: Terje Mathisen <"terje.mathisen at tmsw.no"> on
Andy 'Krazy' Glew wrote:
> Overflow ::
>
> a >= 0 && b >= 0 ::
>
> a+b >= a ==> no overflow
> a+b < a ==> overflow
>
> a < 0 && b >= 0 ::
> a >= 0 && b < 0 ::
> no overflow
> (although you can have underflow,
> negative overflow, which is handled
> similarly)

No, you cannot get that: Opposite signs are totally safe.
>
> a < 0 && b < 0
> no overflow
> (although you can have underflow,
> negative overflow, which is handled
> similarly)

Right.
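Spelled out as explicit branches, the case analysis above looks like this (my sketch, not Glew's code; the sum is formed in unsigned arithmetic, which wraps, so the check itself avoids signed-overflow undefined behaviour — converting the wrapped sum back to int is assumed to wrap too, as it does on mainstream two's-complement compilers):

```c
#include <assert.h>
#include <limits.h>

/* Sign-rule overflow check for a+b, following the cases above. */
static int add_overflows(int a, int b)
{
    int sum = (int)((unsigned)a + (unsigned)b);  /* wrapped sum */
    if (a >= 0 && b >= 0)
        return sum < a;      /* wrapped past INT_MAX */
    if (a < 0 && b < 0)
        return sum >= 0;     /* wrapped past INT_MIN */
    return 0;                /* opposite signs: always safe */
}
```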

The rule isn't too bad then:

sum = a+b;
possible_overflow = (a ^ b) > 0; // I.e. same sign
overflow = possible_overflow &&
((sum ^ a) < 0); // Sign of the sum has changed!

Alternatively:

sum = a+b;
no_overflow = ((a ^ b) < 0) || ((sum ^ a) > 0);

or:

sum = a+b;
no_overflow = ((a ^ b) | (~sum ^ a)) < 0;

Checking...

a b sum no_overflow
+ + + 1
+ + - 0
+ - + 1
+ - - 1
- + + 1
- + - 1
- - + 0
- - - 1

I.e. that seems to work.
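Besides checking the eight sign combinations, the expression can be verified exhaustively by brute force over 8-bit operands (my addition, not part of the original post; out-of-range conversion to int8_t is assumed to wrap, as on all common compilers):

```c
#include <assert.h>
#include <stdint.h>

/* Exhaustively verify ((a^b) | (~sum^a)) < 0  <=>  no signed overflow,
   over all 65536 pairs of 8-bit operands. */
static int check_all(void)
{
    for (int a = -128; a <= 127; a++) {
        for (int b = -128; b <= 127; b++) {
            int8_t sum = (int8_t)(a + b);      /* wrapped 8-bit sum */
            int flag = (int8_t)((a ^ b) | (~sum ^ a)) < 0;
            int really_no_ovf = (a + b >= -128) && (a + b <= 127);
            if (flag != really_no_ovf)
                return 0;                      /* formula disagrees */
        }
    }
    return 1;                                  /* formula holds everywhere */
}
```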

Latencywise it is better to invert a (or b) instead of the sum:

sum = a+b;
no_overflow = ((a ^ b) | (sum ^ ~a)) < 0;

The next stage is to convert this flag into a properly saturated result:

sum = a+b;
no_overflow = ((a ^ b) | (sum ^ ~a)); // Sign bit set if no overflow
mask = no_overflow >> (INT_BITS-1); // Smear the sign bit
sum = (sum & mask) |
((SAT_MAX ^ (a >> (INT_BITS-1))) & ~mask);

I.e. the result is either the original sum, if the mask is all 1
bits, or a saturated value: either SAT_MAX if a (and b!) was positive,
or SAT_MIN, which is SAT_MAX with all bits inverted.
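Folded into a self-contained function, the saturating-add technique looks like this (a sketch, mine: 32-bit ints, arithmetic right shift of negative values, and wrapping unsigned-to-signed conversion are all assumed — choices the thread itself flags as outside strictly portable C; the saturation constant is picked so two positive operands saturate to INT32_MAX):

```c
#include <stdint.h>

#define INT_BITS 32

/* Branchless saturating 32-bit add using the sign-smearing mask trick. */
static int32_t sat_add(int32_t a, int32_t b)
{
    int32_t sum  = (int32_t)((uint32_t)a + (uint32_t)b); /* wrapped sum */
    int32_t no_overflow = (a ^ b) | (sum ^ ~a); /* sign bit set if safe */
    int32_t mask = no_overflow >> (INT_BITS - 1);        /* -1 or 0 */
    /* Saturated value: INT32_MAX if a (and b) >= 0, else INT32_MIN. */
    int32_t sat  = INT32_MAX ^ (a >> (INT_BITS - 1));
    return (sum & mask) | (sat & ~mask);
}
```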

The problem is the total latency:

; mov eax,[a]
; mov ebx,[b]

lea esi,[eax+ebx] // sum
xor ebx,eax // b ^= a
mov edx,eax
xor eax,-1 // ~a

xor eax,esi // ~a ^ sum
sar edx,31 // a >> 31

or eax,ebx // (a ^ b) | (~a ^ sum)

sar eax,31 // mask value, -1 or 0

and esi,eax // sum & mask
xor eax,-1 // ~mask
xor edx,0x7fffffff // SAT_MAX ^ (a >> 31), i.e. SAT_MAX or SAT_MIN

and eax,edx // saturated value, or 0 if no overflow

or eax,esi // final result

I.e. at least 7 cycles.

OTOH, this is only a cycle or two more than the branchless CMOVcc
version, and the code above is not too dissimilar from what a compiler
would generate.

Terje

PS. It is interesting that I only had to use a single MOV in the code
above to compensate for the fact that x86 uses two-operand instead
of three-operand instructions, and that MOV did not add any extra
latency at all.
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
From: Anton Ertl on
Terje Mathisen <"terje.mathisen at tmsw.no"> writes:
> sum = a+b;
> possible_overflow = (a ^ b) > 0; // I.e. same sign

That must be (a^b)>=0, or you get false for a=b.

> sum = a+b;
> no_overflow = ((a ^ b) | (~sum ^ a)) < 0;

Faster on some machines, more elegant, and probably easier to explain:

no_overflow = ((a^sum)&(b^sum))>=0;

But of course, all of this is outside the standardized subset of C.

- anton
--
M. Anton Ertl Some things have to be seen to be believed
anton(a)mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html
From: Jeremy Linton on
On 6/22/2010 5:06 AM, Terje Mathisen wrote:

> Watcom allowed you to define pretty much any operation yourself, in the
> form of inline macro operations where you specified to the compiler
> where the inputs needed to be, where the result would end up and exactly
> which registers/parts of memory would be modified.
>
> This basically meant that you could extend the built-in range of basic
> operations while still getting pretty well-optimized code.

How is that different from GCC's extended inline assembly
(input/output/clobber lists)? The Linux kernel uses macros all over
the place to inline system register moves, or similar behaviors on
assorted architectures. Looking at the resultant code, gcc seems to do
a very good job of register scheduling even through blocks of inline
assembly.
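For reference, a minimal example of the GCC extended-asm form Jeremy describes (my sketch, x86-64 assumed, with a fallback for other targets):

```c
#include <assert.h>

/* Add two integers via GCC extended inline assembly.
   "=r"/"r" are the output/input constraint lists; no clobber list is
   needed since only the output register is written. */
static int asm_add(int a, int b)
{
#if defined(__GNUC__) && defined(__x86_64__)
    int sum;
    __asm__("leal (%1,%2), %0"
            : "=r"(sum)          /* output */
            : "r"(a), "r"(b));   /* inputs */
    return sum;
#else
    return a + b;                /* fallback on other targets */
#endif
}
```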

From: Terje Mathisen <"terje.mathisen at tmsw.no"> on
Jeremy Linton wrote:
> On 6/22/2010 5:06 AM, Terje Mathisen wrote:
>
>> Watcom allowed you to define pretty much any operation yourself, in the
>> form of inline macro operations where you specified to the compiler
>> where the inputs needed to be, where the result would end up and exactly
>> which registers/parts of memory would be modified.
>>
>> This basically meant that you could extend the built-in range of basic
>> operations while still getting pretty well-optimized code.
>
> How is that different from GCC's extended inline assembly
> (input/output/clobber lists)? The Linux kernel uses macros all over the

It isn't.

Watcom had (imho) a somewhat more elegant syntax to describe this stuff,
but the end result is pretty close to identical.

> place to inline system register moves, or similar behaviors on assorted
> architectures. Looking at the resultant code, gcc seems to do a very good
> job of register scheduling even through blocks of inline assembly.

:-)

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
From: MitchAlsup on
On Jun 24, 4:04 pm, Terje Mathisen <"terje.mathisen at tmsw.no">
wrote:
> In your dreams.

Reminds me of the old idiom: "Never bet against your branch predictor".

Mitch