From: Andy 'Krazy' Glew on
On 6/23/2010 11:18 PM, EricP wrote:
> Terje Mathisen wrote:
>> Thomas Womack wrote:
>>> In
>>> article<a6d4ec20-9052-4003-a3c7-486885d791a4(a)q12g2000yqj.googlegroups.com>,
>>>
>>> MitchAlsup<MitchAlsup(a)aol.com> wrote:
>>>> # define sat_add(a,b) ((tmp = (a)+(b)), (tmp > SAT_MAX ? SAT_MAX : \
>>>> (tmp < SAT_MIN ? SAT_MIN : tmp)))
>>>
>>> And what type is 'tmp'?
>>
>> Any signed type with at least one more bit of precision than a or b?
>> :-)
>>
>> Terje
>>
>
> Ok, a signed (2's complement) overflow detect is:
> sum = a + b;
> overflow = ((a^sum)&(b^sum)) < 0;
>
> so....
>
> #define sat_add(a,b) ((sum=(a)+(b)), (((a)^sum)&((b)^sum)) < 0 ? \
> (sum < 0 ? SAT_MAX : SAT_MIN) : sum)


It will be fun to collect a list of idioms for overflow detection in not-really-portable C,
assuming 2's-complement signed and/or normal binary unsigned arithmetic.
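
For instance, here is EricP's idiom wrapped up as a minimal, self-contained
function (my own names and widths; it does the add in unsigned arithmetic so
the wraparound itself is well defined, and leans on the usual
implementation-defined 2's-complement conversion back to int32_t):

#include <stdint.h>

/* Sketch only: saturating 32-bit signed add using EricP's sign-bit test. */
static inline int32_t sat_add32(int32_t a, int32_t b)
{
    int32_t sum = (int32_t)((uint32_t)a + (uint32_t)b); /* wrapping add, no signed-overflow UB */

    /* Overflow occurred iff a and b have the same sign and sum has the other sign. */
    if (((a ^ sum) & (b ^ sum)) < 0)
        return (sum < 0) ? INT32_MAX : INT32_MIN;       /* wrapped negative => positive overflow */
    return sum;
}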
From: Andy 'Krazy' Glew on
On 6/23/2010 11:18 PM, EricP wrote:
> Ok, a signed (2's complement) overflow detect is:
> sum = a + b;
> overflow = ((a^sum)&(b^sum)) < 0;
>
> so....
>
> #define sat_add(a,b) ((sum=(a)+(b)), (((a)^sum)&((b)^sum)) < 0 ? \
> (sum < 0 ? SAT_MAX : SAT_MIN) : sum)


I'm going to start collecting these at

https://semipublic.comp-arch.net/wiki/overflow_detection_idioms

That page is writable only by me, but anyone can write to its shadow page

http://wiki.public.comp-arch.net/wiki/Overflow_detection_idioms

if you want to add.


Although I suggest just posting to comp.arch; I'll collect.
From: Tim McCaffrey on
In article
<cb699368-3591-4f59-a35e-df5878d5bce8(a)z31g2000vbk.googlegroups.com>,
MitchAlsup(a)aol.com says...
>
>On Jun 23, 10:42 am, Andy 'Krazy' Glew <ag-n...(a)patten-glew.net> wrote:
>> On 6/22/2010 11:12 AM, Tim McCaffrey wrote:
>>
>> > 2) Easy to decode:  reduces gate count, which reduces power consumption, and
>> > potentially removes a pipeline stage (maybe).  AFAICT, every x86 has a
>> > limitation of only being able to decode/issue one instruction if it hasn't
>> > been executed before.  It appears all x86 implementations use the I-cache to
>> > mark instruction boundaries for parallel decoding on the following passes.
><snip>
>> AMD has long had this limit.
>
>No, not quite. When the Athlon/Opteron processors fetch an instruction
>that has no marker bits, it decodes 4 bytes per cycle. There can be
>0,1,2,3, or 4 instructions, and the decode pipeline is capable of
>doing 0,1,2,3 from there. A majority of the time, the choice is from
>the set {0,1} due to boundary spanning.
>

So, ISA does matter here. If the ISA were easier to decode, then more
instructions could be decoded on the first pass.

- Tim

From: MitchAlsup on
On Jun 24, 9:20 am, timcaff...(a)aol.com (Tim McCaffrey) wrote:
> In article
> <cb699368-3591-4f59-a35e-df5878d5b...(a)z31g2000vbk.googlegroups.com>,
> MitchAl...(a)aol.com says...
>
> >On Jun 23, 10:42 am, Andy 'Krazy' Glew <ag-n...(a)patten-glew.net> wrote:
> >> On 6/22/2010 11:12 AM, Tim McCaffrey wrote:
>
> >> > 2) Easy to decode:  reduces gate count, which reduces power consumption, and
> >> > potentially removes a pipeline stage (maybe).  AFAICT, every x86 has a
> >> > limitation of only being able to decode/issue one instruction if it hasn't
> >> > been executed before.  It appears all x86 implementations use the I-cache to
> >> > mark instruction boundaries for parallel decoding on the following passes.
> ><snip>
> >> AMD has long had this limit.
>
> >No, not quite. When the Athlon/Opteron processors fetch an instruction
> >that has no marker bits, it decodes 4 bytes per cycle. There can be
> >0,1,2,3, or 4 instructions, and the decode pipeline is capable of
> >doing 0,1,2,3 from there. A majority of the time, the choice is from
> >the set {0,1} due to boundary spanning.
>
> So, ISA does matter here.  If the ISA were easier to decode, then more
> instructions could be decoded on the first pass.

The loss in performance of doing it this way, as opposed to figuring
out how to build a 3-wide unmarked decoder, is down in the noise (close
to, but slightly under, 1% for Opteron).

But the point I want to make here is that the word 'decode' is being
used improperly. A better word would be 'parse', as in the instructions
have been 'parsed' out of the byte stream. The instructions have not
been decoded--which in the RISC universe meant registers have been
read and reservation station entries written. This is still a couple of
cycles down the pipe for x86. These are simply pipe stages with a cost
just above the noise floor (under 2%); the actual hard part of x86
instruction processing is determining where the first instruction byte
of the NEXT parse cycle starts. The rest of it is "not-so-bad".

Yes, there is a lot of cruft in x86. However, there are more efficient
automotive engines (Stirling) and more powerful automotive engines
(turbine), yet nothing today matches the utility per unit cost of
the internal combustion engine. And so it is with x86 (for better or
for worse).

----------------------------

But back on topic: Can anyone show me a piece of code that depends on
integer overflow to create a buffer overrun? (Note: not integer
underflow)
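
(For illustration only, one commonly cited shape of such a bug, with invented
names and sizes, not taken from any real codebase: an attacker-controlled
count makes the allocation-size multiplication wrap around, malloc() hands
back a tiny buffer, and the copy loop then overruns it.)

#include <stdlib.h>

struct rec { char payload[64]; };

struct rec *copy_records(const struct rec *src, size_t count)
{
    /* If count > SIZE_MAX / sizeof *src, this product wraps to a small value. */
    struct rec *dst = malloc(count * sizeof *src);
    if (dst == NULL)
        return NULL;
    for (size_t i = 0; i < count; i++)   /* writes far past the short allocation */
        dst[i] = src[i];
    return dst;
}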

Mitch
From: Terje Mathisen "terje.mathisen at tmsw.no" on
EricP wrote:
> Terje Mathisen wrote:
>> Thomas Womack wrote:
>>> In
>>> article<a6d4ec20-9052-4003-a3c7-486885d791a4(a)q12g2000yqj.googlegroups.com>,
>>>
>>> MitchAlsup<MitchAlsup(a)aol.com> wrote:
>>>> # define sat_add(a,b) ((tmp = (a)+(b)), (tmp > SAT_MAX ? SAT_MAX : \
>>>> (tmp < SAT_MIN ? SAT_MIN : tmp)))
>>>
>>> And what type is 'tmp'?
>>
>> Any signed type with at least one more bit of precision than a or b?
>> :-)
>>
>> Terje
>>
>
> Ok, a signed (2's complement) overflow detect is:
> sum = a + b;
> overflow = ((a^sum)&(b^sum)) < 0;
>
> so....
>
> #define sat_add(a,b) ((sum=(a)+(b)), (((a)^sum)&((b)^sum)) < 0 ? \
> (sum < 0 ? SAT_MAX : SAT_MIN) : sum)
>
> Good compiler would use CMOVs so no branches.

In your dreams.
:-(

I believe this is the logic you want to define, right?

mov eax,[a]
add eax,[b]
jno done ; If not Overflow then we're OK
mov eax,SAT_MIN ; Assume negative underflow
jns done ; SAT_MIN unless the sign flag got set
mov eax,SAT_MAX ; SAT_MAX otherwise
done:

That macro above will, for most code, run in about 2 cycles, since the
first branch will be correctly predicted as taken.

Using an INTO handler avoids the JNO, making the macro shorter, but it
also increases the cost of handling overflows very significantly. :-(

The problem is when you have an array that is about to become
ill-conditioned and you get unpredictable overflows on about half of all
iterations: in that case we're talking 15-25 cycles, or an order of
magnitude slower, due to mispredicted branches.

Using CMOVs instead could help:

mov eax,[a]
mov ebx,SAT_MIN
add eax,[b]
mov edx,SAT_MAX
cmovs ebx,edx ;; Select the proper kind of saturation
cmovo eax,ebx ;; Replace the result with saturated value

This code runs in a totally predictable 6 (!) cycles, which means that
it is only when you have at least 30+% mispredicted branches that this
will be faster than the first macro.
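
(A C-level sketch of the same branch-free idea, with my own function name;
whether a given compiler really turns the ternaries into CMOVs is of course
up to the compiler.)

#include <stdint.h>

static inline int32_t sat_add32_branchless(int32_t a, int32_t b)
{
    int32_t sum = (int32_t)((uint32_t)a + (uint32_t)b);   /* wrapping add */
    int32_t sat = (sum < 0) ? INT32_MAX : INT32_MIN;      /* saturation value, from the wrapped sign */
    int ovf = (((a ^ sum) & (b ^ sum)) < 0);               /* nonzero iff signed overflow occurred */
    return ovf ? sat : sum;                                 /* ideally a CMOV rather than a branch */
}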

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"