From: Andy 'Krazy' Glew on
On 6/23/2010 1:28 AM, Terje Mathisen wrote:
> Thomas Womack wrote:
>> In
>> article<a6d4ec20-9052-4003-a3c7-486885d791a4(a)q12g2000yqj.googlegroups.com>,
>>
>> MitchAlsup<MitchAlsup(a)aol.com> wrote:
>>> # define sat_add(a,b) (((tmp = (a)+(b)), (tmp> SAT_MAX ? SAT_MAX:
>>> (tmp< SAT_MIN ? SAT_MIN : tmp)))
>>
>> And what type is 'tmp'?
>
> Any signed type with at least one more bit of precision than a or b?

And that is the problem. What if you are working in the largest integral type? What if you do not know what type you
are working with?

Whereas, if you use the normal behaviour of 2's complement integers (signed - what does it mean to say that a 2's
complement number is unsigned?)

#define sat_add(a,b) ((typeof<a>(a+b)>(a))&&(typeof<a>(a+b)>(b))?(a+b):SAT_MAX)

works for all 2's complement types. signed. and, yes, unsigned.
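For the unsigned case, the wraparound test really does work and can be written without any wider temporary. A minimal sketch in C (the helper name `sat_add_u` is mine, not from the thread):

```c
#include <limits.h>

/* Unsigned saturating add: if the mod-2^N sum wrapped around, it is
   smaller than either operand, so a single compare detects overflow. */
static unsigned sat_add_u(unsigned a, unsigned b)
{
    unsigned sum = a + b;              /* wraps modulo 2^N, well-defined in C */
    return (sum < a) ? UINT_MAX : sum;
}
```

For signed operands this one-sided test is not sufficient, as Terje points out later in the thread.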


---


What code detects overflow for the largest integer type?
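One answer, sketched in C: check against the limits before adding, so no wider temporary is needed even for `intmax_t`. The function name is mine; GCC and Clang also provide `__builtin_add_overflow` for the same job.

```c
#include <stdbool.h>
#include <stdint.h>

/* Detect signed overflow of a+b in the widest integer type by
   pre-checking the limits - no wider temporary required. */
static bool add_overflows(intmax_t a, intmax_t b)
{
    if (b > 0) return a > INTMAX_MAX - b;   /* sum would exceed the top */
    if (b < 0) return a < INTMAX_MIN - b;   /* sum would fall below the bottom */
    return false;
}
```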

From: Andy 'Krazy' Glew on
On 6/22/2010 11:12 AM, Tim McCaffrey wrote:

> 2) Easy to decode: reduces gate count, which reduces power consumption, and
> potentially removes a pipeline stage (maybe). AFAICT, every x86 has a
> limitation of only being able to decode/issue one instruction if it hasn't
> been executed before. It appears all x86 implementations use the I-cache to
> mark instruction boundaries for parallel decoding on the following passes.

Not true.

The original Intel P6 had a 4-1-1 decoder template - it could decode 1 complex instruction and 2 simple instructions per
cycle.

Willamette's trace cache had the limit you describe.

I think Nehalem may have such a limit right now.

P5 had such a limit. Turned out that P6's decoder was a large part of its advantage over P5.

AMD has long had this limit.

But superscalar x86 decode has been done.
From: Andy 'Krazy' Glew on
On 6/22/2010 2:18 PM, MitchAlsup wrote:
> On Jun 22, 1:12 pm, timcaff...(a)aol.com (Tim McCaffrey) wrote:

>> You say ISA doesn't matter, but you note several cases where extra gates were
>> added to handle x86ism's.
>
> So, x86 has an operand multiplexer in the logic cycle of the integer
> units. Big deal, 2 gate delays out of 16-18 total computational gates
> (plus flop jitter and skew). The thing that more than completely
> compensates for this is the FAB can achieve 2X the frequency with
> these 2 extra gates of delay than any of the non-big-guys FABs can
> produce without those 2 little gates of added delay. In the load
> aligner, you have another 2 gates to deal with misalignedness. None of
> these gates matter unless the competition has access to similar FAB
> technology.


By the way, at the moment the trend is to ADD such extra gate delays - to increase the gates per clock. More
gates/clock => better tolerance of transistor variability => better yield. Which corresponds to higher
frequency/performance in the faster process skews.

So, if you can add extra gate delays, it may help. Assuming they are not gratuitous - assuming they help some
reasonable fraction of the code.

Candidates include sign-extending loads - although loads already have such a deep path.

I was at a conference earlier this year where somebody was proposing logic cone smashing - smashing 4-8 dependent
logical operations into a single instruction. E.g. instructions like A&B|C^D ... Particularly if there are only two
register inputs, and all of the other operands are constants (or pseudo-constants, values from constant registers that
are loaded way in advance).

(BTW, this logic cone smashing is not a new idea, just a revival of an old idea. Wheel of Reincarnation.)
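A rough C illustration of the shape being proposed: two register inputs, the remaining operands constants, and several dependent bitwise operations that a "smashed" instruction could retire in one step. The masks and names here are made up for illustration only.

```c
#include <stdint.h>

/* Hypothetical logic cone: a & MASK1, b ^ MASK2, the OR of those two,
   then a final AND - four dependent logical operations with only two
   register inputs, the pattern cone-smashing targets. */
#define MASK1 0x0f0f0f0fu
#define MASK2 0x000000ffu
#define MASK3 0xffffff00u

static uint32_t cone(uint32_t a, uint32_t b)
{
    return ((a & MASK1) | (b ^ MASK2)) & MASK3;
}
```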
From: nmm1 on
In article <lfGdnTl_14xbvr_RnZ2dnUVZ_i2dnZ2d(a)giganews.com>,
Andy 'Krazy' Glew <ag-news(a)patten-glew.net> wrote:
>
>> The former, no, but you are wrong with the latter. The point is that
>> you can't do any significant code rearrangement if you want to either
>> 'capture what the hardware does' or produce deterministic results.
>> That shows up much more clearly with IEEE 754, but the same applies
>> to integers once you do anything non-trivial or have (say) a two's
>> complement model with an overflow flag (i.e. like IEEE 754).
>
>2's complement WITHOUT an overflow flag CAN be rearranged significantly.
>
>This is a "Just think about it" issue.

I have. Perhaps our definitions of "non-trivial" differ. Well,
actually, I am pretty sure that they do in this context.


Regards,
Nick Maclaren.
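The rearrangement property Glew appeals to can be seen in C, where unsigned arithmetic is defined to wrap modulo 2^N and addition is therefore fully associative. The function names below are illustrative only.

```c
#include <stdint.h>

/* Wraparound (mod 2^32) addition is associative, so a compiler may
   reassociate freely without changing the result - unlike saturating
   or trapping arithmetic. */
static uint32_t sum3_left(uint32_t a, uint32_t b, uint32_t c)
{
    return (a + b) + c;
}

static uint32_t sum3_right(uint32_t a, uint32_t b, uint32_t c)
{
    return a + (b + c);
}
```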
From: Terje Mathisen "terje.mathisen at tmsw.no" on
Andy 'Krazy' Glew wrote:
> Whereas, if you use the normal behaviour of 2's complement integers
> (signed - what does it mean to say that a 2's complement number is
> unsigned?)
>
> #define sat_add(a,b)
> ((typeof<a>(a+b)>(a))&&(typeof<a>(a+b)>(b))?(a+b):SAT_MAX)
>
> works for all 2's complement types. signed. and, yes, unsigned.

Huh???

What happens when both a and b are negative?

(-1 + -2) is less than both -1 and -2, so both parts of that test will
agree that the proper answer is SAT_MAX: Probably not what you want!

The next issue is of course when you do (-100 + -100) with 8-bit values
and end up with +56 instead of -200 or a saturated -128.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
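A correct variant for Terje's 8-bit case - widen first, then clamp - sketched in C (the function name is mine, not from the thread):

```c
#include <stdint.h>

/* Compute in int (guaranteed wider than 8 bits), then clamp to the
   int8_t range, so -100 + -100 saturates to -128 instead of wrapping. */
static int8_t sat_add8(int8_t a, int8_t b)
{
    int sum = (int)a + (int)b;
    if (sum > INT8_MAX) return INT8_MAX;
    if (sum < INT8_MIN) return INT8_MIN;
    return (int8_t)sum;
}
```

Plain truncation, `(int8_t)(-100 + -100)`, does come out as +56 on the usual two's complement targets, matching Terje's example.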