From: Eliot Miranda on


John Mashey wrote:
> Eliot Miranda wrote:
>
>>John Mashey wrote:
>>Code sequences for polymorphic add/sub or compare would then look like
>>
>> fetch operand one
>> fetch operand two
>> tagged add/sub
>> branch no overflow Ldone
>> code for non-tagged & overflow cases (method lookup)
>> ...
>>Ldone:
>
>
> Just out of curiosity, what's known about the dynamic frequencies of
> the two cases?

I can collect accurate figures in a week (off for a quick vacation :) :)
). But my gut is that in ordinary code the ratio approaches 100% to
0%. Tagged/small/immediate integers cover the vast majority of cases
(iteration, most arithmetic). Only special code (e.g. security) will
overflow into large integer arithmetic. This is in part confirmation
bias because, as Terje points out, the relative cost of tagged vs boxed
arithmetic is at least an order of magnitude worse, and so the system is
written to avoid that cost as much as possible. But it's mostly the fact
that in a symbolic language like Smalltalk the highest dynamic frequency
for integer math is likely to be iteration.
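
To make that concrete, here is a minimal C sketch of the fast path in
John's code sequence above, assuming the common value*2+1 tag scheme for
SmallIntegers (low bit set). The names and the slow-path hook are
illustrative only, and the overflow test uses the GCC/Clang
__builtin_add_overflow builtin:

  #include <stdint.h>

  typedef intptr_t oop;                 /* tagged immediate or object pointer */

  #define IS_SMALL(x) (((x) & 1) == 1)  /* low tag bit set => SmallInteger */

  extern oop slow_path_add(oop a, oop b);  /* method lookup, LargeIntegers */

  oop poly_add(oop a, oop b)
  {
      if (IS_SMALL(a) && IS_SMALL(b)) {
          oop r;
          /* (2x+1) + (2y+1) - 1 == 2(x+y)+1: one subtract re-tags the
           * sum, and word overflow coincides with SmallInteger range
           * overflow, so a single test covers both. */
          if (!__builtin_add_overflow(a - 1, b, &r))
              return r;                 /* the ~100% case */
      }
      return slow_path_add(a, b);       /* the rare overflow/non-tagged case */
  }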

--
_______________,,,^..^,,,____________________________
Eliot Miranda Smalltalk - Scene not herd

From: Terje Mathisen on
Eliot Miranda wrote:
>
>
> John Mashey wrote:
>
>> Just out of curiosity, what's known about the dynamic frequencies of
>> the two cases?
>
> I can collect accurate figures in a week (off for a quick vacation :) :)

Please do, and good for you! (Collecting figures and vacation).

> ). But my gut is that in ordinary code the ratio approaches 100% to
> 0%. Tagged/small/immediate integers cover the vast majority of cases
> (iteration, most arithmetic). Only special code (e.g. security) will
> overflow into large integer arithmetic. This is in part confirmation
> bias because, as Terje points out, the relative cost of tagged vs boxed
> arithmetic is at least an order of magnitude worse, and so the system is

Actually, it might be even worse: even if most bigint math happens to
stay within (say) four words of storage, and even if we disregard the
cost of the trap (because the alternative is to always call out-of-line
functions directly), you still have the call/return overhead, the
cost of comparing the lengths of the two operands, possibly the cost of
allocating the return value, and the actual cost of doing the multi-word
addition.

The small/immediate case will take just one or two cycles, three at the
very maximum, while it would be very easy to have two or even three
branch misses, each taking ~10 cycles, inside the bigint function.
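
To put those costs in concrete form, here is a hedged C sketch of the
slow path, handling unsigned magnitudes only; the BigInt layout and all
names are hypothetical, not from any particular implementation:

  #include <stdint.h>
  #include <stdlib.h>

  typedef struct {
      size_t    nwords;     /* number of limbs in use */
      uintptr_t word[];     /* little-endian limbs (C99 flexible array) */
  } BigInt;

  BigInt *bigint_add(const BigInt *a, const BigInt *b)
  {
      /* cost 1: we are already past a call (and will pay a return) */
      if (a->nwords < b->nwords) {          /* cost 2: compare the lengths */
          const BigInt *t = a; a = b; b = t;
      }
      /* cost 3: allocate the result, with room for one word of carry */
      BigInt *r = malloc(sizeof *r + (a->nwords + 1) * sizeof(uintptr_t));
      if (!r)
          return NULL;

      /* cost 4: the multi-word addition itself, propagating carries */
      uintptr_t carry = 0;
      for (size_t i = 0; i < a->nwords; i++) {
          uintptr_t bw = i < b->nwords ? b->word[i] : 0;
          uintptr_t s  = a->word[i] + bw;
          uintptr_t c1 = s < a->word[i];        /* carry out of first add  */
          r->word[i]   = s + carry;
          carry        = c1 | (r->word[i] < s); /* carry out of second add */
      }
      r->word[a->nwords] = carry;
      r->nwords = a->nwords + (carry != 0);
      return r;
  }

Every labelled cost is paid even when both operands fit in a word or
two, which is where the order-of-magnitude gap comes from.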

> written to avoid that cost as much as possible. But it's mostly the fact
> that in a symbolic language like Smalltalk the highest dynamic frequency
> for integer math is likely to be iteration.

Right, it is hard to imagine how it could be otherwise.

Terje
--
- <Terje.Mathisen(a)hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"
From: glen herrmannsfeldt on
Terje Mathisen wrote:

(snip)

> The x86 INTO style single-byte trap would be the best possible way to
> handle it as long as you also had the fast user-level trap handler you
> wished you had been able to fit into the MIPS architecture:

How much difference does it make at run time, though?

In the case of a conditional branch based on the overflow bit, an INTO
trap instruction, or an overflow trap without any added instructions,
nothing after that can be committed until the status bit is available.

I would assume that a branch on overflow would be predicted as not
taken.

Other than the cost of the actual instruction bits, which are pretty
variable over different architectures, would you expect any difference
at runtime for the three cases?

-- glen

From: Terje Mathisen on
glen herrmannsfeldt wrote:

> Terje Mathisen wrote:
>
> (snip)
>
>> The x86 INTO style single-byte trap would be the best possible way to
>> handle it as long as you also had the fast user-level trap handler you
>> wished you had been able to fit into the MIPS architecture:
>
> How much difference does it make at run time, though?
>
> In the case of a conditional branch based on the overflow bit, an INTO
> trap instruction, or an overflow trap without any added instructions,
> nothing after that can be committed until the status bit is available.
>
> I would assume that a branch on overflow would be predicted as not
> taken.
>
> Other than the cost of the actual instruction bits, which are pretty
> variable over different architectures, would you expect any difference
> at runtime for the three cases?

The only important difference is that INTO is effectively a
predicted-not-taken call, making it very easy to return to the correct
point after fixup.

To replace this with an inline branch instruction would require the
capability to save the next (i.e. return) address.

If your architecture can do this, then by all means use an inline
branch-and-link instead of the INTO style check!

The only real difference in this case would be that the branch opcode
probably takes 4 bytes vs. the single byte of INTO. However, as long as
you have fixed 16- or 32-bit instruction sizes anyway, it really doesn't
matter.
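
In portable terms, the inline-branch shape looks roughly like this GNU C
sketch (the function names are made up); the call in the rare arm plays
the branch-and-link role, since the call instruction itself saves the
return address:

  /* The rare arm: fix up the overflow (boxing, method lookup, ...).
   * __attribute__((cold)) asks GCC/Clang to lay it out of line.   */
  extern long overflow_fixup(long a, long b) __attribute__((cold));

  long checked_add(long a, long b)
  {
      long r;
      /* __builtin_add_overflow does the add-and-test in one step;
       * __builtin_expect marks the overflow arm predicted-not-taken. */
      if (__builtin_expect(__builtin_add_overflow(a, b, &r), 0))
          return overflow_fixup(a, b);  /* call saves the return address */
      return r;                         /* fall-through fast path        */
  }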

Terje
--
- <Terje.Mathisen(a)hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"
From: Terje Mathisen on
John Mashey wrote:
> Terje Mathisen wrote:
>>It doesn't really matter:
>
> Well, actually it does (in general, perhaps not in this case), which is

Oh, absolutely.

> why I asked.
> Statistical frequencies *always* matter, at least in the sense that
> while it may not matter much if something happens 45 or 50% of the
> time, it may matter whether it happens 10% or 1% or .1% or .0000001%.

Right.

> Anyway, philosophically, I suspect that in this case you are right,
> BUT:
> a) Human intuition is notoriously bad about such things.

Which is why one of the very first pieces of asm code I ever wrote was a
routine to interrogate the (keyboard) timer chip on a PC, allowing
~microsecond-resolution timings.

After 25+ years of performance-optimized programming, I still get
regularly surprised when the actual performance of some piece of code
doesn't conform to what I initially guessed it would be. :-)

The most serious one was a couple of years ago when I wrote asm code for
the inner loop of one of the AES contenders, and didn't get even close
to (i.e. within a factor of 1.5 of) the speedup I expected.

OTOH, after implementing all of it (together with a couple of other
guys), including the outer loop & other housekeeping stuff, the final
speedup was 3X the best C code, which was better than my initial estimate.

Terje
--
- <Terje.Mathisen(a)hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"