From: glen herrmannsfeldt on
Terje Mathisen wrote:
> glen herrmannsfeldt wrote:

(snip)

>>In the case of a conditional branch based on the overflow bit,
>>INTO trap instruction, or overflow trap without any added instructions,
>>nothing after that can be committed until the status bit is available.

>>I would assume that a branch on overflow would be predicted
>> as not taken.

>>Other than the cost of the actual instruction bits, which are pretty
>>variable over different architectures, would you expect any difference
>>at runtime for the three cases?

> The only important difference is that INTO is effectively a
> predicted-not-taken call, making it very easy to return to the correct
> point after fixup.

I was only considering the cost of the instruction, especially
in the case that it isn't taken.

> To replace this with an inline branch instruction would require the
> capability to save the next (i.e. return) address.

If you have only one in a loop you know where it is...

> If your architecture can do this, then by all means use an inline
> branch-and-link instead of the INTO style check!

But how much cost is there in not being able to retire the instruction
until the overflow status is known? Assuming the usual out-of-order
execution, in-order retirement model.

> The only real difference in this case would be that the branch opcode
> probably takes 4 bytes, vs the single byte of INTO. However, as long as
> you have fixed 16 or 32-bit instruction sizes anyway, it really doesn't
> matter.

Imprecise interrupts were well hated in the days when they were
around, but consider the case where you really don't need to know
exactly where the interrupt is. Maybe a special ADDIIO (ADD with
Imprecise Interrupt On Overflow) instruction. You could use them in
a loop, and would know which iteration of the loop (put a barrier
instruction near the end of the loop), but not necessarily exactly
where.

Then again, maybe all you need is a sticky overflow bit. You could
do some set of calculations and test once at the end for any overflow,
and clear the sticky overflow bit at that time.
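
Something like this C sketch (the function name is made up; GCC/Clang's
__builtin_add_overflow stands in for a real hardware sticky flag):

    #include <stdbool.h>
    #include <stddef.h>

    /* OR each per-operation overflow indication into one sticky
       flag, and test it just once after the whole computation. */
    bool sum_ok(const int *a, size_t n, int *out)
    {
        int sum = 0;
        bool sticky = false;               /* the sticky overflow bit */
        for (size_t i = 0; i < n; i++)
            sticky |= __builtin_add_overflow(sum, a[i], &sum);
        *out = sum;
        return !sticky;      /* one test at the end, then start clean */
    }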

I don't see conditional interrupts as any better than conditional
branches as far as pipelined out-of-order processors are concerned.

-- glen

From: Terje Mathisen on
glen herrmannsfeldt wrote:

> Terje Mathisen wrote:
>> To replace this with an inline branch instruction would require the
>> capability to save the next (i.e. return) address.
>
> If you have only one in a loop you know where it is...

This would be a _very_ special case, since it would require a loop with
only a single int variable, or at least you must be able to prove that
only this int can ever overflow. In that case you can of course skip the
overflow testing.

>> If your architecture can do this, then by all means use an inline
>> branch-and-link instead of the INTO style check!
>
> But how much cost is there in not being able to retire the instruction
> until the overflow status is known? Assuming the usual out-of-order
> execution, in-order retirement model.

Not much: Processing can easily continue for at least one or two cycles,
forwarding the preliminary result to the next user, and as long as the
vast majority of these ints won't actually overflow, the delayed
retirement does not correspond to any actual branch misses.

> Then again, maybe all you need is a sticky overflow bit. You could
> do some set of calculations and test once at the end for any overflow,
> and clear the sticky overflow bit at that time.

Nice idea. If overflows are really rare, then it would be a win to redo
the entire loop to save testing on every operation.
>
> I don't see conditional interrupts as any better than conditional
> branches as far as pipelined out-of-order processors are concerned.

Rather the opposite, since branch handling is optimized with serious hw
resources anyway.

Terje
--
- <Terje.Mathisen(a)hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"
From: Andy Freeman on
Terje Mathisen wrote:
> Eliot Miranda wrote:
> > highest dynamic frequency for integer math is likely to be for iteration.
>
> Right, it is hard to imagine how it could be otherwise.

Most checks on iteration vars can be pulled out of the loop. In most
of the others, it can be done every <large constant> iterations.
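
For instance (a made-up C sketch, not from the post): an induction
variable stepping by +1 from 0 to n-1 provably never overflows, so
once that proof is done outside the loop, no per-iteration test on
the iteration var remains at all:

    /* i <= n-1 <= INT_MAX-1 inside the body, so i++ cannot
       overflow; the iteration-variable check has been pulled
       out of the loop entirely. */
    void scale_by_two(int *a, int n)
    {
        for (int i = 0; i < n; i++)
            a[i] *= 2;   /* the multiply may still need its own check */
    }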

From: Terje Mathisen on
Andy Freeman wrote:

> Terje Mathisen wrote:
>
>>Eliot Miranda wrote:
>>
>>>highest dynamic frequency for integer math is likely to be for iteration.
>>
>>Right, it is hard to imagine how it could be otherwise.
>
>
> Most checks on iteration vars can be pulled out of the loop. In most
> of the others, it can be done every <large constant> iterations.

I.e., all of the JIT optimizations that let a Java implementation catch
overflows can also be used to convert those variables to bigints.

By reserving the all-zero tag for integers you can even use regular
ADD/SUB operations, and MUL/DIV with just a shift fixup.
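
A sketch of that layout in C (illustrative only, not any particular
VM's object format; assumes the usual two's-complement arithmetic):

    #include <stdint.h>

    typedef intptr_t oop;     /* one tagged machine word */
    enum { TAG_BITS = 2, TAG_MASK = (1 << TAG_BITS) - 1 };

    static int is_small_int(oop x)        { return (x & TAG_MASK) == 0; }
    static oop make_small_int(intptr_t v) { return v << TAG_BITS; }

    /* Tag 00 + tag 00 = tag 00, so ADD/SUB are the plain machine
       ops; a real JIT would also branch on the overflow flag here
       and fall back to the bigint path. */
    static oop tagged_add(oop a, oop b) { return a + b; }
    static oop tagged_sub(oop a, oop b) { return a - b; }

    /* (v1<<2)*(v2<<2) would leave the result shifted twice, hence
       the shift fixup: unshift one operand before multiplying. */
    static oop tagged_mul(oop a, oop b) { return a * (b >> TAG_BITS); }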

Is this how current implementations do it on x86?

Terje

--
- <Terje.Mathisen(a)hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"
From: John Mashey on

Terje Mathisen wrote:
> glen herrmannsfeldt wrote:

> This would be a _very_ special case, since it would require a loop with
> only a single int variable, or at least you must be able to prove that
> only this int can ever overflow. In that case you can of course skip the
> overflow testing.
>
> >> If your architecture can do this, then by all means use an inline
> >> branch-and-link instead of the INTO style check!
> >
> > But how much cost is there in not being able to retire the instruction
> > until the overflow status is known? Assuming the usual out-of-order
> > execution, in-order retirement model.
>
> Not much: Processing can easily continue for at least one or two cycles,
> forwarding the preliminary result to the next user, and as long as the
> vast majority of these ints won't actually overflow, the delayed
> retirement does not correspond to any actual branch misses.
>
> > Then again, maybe all you need is a sticky overflow bit. You could
> > do some set of calculations and test once at the end for any overflow,
> > and clear the sticky overflow bit at that time.
>
> Nice idea. If overflows are really rare, then it would be a win to redo
> the entire loop to save testing on every operation.

People have sometimes used sticky overflow bits. To me, the hierarchy
is:
a) Precise exceptions: you know exactly where the exception occurred.
b) Sticky bits: you don't know exactly where, but you get some bound.
c) No sticky bits: explicit tests required everywhere, and hence, in
many cases, the tests are omitted for speed and code size.

As a software guy, I am, of course, biased in favor of a). As a
software/hardware guy, I also realize that sometimes one may have to
compromise, but one should never be *looking* for reasons to compromise
on a) - at most, one might reluctantly admit that introducing some
imprecision is the lesser evil, grit one's teeth and bear it.

At MIPS, I and the rest of the OS group were among the loudest voices
in demanding precise exceptions everywhere, most of us having had to
deal with weird cases and hardware bugs and related software bugs too
many times in past lives. In the long run, this did turn out to help
chip verification as well.

To be more specific, we always wished for:

When a user-level instruction X caused an exception, and a trap to the
OS:

a) The Exception Program Counter (EPC) was set to point at the
instruction.
b) The CAUSE register was set to indicate the reason.
c) All instructions before X had been completed (or equivalent).
d) X itself had had no user-visible side-effects, i.e., there were no
partial stores, no register writebacks, no auto-increments, no shadow
registers to be disentangled, i.e., X had no effect except to cause
the trap.

We didn't quite get that, as:
a) and b): if X was in a Branch Delay slot, the EPC pointed at X-4, and
the BD-bit was set in the CAUSE register. Although this was sometimes
a pain, and proved to be a source of designer complaint in some later
implementations, software people viewed it as tolerable, even though
it sometimes meant messy debugger code and trickier emulation code (as
in floating point emulation done on systems lacking FPUs).

c) This wasn't true from the hardware viewpoint, but the CPU provided
this illusion to the programmer, so that was viewed as OK.
Specifically, consider the sequence:

Y: (long-running FP operation, or integer multiply/divide)
....
X, causing trap.

At the time of the trap, Y might still be running in an independent
functional unit. However, this was all carefully defined so that one
of two things held:

Either the instruction was defined so that it could not ever cause an
exception. In MIPS, the divide doesn't raise a divide-by-zero;
instead, if the compiler sees variable1/variable2, and can't be sure
variable2 is non-zero, it generates code like:

DIV  rs,rt              # start the divide; the hardware never traps
BEQZ rt,divide-by-zero  # explicit software test for a zero divisor
....

Or the exception was headed off by stalling: in the floating point
cases, the insistence on precise exceptions ended up encouraging the
innovative "check the exponent fields quickly and stall the CPU if
there is any chance of an exception" patent of Tom Riordan's that I've
mentioned before.

In any of these cases, a reference to a register that was to be
written by one of these asynchronous units simply caused a stall.
Interrupt code would save away the regular integer registers first, by
which time any lingering MUL/DIV or FP ops would likely have completed.

Of course, a "fast-user-trap" feature would have to modify some of the
above, and I still wish we'd had time to think hard enough about that
early enough [2Q85].

Anyway, the message of all this is:

Start with the goal of keeping the exception model simple and precise.
Only back off reluctantly. I've posted many a time that dealing with
exceptions and their weird interactions has long been a source of
problems, as even experienced humans miss things. Although there are
some things about MIPS that I'd do differently if I could do it over,
THIS wasn't one of them.