From: Stephen Fuld on

"John Mashey" <old_systems_guy(a)yahoo.com> wrote in message
news:1127156812.929231.226160(a)f14g2000cwb.googlegroups.com...

snip

> People have sometimes used sticky overflow bits. To me, the hierarchy
> is:
> a) Precise exceptions: you know exactly where the exception occurred.
> b) Sticky bits, you don't know exactly where, but get some bound.
> c) No sticky-bits, explicit tests required everywhere, and hence, in
> many cases, the tests are omitted for speed and code size.
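
(For what it's worth, (b) above is roughly the model C99's <fenv.h>
gives you for the FP flags -- clear the sticky flag, run a block of work
with no per-operation tests, and test once at the end; you learn that
something overflowed somewhere in the block, but not exactly where.
The function and the overflow policy here are just made up for the
illustration:

    #include <fenv.h>
    #include <math.h>
    #pragma STDC FENV_ACCESS ON

    double dot(const double *a, const double *b, int n)
    {
        double s = 0.0;
        feclearexcept(FE_OVERFLOW);      /* clear the sticky flag       */
        for (int i = 0; i < n; i++)
            s += a[i] * b[i];            /* no per-operation tests      */
        if (fetestexcept(FE_OVERFLOW))   /* overflowed *somewhere* in   */
            s = HUGE_VAL;                /* the loop; that's the bound  */
        return s;
    }
)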

Is it totally ridiculous to have a design that kept the instruction address
"with" the instruction as it is executed? Then, no matter in what order the
instructions were executed, or the state of other instructions or parts of
the CPU, the CPU could report the exact address of the instruction causing
the exception. It would certainly require an additional internal register
(to hold the instruction address) within each FU and additional space in the
renaming table, etc., but they could be written in parallel with the other
stuff that had to be written, so it might not cost any additional time. The
only time these registers would be read is upon exceptions.
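
Hand-waving in C rather than RTL, I'm imagining each FU slot carrying
something like this (field names invented, obviously):

    #include <stdint.h>

    /* one in-flight operation inside a functional unit */
    struct fu_slot {
        uint64_t pc;      /* address of the instruction occupying this FU */
        uint16_t opcode;
        uint8_t  dest;    /* (renamed) destination register               */
        uint8_t  src1, src2;
    };

The pc field gets written at issue, in parallel with everything else the
issue logic writes, and is only ever read if this FU raises an exception.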

--
- Stephen Fuld
e-mail address disguised to prevent spam


From: glen herrmannsfeldt on
Stephen Fuld wrote:
(snip regarding overflow and possibly imprecise interrupts)

> Is it totally ridiculous to have a design that kept the instruction address
> "with" the instruction as it is executed? Then, no matter in what order the
> instructions were executed, or the state of other instructions or parts of
> the CPU, the CPU could report the exact address of the instruction causing
> the exception. It would certainly require an additional internal register
> (to hold the instruction address) within each FU and additional space in the
> renaming table, etc., but they could be written in parallel with the other
> stuff that had to be written, so it might not cost any additional time. The
> only time these registers would be read is upon exceptions.

With large address space machines that is a lot of extra state to
hold, but it could be done.

The only machine I know in some detail with imprecise interrupts is
the 360/91. It was supposed to fit within the S/360 interrupt model,
which has room for only one address: the address of the next
instruction to be executed. When an interrupt happens, all partially
executed instructions must be finished before the interrupt is taken.
That can raise further exceptions, producing the dreaded multiple
imprecise interrupt. There are indicators for which exceptions
occurred, but not how many of each or where.

-- glen

From: John Mashey on
Stephen Fuld wrote:
> "John Mashey" <old_systems_guy(a)yahoo.com> wrote in message
> news:1127156812.929231.226160(a)f14g2000cwb.googlegroups.com...
>
> snip
>
> > People have sometimes used sticky overflow bits. To me, the hierarchy
> > is:
> > a) Precise exceptions: you know exactly where the exception occurred.
> > b) Sticky bits, you don't know exactly where, but get some bound.
> > c) No sticky-bits, explicit tests required everywhere, and hence, in
> > many cases, the tests are omitted for speed and code size.
>
> Is it totally ridiculous to have a design that kept the instruction address
> "with" the instruction as it is executed? Then, no matter in what order the
> instructions were executed, or the state of other instructions or parts of
> the CPU, the CPU could report the exact address of the instruction causing
> the exception. It would certainly require an additional internal register
> (to hold the instruction address) within each FU and additional space in the
> renaming table, etc., but they could be written in parallel with the other
> stuff that had to be written, so it might not cost any additional time. The
> only time these registers would be read is upon exceptions.

Of course not (totally ridiculous, that is), as PCs have to be kept
around (in some form or other) anyway, but implementation always
matters.

1) As a reminder, not all CPU designs are complex speculative OOO CPUs.
In fact, only a tiny fraction of distinct CPU designs are such. Some
extra stuff doesn't cost much in a big OOO CPU, but percentage-wise, it
may cost more in a simple pipeline or even a relatively simple 2-issue
in-order superscalar.

In some ways, an OOO design handles exceptions "easier" than the other
designs, in that most instructions are executed speculatively anyway,
and one needs all of the mechanisms for in-order graduation and
unwinding mispredicted code anyway. But consider the (common) designs
with the following characteristics:
- in-order issue
- multiple functional units, with long latencies and potential
out-of-order completion

One may track the PC for each instruction, but suppose there are 4
independent FUs (like integer add, integer mul/div, FP add/mul, FP
divide/sqrt), of which the latter 3 naturally have multi-cycle
operations, some very long, and of course, there may be queueing
effects on FUs.

Suppose your code looks like:

1: FP DIV f1 = f2/f3 some number of clocks, more than FP MUL
..... instructions not depending on f1
2: FP MUL f2 = f3*f4 note this clobbers f2 [typically 2-8 clocks]
..... instructions not depending on f2
3: INT DIV r1 = r2/r3 [probably ~64 clocks on 64-bit CPU]
..... instructions not depending on r1
4: INT ADD r2 = r3+r4 and this clobbers r2

Suppose one has an ISA in which every one of these can cause an
exception [on MIPS, INT DIV can't, but the rest can.]

THE GOOD CASE
If the ISA semantics follow the rules I described earlier
a) FP DIV and FP MUL stall until they are sure they don't cause an
exception. Then they run to completion.
b) The INT DIV stalls until it does a test-for-zero, and then it runs
as long as necessary. It's quite possible that 1, 2, and 3 are all
executing concurrently.

Then:
1) The behavior is identical on all implementations regardless of the
number and latency of functional units.
2) Likewise, if the OS has to redo the offending instruction and
continue execution, it can, relatively straightforwardly. Why would
one want to do that? Isn't an exception The End?
Nope: in particular, the FPU might not implement all of IEEE 754; it
might implement only the common cases and trap on data-dependent but
infrequent ones [as has been done in at least some implementations of
both MIPS and SPARC].

In any case, the IEEE standard recommends that users be allowed to
specify a trap handler for any of the standard exceptions, such that it
can compute a substitute result.
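
As a concrete illustration of what that buys you when exceptions are
precise -- Linux/glibc-specific and purely a sketch, not how any
particular kernel or libm actually does it -- a user program can unmask
an IEEE exception and be handed the exact faulting address:

    #define _GNU_SOURCE        /* for feenableexcept(), a glibc extension */
    #include <fenv.h>
    #include <signal.h>
    #include <stdio.h>
    #include <unistd.h>

    static void fpe_handler(int sig, siginfo_t *si, void *ctx)
    {
        (void)sig; (void)ctx;
        /* With precise exceptions the kernel can report the exact
           faulting instruction (si_addr) and which IEEE exception it
           was (si_code: FPE_FLTOVF, FPE_FLTDIV, ...).  A real handler
           would compute a substitute result and patch the context;
           fprintf isn't async-signal-safe, but this is only a demo.  */
        fprintf(stderr, "FP trap %d at %p\n", si->si_code, si->si_addr);
        _exit(1);              /* don't return and re-execute the fault */
    }

    int main(void)
    {
        struct sigaction sa = {0};
        sa.sa_sigaction = fpe_handler;
        sa.sa_flags = SA_SIGINFO;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGFPE, &sa, NULL);
        feenableexcept(FE_OVERFLOW | FE_DIVBYZERO);
        volatile double x = 1.0, y = 0.0;
        return (int)(x / y);   /* traps in main(), precisely */
    }

With imprecise exceptions, si_addr is at best a bound, and "patch the
context and continue" stops being meaningful.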

THE HARD CASE
Assume that each of instructions 1-4 above could raise an exception upon
completion, not by stalling. [As far as I can tell from a quick read,
HP PA-RISC is somewhat like this.]

Here are some exercises for the reader:

1) Enumerate the number and order of exceptions that might happen.


2) Write the IEEE trap handlers that handle traps from the FP
operations.
Note that if 1 traps, but late enough, one of its inputs is already
overwritten by the output of 2, which means the trap handler can't get
at the inputs.

2a) Design the logic to keep track of the dependencies above, i.e.,
requiring the tracking of any register used as an *input* by any
uncompleted operation, and stalling any operation that modifies such a
register (which, of course, wrecks the very concurrency that one is
trying to get). [A toy sketch of this bookkeeping appears below, after
the note on exercise 1.]

3) When a trap occurs, does the CPU complete any pending operations?
Suppose one of them causes an exception also? Design all of the state
provided to the kernel, and to the user trap handler.

4) In a family of CPUs, does the outcome of a program depend on the
relative latencies of various functional units? Discuss the
circumstances under which this might be OK.

Note that the answer to 1 is: 0-4 exceptions, and in any order,
depending on the latencies, and also on the number of intervening
instructions.
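
And just to make 2a concrete -- this is a toy C model of the
bookkeeping, invented purely for illustration, nothing like real issue
logic -- the interlock amounts to refusing to issue any instruction
whose destination is still wanted as an input by something in flight,
so that a trap handler can still find the original operands:

    #include <stdbool.h>
    #include <stdint.h>

    #define NREGS 32
    /* per register: issued-but-uncompleted operations reading it */
    static uint8_t inflight_readers[NREGS];

    static bool can_issue(int dest)
    {
        /* Stall any op that would overwrite a register an earlier,
           uncompleted op still needs as an input -- exactly the
           serialization that throws away the FU concurrency. */
        return inflight_readers[dest] == 0;
    }

    static void on_issue(int src1, int src2)
    {
        inflight_readers[src1]++;
        inflight_readers[src2]++;
    }

    static void on_complete(int src1, int src2)
    {
        inflight_readers[src1]--;
        inflight_readers[src2]--;
    }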

MESSAGE: for some reason, many people want to propose complex
mechanisms. I say again: especially in this turf, simplicity really
pays off, and complexity is only accepted reluctantly, at least by
people who actually have to implement the hardware and software. One
can make complex setups work, but it's expensive.

From: glen herrmannsfeldt on
John Mashey wrote:

(snip)

> Suppose your code looks like:

> 1: FP DIV f1 = f2/f3 some number of clocks, more than FP MUL
> .... instructions not depending on f1
> 2: FP MUL f2 = f3*f4 note this clobbers f2 [typically 2-8 clocks]
> .... instructions not depending on f2
> 3: INT DIV r1 = r2/r3 [probably ~64 clocks on 64-bit CPU]
> .... instructions not depending on r1
> 4: INT ADD r2 = r3+r4 and this clobbers r2

Don't most processors use register renaming when a register is
overwritten like this? There has to be some way to keep track
of which value is going where.

Getting the right value into the real register for the interrupt
would be an extra challenge, though.
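
Roughly (toy C, invented names, not any real machine's tables): the
front end keeps a speculative map from architectural to physical
registers, and retirement keeps a second, in-order copy, which is where
the "real" register values for an interrupt have to come from:

    #include <stdint.h>

    #define NARCH 32
    #define NPHYS 64

    static uint8_t spec_map[NARCH];    /* mapping at the newest issued insn */
    static uint8_t retire_map[NARCH];  /* mapping at the last retired insn  */
    static uint8_t free_list[NPHYS];   /* filling of the free list omitted  */
    static int     free_top = NPHYS;

    /* Later writers of f2 or r2 just get fresh physical registers,
       so the older values the divides still need stay readable.     */
    static uint8_t rename_dest(int arch)
    {
        uint8_t p = free_list[--free_top];
        spec_map[arch] = p;
        return p;
    }

    /* On an exception, everything younger than the faulting instruction
       is squashed and retire_map is what the handler gets to see --
       which is exactly the challenge of putting the right values back. */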

-- glen

From: Peter Dickerson on
"glen herrmannsfeldt" <gah(a)ugcs.caltech.edu> wrote in message
news:yM2dnfGDOKa1HK3eRVn-hw(a)comcast.com...
> John Mashey wrote:
>
> (snip)
>
> > Suppose your code looks like:
>
> > 1: FP DIV f1 = f2/f3 some number of clocks, more than FP MUL
> > .... instructions not depending on f1
> > 2: FP MUL f2 = f3*f4 note this clobbers f2 [typically 2-8 clocks]
> > .... instructions not depending on f2
> > 3: INT DIV r1 = r2/r3 [probably ~64 clocks on 64-bit CPU]
> > .... instructions not depending on r1
> > 4: INT ADD r2 = r3+r4 and this clobbers r2
>
> Don't most processors use register renaming when a register is
> overwritten like this? There has to be some way to keep track
> of which value is going where.
>
> Getting the right value into the real register for the interrupt
> would be an extra challenge, though.
>
> -- glen

I thought JM had explicitly included simple in-order pipelined processors.
Such microarchitectures don't normally rename.

Peter