From: Stephen Fuld on 19 Sep 2005 17:09

"John Mashey" <old_systems_guy(a)yahoo.com> wrote in message
news:1127156812.929231.226160(a)f14g2000cwb.googlegroups.com...

snip

> People have sometimes used sticky overflow bits. To me, the hierarchy is:
> a) Precise exceptions: you know exactly where the exception occurred.
> b) Sticky bits: you don't know exactly where, but you get some bound.
> c) No sticky bits: explicit tests are required everywhere, and hence, in
> many cases, the tests are omitted for speed and code size.

Is it totally ridiculous to have a design that kept the instruction address
"with" the instruction as it is executed? Then, no matter in what order the
instructions were executed, or the state of other instructions or parts of
the CPU, the CPU could report the exact address of the instruction causing
the exception. It would certainly require an additional internal register
(to hold the instruction address) within each FU, and additional space in
the renaming table, etc., but these could be written in parallel with the
other state that has to be written, so it may not cost any additional time.
The only time these registers would be read is upon exceptions.

--
- Stephen Fuld
e-mail address disguised to prevent spam
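Fuld's proposal above can be sketched in a few lines. This is a toy model, not any real microarchitecture; the class and field names are invented for illustration. The point is simply that each functional unit latches the faulting instruction's address at issue time and reads it back only when an exception is actually raised.

```python
# Toy model (hypothetical, not a real ISA) of a per-FU instruction-address
# register: the PC travels "with" the instruction, so an exception can
# always report the exact faulting address, regardless of execution order.

class FunctionalUnit:
    def __init__(self, name):
        self.name = name
        self.inst_addr = None  # the extra internal register per FU

    def issue(self, pc, op):
        # Latched in parallel with the operands, so it adds no time
        # on the issue path.
        self.inst_addr = pc
        self.op = op

    def execute(self, a, b):
        try:
            return self.op(a, b)
        except ZeroDivisionError:
            # Read only on an exception: report the exact address.
            raise RuntimeError(
                f"exception in {self.name} at PC {self.inst_addr:#x}")

fu = FunctionalUnit("INT DIV")
fu.issue(0x1000, lambda a, b: a // b)
try:
    fu.execute(1, 0)
except RuntimeError as e:
    print(e)  # exception in INT DIV at PC 0x1000
```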
From: glen herrmannsfeldt on 20 Sep 2005 00:16

Stephen Fuld wrote:

(snip regarding overflow and possibly imprecise interrupts)

> Is it totally ridiculous to have a design that kept the instruction address
> "with" the instruction as it is executed? Then, no matter in what order the
> instructions were executed, or the state of other instructions or parts of
> the CPU, the CPU could report the exact address of the instruction causing
> the exception. It would certainly require an additional internal register
> (to hold the instruction address) within each FU and additional space in
> the renaming table, etc., but they could be written in parallel with the
> other stuff that had to be written, so it may not cause any additional
> time. The only time these registers would be read is upon exceptions.

With large-address-space machines that is a lot of extra state to hold, but
it could be done.

The only machine I know in some detail with imprecise interrupts is the
360/91. It was supposed to fit within the S/360 interrupt model, which has a
place for only one address: the address of the next instruction to be
executed. When an interrupt happens, all partially executed instructions
must be finished before the interrupt is taken. This can result in more
exceptions, resulting in the dreaded multiple imprecise interrupt. There are
indicators for which exceptions occurred, but not how many of each or where.

-- glen
From: John Mashey on 20 Sep 2005 02:41

Stephen Fuld wrote:
> "John Mashey" <old_systems_guy(a)yahoo.com> wrote in message
> news:1127156812.929231.226160(a)f14g2000cwb.googlegroups.com...
>
> snip
>
> > People have sometimes used sticky overflow bits. To me, the hierarchy is:
> > a) Precise exceptions: you know exactly where the exception occurred.
> > b) Sticky bits: you don't know exactly where, but you get some bound.
> > c) No sticky bits: explicit tests are required everywhere, and hence, in
> > many cases, the tests are omitted for speed and code size.
>
> Is it totally ridiculous to have a design that kept the instruction address
> "with" the instruction as it is executed? Then, no matter in what order the
> instructions were executed, or the state of other instructions or parts of
> the CPU, the CPU could report the exact address of the instruction causing
> the exception. It would certainly require an additional internal register
> (to hold the instruction address) within each FU and additional space in
> the renaming table, etc., but they could be written in parallel with the
> other stuff that had to be written, so it may not cause any additional
> time. The only time these registers would be read is upon exceptions.

Of course not (totally ridiculous, that is), as PCs have to be kept around
(in some form or other) anyway, but implementation always matters.

1) As a reminder, not all CPU designs are complex speculative OOO CPUs. In
fact, only a tiny fraction of distinct CPU designs are. Some extra state
doesn't cost much in a big OOO CPU, but percentage-wise it may cost more in
a simple pipeline, or even in a relatively simple 2-issue in-order
superscalar. In some ways, an OOO design handles exceptions "easier" than
the other designs, in that most instructions are executed speculatively
anyway, and one needs all of the mechanisms for in-order graduation and for
unwinding mispredicted code anyway.
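The precise-vs-sticky distinction in Mashey's hierarchy can be made concrete with a toy example. This is an invented 8-bit-overflow illustration, not any particular ISA: the precise variant knows the exact faulting instruction, while the sticky-bit variant only learns, after the fact, that an overflow happened somewhere in the block.

```python
# Toy contrast (hypothetical 8-bit adds, not a real ISA) between
# a) a precise exception and b) a sticky overflow bit.

LIMIT = 1 << 8

def run_precise(ops):
    # a) Precise: trap at the exact instruction that overflowed.
    for pc, (a, b) in enumerate(ops):
        if a + b >= LIMIT:
            return ("trap", pc)
    return ("ok", None)

def run_sticky(ops):
    # b) Sticky bit: one flag, OR-ed across all ops and tested once at
    # the end; you learn only that an overflow happened in this block.
    sticky = False
    for a, b in ops:
        sticky |= (a + b >= LIMIT)
    return "overflow somewhere" if sticky else "ok"

ops = [(1, 2), (200, 100), (3, 4)]
print(run_precise(ops))  # ('trap', 1) -- exact location
print(run_sticky(ops))   # 'overflow somewhere' -- only a bound
```

Case c) of the hierarchy would correspond to testing `a + b >= LIMIT` explicitly after every single add, which is exactly the cost that tempts programmers to omit the tests.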
But consider the (common) designs with the following characteristics:
- in-order issue
- multiple functional units, with long latencies and potential
  out-of-order completion

One may track the PC for each instruction, but suppose there are 4
independent FUs (like integer add, integer mul/div, FP add/mul, and FP
divide/sqrt), of which the latter 3 naturally have multi-cycle operations,
some very long, and of course there may be queueing effects on the FUs.

Suppose your code looks like:

1: FP DIV  f1 = f2/f3   some number of clocks, more than FP MUL
   ....  instructions not depending on f1
2: FP MUL  f2 = f3*f4   note this clobbers f2 [typically 2-8 clocks]
   ....  instructions not depending on f2
3: INT DIV r1 = r2/r3   [probably ~64 clocks on a 64-bit CPU]
   ....  instructions not depending on r1
4: INT ADD r2 = r3+r4   and this clobbers r2

Suppose one has an ISA in which every one of these can cause an exception
[on MIPS, INT DIV can't, but the rest can].

THE GOOD CASE

If the ISA semantics follow the rules I described earlier:
a) FP DIV and FP MUL stall until they are sure they don't cause an
   exception. Then they run to completion.
b) The INT DIV stalls until it does a test-for-zero, and then it runs as
   long as necessary.

It's quite possible that 1, 2, and 3 are all executing concurrently. Then:
1) The behavior is identical on all implementations, regardless of the
   number and latency of functional units.
2) Likewise, if the OS has to redo the offending instruction and continue
   execution, it can, relatively straightforwardly.

Why would one want to do that? Isn't an exception The End? Nope: in
particular, the FPU might not implement all of IEEE 754; it might implement
the common cases and trap on data-dependent but infrequent cases [as has
been done in at least some implementations of both MIPS and SPARC]. In any
case, the IEEE standard recommends that users be allowed to specify a trap
handler for any of the standard exceptions, such that it can compute a
substitute result.
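The "good case" semantics can be sketched as follows. This is a toy model with invented names, not a description of any shipping pipeline: every operation does all of its exception checks at issue, before occupying its functional unit, so once several long-latency ops are in flight none of them can trap, and completion order stops mattering architecturally.

```python
# Sketch (toy model, assumptions invented) of the "good case": all
# exception checks happen at issue, at a known PC. After that, the op
# runs to completion, however many clocks it takes, so instructions may
# finish out of order with identical architectural results.

def issue(pc, op, a, b):
    # Stall point: e.g. the test-for-zero for a divide happens here,
    # so any trap names the exact PC.
    if op in ("idiv", "fdiv") and b == 0:
        raise RuntimeError(f"divide-by-zero trapped precisely at PC {pc}")
    # From here on, the operation is guaranteed trap-free.
    return {"idiv": a // b, "fdiv": a / b, "fmul": a * b}[op]

program = [("fdiv", 6.0, 3.0), ("fmul", 3.0, 4.0), ("idiv", 7, 2)]
results = [issue(pc, *spec) for pc, spec in enumerate(program)]
print(results)  # [2.0, 12.0, 3], regardless of FU latencies
```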
THE HARD CASE

Assume that instructions 1-4 above could each raise an exception upon
completion, rather than by stalling. [As far as I can tell from a quick
read, HP PA-RISC is somewhat like this.]

Here are some exercises for the reader:

1) Enumerate the number and order of exceptions that might happen.

2) Write the IEEE trap handlers that handle traps from the FP operations.
Note that if 1 traps, but late enough, one of its inputs has already been
overwritten by the output of 2, which means the trap handler can't get at
the inputs.

2a) Design the logic to keep track of the dependencies above, i.e.,
requiring the tracking of any register used as an *input* by any
uncompleted operation, and stalling any operation that modifies such a
register (which, of course, wrecks the very concurrency one is trying to
get).

3) When a trap occurs, does the CPU complete any pending operations?
Suppose one of them causes an exception also? Design all of the state
provided to the kernel, and to the user trap handler.

4) In a family of CPUs, does the outcome of a program depend on the
relative latencies of the various functional units? Discuss the
circumstances under which this might be OK.

Note that the answer to 1 is: 0-4 exceptions, and in any order, depending
on the latencies, and also on the number of intervening instructions.

MESSAGE: for some reason, many people want to propose complex mechanisms.
I say again: especially in this turf, simplicity really pays off, and
complexity is only accepted reluctantly, at least by people who actually
have to implement the hardware and software. One can make complex setups
work, but it's expensive.
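Exercises 1 and 4 above can be made concrete with a toy timing model. The latency numbers and issue clocks here are invented for illustration: with trap-on-completion semantics, the order in which exceptions are delivered is simply the completion order, and it changes when the FU latencies change.

```python
# Toy timing model (invented latencies) for the hard case: exceptions
# are raised at completion, so their order depends on FU latencies.

def completion_order(latencies):
    # Each entry: (instruction label, issue clock, latency in clocks).
    insts = [("FP DIV", 0, latencies["fdiv"]),
             ("FP MUL", 1, latencies["fmul"]),
             ("INT DIV", 2, latencies["idiv"])]
    # Completion clock = issue clock + latency; any traps would be
    # delivered in this order.
    return [name for name, issued, lat in
            sorted(insts, key=lambda t: t[1] + t[2])]

# Two members of the same CPU family, differing only in latencies,
# deliver the same program's exceptions in different orders:
print(completion_order({"fdiv": 30, "fmul": 4, "idiv": 64}))
print(completion_order({"fdiv": 20, "fmul": 4, "idiv": 10}))
```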
From: glen herrmannsfeldt on 20 Sep 2005 17:38

John Mashey wrote:

(snip)

> Suppose your code looks like:
>
> 1: FP DIV  f1 = f2/f3   some number of clocks, more than FP MUL
>    ....  instructions not depending on f1
> 2: FP MUL  f2 = f3*f4   note this clobbers f2 [typically 2-8 clocks]
>    ....  instructions not depending on f2
> 3: INT DIV r1 = r2/r3   [probably ~64 clocks on a 64-bit CPU]
>    ....  instructions not depending on r1
> 4: INT ADD r2 = r3+r4   and this clobbers r2

Don't most processors use register renaming when a register is overwritten
like this? There has to be some way to keep track of which value is going
where.

Getting the right value into the real register for the interrupt would be
an extra challenge, though.

-- glen
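For readers unfamiliar with the mechanism glen is asking about, a rename table can be sketched in a few lines. This is a hypothetical OOO-style model with invented names, and it deliberately omits physical-register reclamation: the point is only that each new write to an architectural register (the f2 clobber in Mashey's example) gets a fresh physical register, so the write-after-write hazard disappears.

```python
# Minimal rename-table sketch (hypothetical OOO model; no reclamation).
# Successive writes to the same architectural register get distinct
# physical registers, so the FP MUL's write to f2 need not wait for
# the FP DIV's read of f2.

class RenameTable:
    def __init__(self, num_phys):
        self.map = {}                      # architectural -> physical
        self.free = list(range(num_phys))  # free physical registers

    def read(self, arch):
        return self.map[arch]

    def write(self, arch):
        # Allocate a fresh physical register for each new definition.
        phys = self.free.pop(0)
        self.map[arch] = phys
        return phys

rt = RenameTable(8)
rt.write("f2")        # f2 holds the FP DIV's input
old = rt.read("f2")
new = rt.write("f2")  # the FP MUL's result goes to a different phys reg
print(old != new)     # True: the clobber is no longer a WAW hazard
```

As Peter notes in the follow-up, simple in-order pipelines don't normally carry this machinery at all, which is part of Mashey's point about where the costs land.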
From: Peter Dickerson on 21 Sep 2005 02:37
"glen herrmannsfeldt" <gah(a)ugcs.caltech.edu> wrote in message news:yM2dnfGDOKa1HK3eRVn-hw(a)comcast.com... > John Mashey wrote: > > (snip) > > > Suppose your code looks like: > > > 1: FP DIV f1 = f2/f3 some number of clocks, more than FP MUL > > .... instructions not depending on f1 > > 2: FP MUL f2 = f3*f4 note this clobbers f2 [typically 2-8 clocks] > > .... instructions not depending on f2 > > 2: INT DIV r1 = r2/r3 [probably ~64 clocks on 64-bit CPU] > > .... instructions not depending on r1 > > 4: INT ADD r2 = r3*r4 and this clobbers r2 > > Don't most processors use register renaming when a register is > overwritten like this? There has to be some way to keep track > of which value is going where. > > Getting the right value into the real register for the interrupt > would be an extra challenge, though. > > -- glen I thought JM had explicitly included simple in-order pipelined processors. Such microarchitectures don't normally rename. Peter |