interrupting for overflow and loop termination [Computer Architecture]

From: John Mashey on 21 Sep 2005 02:46

Peter Dickerson wrote:
> "glen herrmannsfeldt" <gah(a)ugcs.caltech.edu> wrote in message
> news:yM2dnfGDOKa1HK3eRVn-hw(a)comcast.com...

> > Don't most processors use register renaming when a register is
> > overwritten like this? There has to be some way to keep track
> > of which value is going where.
> >
> > Getting the right value into the real register for the interrupt
> > would be an extra challenge, though.
> >
> > -- glen
>
> I thought JM had explicitly included simple in-order pipelined processors.
> Such microarchitectures don't normally rename.

Yes, especially since:
1) "Most" distinct CPU designs are in-order issue, whether superscalar
or not.
The fraction of OOO designs is minuscule, although of course (due to
X86) they account for a lot of $.

2) "Most" actual CPU chips are in-order, since embedded chips rarely
use OOO, and outsell PC / system chips.

Also, for this newsgroup, I'd guess that if someone actually has a
chance to participate in a CPU design, it is much more likely to be
in-order (in an FPGA, or an SoC) than an OOO chip, as the latter are
not done at very many places. The number of people on the planet who
actually design OOO CPUs is a tiny fraction of the total who design
CPUs.

From: Iain McClatchie on 27 Sep 2005 23:29

Mash> THE GOOD CASE
Mash> If the ISA semantics follow the rules I described earlier
Mash> a) FP DIV and FP MUL stall until they are sure they don't cause an
Mash> exception. Then they run to completion.

Back when I worked on this stuff ('92-'94), it seemed that most
programs did not turn on any of the user-level exception triggers.
The only common exception was caused by denormalized inputs
requiring a trap to the kernel for emulation, since the hardware
(R8000 in this case) didn't do denorms. These traps caused quite
a lot of grief.

The R8000 had a 4-cycle multiply-add pipeline. I often wonder if
we would have experienced less grief with a 5-cycle multiply-add
pipe that could do input and output denorms without exceptions.
Performance would have been lower, register pressure would have
been higher... but every application run by folks that weren't
sure whether they could turn on flush-to-zero mode would have
gone faster anyway.

I never had exposure to data that would have told me if there
were more folks (more sales dollars, really) in the can-flush-
denorms-to-zero camp than in the don't-know and must-handle-
denorms camps. But my guess is that handling denorms in hardware
would have been the better choice. In hindsight, we probably had
the area to do it, too. (You need an extra output shifter, IIRC.)
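
To make the trade-off above concrete, here is a minimal Python sketch
contrasting IEEE 754 gradual underflow with flush-to-zero. It is a
software model only and says nothing about how the R8000 actually
behaved; the names is_denormal, flush_to_zero and ftz_multiply are
made up for illustration.

```python
import math
import sys

# Smallest positive normal double (2**-1022); anything nonzero below it is a denorm.
MIN_NORMAL = sys.float_info.min


def is_denormal(x: float) -> bool:
    """True if x is nonzero but too small to be a normal double."""
    return x != 0.0 and abs(x) < MIN_NORMAL


def flush_to_zero(x: float) -> float:
    """Model of flush-to-zero: replace a denorm with a zero of the same sign."""
    return math.copysign(0.0, x) if is_denormal(x) else x


def ftz_multiply(a: float, b: float) -> float:
    """Multiply with denormal inputs and outputs flushed (hypothetical FTZ model)."""
    return flush_to_zero(flush_to_zero(a) * flush_to_zero(b))


a = MIN_NORMAL
b = 0.5

gradual = a * b               # gradual underflow keeps 2**-1023 as a denorm
flushed = ftz_multiply(a, b)  # flush-to-zero loses the value entirely

print(is_denormal(gradual), gradual * 2.0 == a)
print(flushed, flushed * 2.0 == a)
```

With gradual underflow the halved value can be doubled back to the
original; with flush-to-zero it cannot, which is why applications that
don't know whether they depend on denorms can't safely turn the mode on.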

From: Jan Vorbrüggen on 28 Sep 2005 02:52

> The R8000 had a 4-cycle multiply-add pipeline. I often wonder if
> we would have experienced less grief with a 5-cycle multiply-add
> pipe that could do input and output denorms without exceptions.

Would a single-cycle extension to the pipeline have been enough? Was the
actual hardware (as built) already generating denorms, or did that
cause an exception as well?

Jan

From: Bernd Paysan on 28 Sep 2005 07:24

Jan Vorbrüggen wrote:

>> The R8000 had a 4-cycle multiply-add pipeline. I often wonder if
>> we would have experienced less grief with a 5-cycle multiply-add
>> pipe that could do input and output denorms without exceptions.
>
> Would a single cycle extension to the pipeline been enough? Was the
> actual hardware (as built) already generating denorms, or does that
> cause an exception as well?

Handling denorms requires another barrel shift operation - you find out that
your result doesn't fit into the required range, so you shift it right by
the amount by which the exponent is out of range.

If your MAC pipeline is multiply (carry save adder network), sum, shift,
add, count leading zeros, shift (normalize), shift (denorms), it can take
up to seven or eight cycles.
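
As a rough illustration of that last "shift (denorms)" stage, here is a
small Python sketch for IEEE 754 doubles. It assumes the earlier stages
have already produced a normalized 53-bit integer significand and an
unbounded exponent, and it assumes round-to-nearest-even. The names
shift_right_rne, denormalize and to_float are hypothetical, and this is
a software model, not a description of any particular pipeline.

```python
import math

EMIN = -1022   # minimum exponent of a normal IEEE 754 double
SIG_BITS = 53  # significand width, including the leading 1


def shift_right_rne(sig: int, shift: int) -> int:
    """Shift sig right by `shift` bits, rounding to nearest, ties to even."""
    if shift == 0:
        return sig
    kept = sig >> shift
    rem = sig & ((1 << shift) - 1)  # bits shifted out
    half = 1 << (shift - 1)         # weight of exactly half an ulp
    if rem > half or (rem == half and (kept & 1)):
        kept += 1
    return kept


def denormalize(sig: int, exp: int) -> tuple[int, int]:
    """Clamp a normalized result (2**52 <= sig < 2**53) to exponent >= EMIN.

    The value represented is sig * 2**(exp - (SIG_BITS - 1)).
    """
    if exp >= EMIN:
        return sig, exp              # already in range, no extra shift
    shift = EMIN - exp               # how far the exponent is out of range
    return shift_right_rne(sig, shift), EMIN


def to_float(sig: int, exp: int) -> float:
    """Convert the (sig, exp) pair back to a Python float for checking."""
    return math.ldexp(float(sig), exp - (SIG_BITS - 1))


# 1.0 * 2**-1023 is one binade below the normal range: shift right by one.
sig, exp = denormalize(1 << 52, -1023)
print(sig == 1 << 51, exp == EMIN, to_float(sig, exp) == math.ldexp(1.0, -1023))
```

The shift distance depends on the exponent computed by the preceding
normalize step, which is why this cannot simply be folded into that
earlier shift without more hardware or more latency.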

--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/
