From: Iain McClatchie on
Mash> There is a great deal of pushback in introducing features that
Mash> might add gate delays in awkward places, of which two are:
Mash> a) Something only computable on the *output* of an ALU
Mash> operation
Mash> b) The result of a load operation

Mash> In many implementations, such paths may be among the critical
Mash> paths. Sometimes, the need to get a trap indication from an
Mash> ALU, FP ALU, Load/store unit to the instruction fetch unit
Mash> may create a long wire that causes serious angst, or yelling
Mash> in design meetings.

Hmm... a feature that hangs some logic on the output of the ALU or
load pipe, and causes a pipe flush and IF retarget if the logic
detects some condition.

I don't think this is a problem, Mash. We're already doing this
for integer overflow and various floating-point exceptions. Suppose
for a moment that the additional complexity of the feature added a
pipe stage to this recurrence... in an OoO core, who cares? GPR
writeback is unaffected, you just have more logic writing to the tag
bits in the reorder buffer.

It's not like we're going to see one or more exceptions per 1000
instructions... right?
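
For concreteness, a minimal C sketch (my example, not from the thread)
of the kind of check that hangs off the ALU output: signed-add overflow
in two's complement is a function of the operand and result sign bits,
so it can only be evaluated once the result exists.

#include <stdint.h>

/* Sketch: detect signed-add overflow from the inputs and the output.
   The wrapped result is computed in unsigned arithmetic to stay
   well-defined in C. */
static int add_overflows(int32_t a, int32_t b)
{
    int32_t r = (int32_t)((uint32_t)a + (uint32_t)b);
    /* Overflow iff both operands have the same sign and the result's
       sign differs from it. */
    return ((~(a ^ b)) & (a ^ r)) < 0;
}

This is exactly the sort of thing that writes a tag bit in the reorder
buffer rather than gating GPR writeback.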

Now what would be very unpopular with the CPU guys would be
instructions that monkey around with the dataflow inside the ALU.
I skimmed the description of the Sparc tagged adds, but they
sounded like just the kind of thing I'd want to kick out of the
hardware, because getting data through the ALU really is the
common case.

Heck, I'd like to get rid of sign extension on loads. In an earlier
proposal, I wanted to bolt an ALU (including shifter) onto the end of
the load pipe, so that the op after the load could be scheduled with
the load in one go. The trouble is that raw pointer chasing is just
too popular, and you don't want the load pipe latency dinking back
and forth between two values.

Side note: earlier in this thread people seemed to be having trouble
with the difference between jumps/branches and exceptions. On OoO
CPUs, there is one relevant distinction: predicted versus nonpredicted
control flow. For instance, it might be totally reasonable for the
processor to predict TLB faults on certain load instructions, and
avoid the double pipe flush by predicting the exception.

So... using exceptions to get out of loops doesn't change the problem
that the core faces.

Now, a separate issue is how that control flow is encoded. It is
definitely the case that instruction fetch engines are having a
great deal of difficulty with all these branches. Once predicted,
verifying the predictions is actually not too bad, which is why trace
caches are so enticing.

From: JJ on
John Mashey wrote:
> David Hopwood wrote:
> > andrewspencers(a)yahoo.com wrote:
> > > Terje Mathisen wrote:
>
> > A slightly different situation is where you have code that in practice
> > always handles integers that fit in a single word, but that can't be
> > statically guaranteed to do so, and the language specification says that
> > bignum arithmetic must be supported -- the obvious example being Smalltalk.
> > There were some attempts to support this in hardware (e.g. "Smalltalk on
> > a RISC"; also something on SPARC that I can't remember the details of),
> > but it turned out to be easier and faster for implementations of Smalltalk
> > and similar languages to use other tricks that don't require hardware support.
>

snipping

>
> Anyway, it's pretty clear that relevant mechanisms were being discussed
> ~20 years ago, but nobody seems to have figured out features that
> actually make implementation sense. I'd be delighted to see a
> well-informed proposal that had sensible hardware/software
> implementations and really helped LISP/Smalltalk/ADA and hopefully
> other languages...

I suspect that in current single-threaded processor designs, clocked
to the max, with the current cache model, such a proposal would be
hard to come by and hard to justify, especially when the memory wall
forces such extreme locality of reference and so many wait states.

A processor designed solely around communicating sequential processes
running on multiple MTAs can hide memory latency fairly well (as is
well known).

By sharing a high-issue-rate RLDRAM, with say 200M-400M interleaved
load/stores per second, driven by a nice hash box that destroys all
locality of reference across the numerous PE requests and reduces bank
collisions to random chance, object support comes naturally. The
hashing takes 32b Object-MMU IDs and hashes them with a 32b linear
index down to the particular PA size. Object IDs are generated by
new[] using a PRNG. MMU IDs are enumerated at boot time over Links. A
32MByte RLDRAM can appear to store up to 1M single-line objects, more
typically <<100k objects of all types and sizes. By trading space for
rehashes, performance can be kept good. Message object IDs are passed
around through channels synchronized by !,?. Besides occam support,
ADA, Lisp, and Smalltalk support come to mind all the time.
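
For concreteness, a minimal C sketch of the kind of hash box I mean.
The names and constants here are illustrative guesses, not the actual
design:

#include <stdint.h>

/* Illustrative sketch only.  Mix a 32b object ID with a 32b linear
   index into a physical line number, bumping a probe count on
   collision so that bank conflicts and rehashes degrade to random
   chance. */
#define PA_LINES (1u << 20)          /* e.g. 32MB RLDRAM / 32B lines */

static uint32_t hash_line(uint32_t obj_id, uint32_t index,
                          uint32_t probe)
{
    /* Multiplicative mixing deliberately destroys locality. */
    uint32_t h = (obj_id ^ (index * 0x9E3779B9u))
                 + probe * 0x85EBCA6Bu;
    h ^= h >> 16;
    return h & (PA_LINES - 1);       /* fold to the particular PA size */
}

A lookup probes hash_line(id, i, 0), then probe 1, 2, ... until the
stored (id, i) tag matches; keeping memory below ~70% full keeps the
expected number of rehashes small.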

Object support in hardware goes down to a very fine-grain level (32
byte pages or lines), with full protection of all object lines. It
makes lists, sparse arrays, and hash tables a snap; they all fit right
on top of each other, all Mashed up, as long as memory is < say 70%
full. The MMU model can be tested out in a compiler for its own object
store, but this test is only single threaded.

For more performance the scheme can be replicated at the lower (sub
ns) and higher (50ns) levels, for raw flat-memory throughput or
volume. At the sub-ns level, it allows say 16-way interleaved, N-cycle
concurrent SRAM banks to appear to have the performance of the MMU
issue box, even with relatively slow SRAMs (or maybe even 5ns DRAM).
At the other end, the SDRAM controller has little throughput, but its
latency is only a few times that of RLDRAM.

You can take the wait states of a few huge processors, or the numerous
hardware threads of many simple processors; I'll take many threads
anytime. In this scheme it's the MMU that's really interesting; the
PEs are just little grunt boxes there to generate enough memory
requests to keep the MMU near 100%. Even the PE ISA doesn't matter
much; a 486 or RISC ISA would work as well as anything else with the
extra parallel support.

Anyway, I will describe it at CPA2005 for anyone interested.

johnjakson at usa dot ...

From: John Mashey on

Iain McClatchie wrote:
> Mash> There is a great deal of pushback in introducing features that
> Mash> might add gate delays in awkward places, of which two are:
> Mash> a) Something only computable on the *output* of an ALU
> Mash> operation
> Mash> b) The result of a load operation
>
> Mash> In many implementations, such paths may be among the critical
> Mash> paths. Sometimes, the need to get a trap indication from an
> Mash> ALU, FP ALU, Load/store unit to the instruction fetch unit
> Mash> may create a long wire that causes serious angst, or yelling
> Mash> in design meetings.
>
> Hmm... a feature that hangs some logic on the output of the ALU or
> load pipe, and causes a pipe flush and IF retarget if the logic
> detects some condition.
>
> I don't think this is a problem, Mash. We're already doing this
> for integer overflow and various floating-point exceptions. Suppose
> for a moment that the additional complexity of the feature added a
> pipe stage to this recurrence... in an OoO core, who cares? GPR
> writeback is unaffected, you just have more logic writing to the tag
> bits in the reorder buffer.

Of course (i.e., it might not matter in an OoO), but you may have
missed the careful weasel-words "In many implementations". After all,
of the horde of distinct pipeline implementations that have ever
existed, only a tiny fraction are OoO...

For what it's worth, there was some argument about this (overflow in
R2000) in 1985, because it was literally the *only* integer exception
that needed to be detected after the ALU stage, and in time to inhibit
register writeback, and somebody was worried about a possible extra
delay for a while.

> Now what would be very unpopular with the CPU guys would be
> instructions that monkey around with the dataflow inside the ALU.
> I skimmed the description of the Sparc tagged adds, but they
> sounded like just the kind of thing I'd want to kick out of the
> hardware, because getting data through the ALU really is the
> common case.
Again, I don't think the SPARC tagged ops are so bad, because they just
look at two bits each of the two inputs, so one can detect the trap
early.
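
For readers who haven't seen them, here is roughly what the trapping
tagged add (TADDccTV) means, as a C sketch of the semantics rather
than the hardware; raise_tag_trap() is a hypothetical stand-in for the
trap:

#include <stdint.h>

extern void raise_tag_trap(void);    /* hypothetical trap hook */

/* A word is a fixnum only if its low 2 tag bits are 00, so the trap
   condition needs just four input bits and is known well before the
   adder finishes. */
static int32_t tagged_add(int32_t a, int32_t b)
{
    if (((a | b) & 3) != 0)          /* either operand not a fixnum */
        raise_tag_trap();
    /* The real instruction also traps on signed overflow; the wrapped
       sum below ignores that detail. */
    return (int32_t)((uint32_t)a + (uint32_t)b);
}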

>
> Heck, I'd like to get rid of sign extension on loads. In an earlier
> proposal, I wanted to bolt an ALU (including shifter) onto the end of
> the load pipe, so that the op after the load could be scheduled with
> the load in one go. The trouble is that raw pointer chasing is just
> too popular, and you don't want the load pipe latency dinking back
> and forth between two values.

You hardware guys are all alike [in hating sign-extension on loads]
:-).
We seriously looked at various schemes found elsewhere, i.e., where one
loads zero-extended partial-word data, and then uses an explicit EXT to
sign-extend. We had enough data to prefer having both zero-extend and
sign-extend as operations, and if push had really come to shove, I
would have lived with an explicit EXT, although having done 68K
compiler work, and dealt with some of the funny optimization hassles
(i.e., can one get correct results without the EXT, sometimes?) I
certainly preferred to have the signed-load opcodes as first choice.
My second choice would have been 2-cycle load-signeds. Third choice
was the explicit EXT.
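
As a concrete sketch of the alternatives (my example, not from the
original discussion): with signed-load opcodes the hardware folds the
sign extension into the load itself; without them, the compiler emits
an explicit EXT after a zero-extending load, something like:

#include <stdint.h>

/* The "explicit EXT" scheme: the load delivers a zero-extended byte,
   and a separate ALU op recreates the sign.  Written portably here;
   on a real ISA it is a shift pair or a single EXT instruction. */
static int32_t sext8(uint32_t loaded) /* loaded = zero-extended byte */
{
    return (int32_t)((loaded & 0xFFu) ^ 0x80u) - 0x80;
}

The 2-cycle load-signed option does the same fold in hardware but
charges an extra cycle for it.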

From: Seongbae Park on
John Mashey <old_systems_guy(a)yahoo.com> wrote:
....
> You hardware guys are all alike [in hating sign-extension on loads]
>:-).

I haven't met a hardware guy who likes that, either.

> We seriously looked at various schemes found elsewhere, i.e., where one
> loads zero-extended partial-word data, and then uses an explicit EXT to
> sign-extend. We had enough data to prefer having both zero-extend and
> sign-extend as operations, and if push had really come to shove, I
> would have lived with an explicit EXT, although having done 68K
> compiler work, and dealt with some of the funny optimization hassles
> (i.e., can one get correct results without the EXT, sometimes?)
> I certainly preferred to have the signed-load opcodes as first choice.
> My second choice would have been 2-cycle load-signeds.

Well, if the sign-extend version takes more cycles than the
zero-extend - I suppose your second choice meant such a case -
it creates the same funny optimization hassle,
and such an optimization brings occasional bug reports that cry wolf
over a zero-extend load that correctly replaced a sign-extend load
("It's a signed char in my code.
Why is the compiler using a zero-extend load?
The compiler must be buggy!").
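
To make the cry-wolf case concrete (my example): here the sign bits
are masked off before use, so a zero-extend load is equivalent and, on
such an implementation, a cycle cheaper:

/* The compiler may legally use a zero-extend load here even though
   the declared type is signed. */
int low_bits(signed char *p)
{
    return *p & 0xFF;
}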

And since ISAs usually don't define exact cycle counts, nor do they
require two operations to take the same number of cycles or the same
issue/execution/etc. resources, implementations of ISAs that have both
versions tend to take an extra cycle for the sign-extend load.

> Third choice was the explicit EXT.
--
#pragma ident "Seongbae Park, compiler, http://blogs.sun.com/seongbae/"
From: Nick Maclaren on
In article <1126658582.506368.173210(a)g43g2000cwa.googlegroups.com>,
John Mashey <old_systems_guy(a)yahoo.com> wrote:
>
>For what it's worth, there was some argument about this (overflow in
>R2000) in 1985, because it was literally the *only* integer exception
>that needed to be detected after the ALU stage, and in time to inhibit
>register writeback, and somebody was worried about a possible extra
>delay for a while.

Why on earth was that? I.e. why should it need to inhibit register
writeback? MIPS is two's complement, and the only real advantage of
that is that it enables writeback and overflow flagging to be done
in either order.

If the architecture specified that writeback did not occur if overflow
occurred, then the designers weren't thinking about that aspect. It
isn't as if it wasn't an ancient problem, after all.


Regards,
Nick Maclaren.