interrupting for overflow and loop termination [Computer Architecture]

From: John Mashey on 14 Sep 2005 06:09

Seongbae Park wrote:
> John Mashey <old_systems_guy(a)yahoo.com> wrote:
> ...
> > You hardware guys are all alike [in hating sign-extension on loads]
> >:-).

> Well, if the sign-extend version takes more cycles than zero-extend
> - I suppose your second choice meant such a case -
> it creates the same funny optimization hassle
> and such an optimization accompanies occasional bug reports that cry wolf
> over the zero-extend load that correctly replaced sign-extend load
> ("It's a signed char in my code.
> Why is the compiler using a zero-extend load ?
> The compiler must be buggy!").

Yes, but the complaints are much worse when people disassemble code and
see a bunch of EXTs that are clearly unnecessary, i.e., visible
instructions almost always get more attention/flak/whinging than slow
instructions, unfortunately. I spent some time tuning a 68K compiler
years ago at Convergent, and this kind of thing came up, and it wasn't
trivial to fix at the time, and get it right, at least in pcc.

From: John Mashey on 14 Sep 2005 06:22

Seongbae Park wrote:

> I think LISP/Smalltalk/ADA market is just too small to justify
> adding any significant change in the general purpose ISA,
> unless this yet-to-be-invented mechanism is easy and cheap
> to implement or it is for some other purpose which happened to
> help them (like a fast user-level trap).

Well, that's why we never did it. We certainly couldn't justify
expensive features for that market, but we hoped to find modest useful
ones that might be general enough to have other uses as well. Maybe if
we could have afforded another 6 months to do the original MIPS-I ISA,
we might have thought of something reasonable, but after that, it was
probably too late. Nothing very complex would have fit in the R2000 in
any case, although I would have given up a few TLB entries had we
gotten a good solution here.

From: Scott A Crosby on 14 Sep 2005 15:21

On 13 Sep 2005 08:33:17 -0700, "John Mashey" <old_systems_guy(a)yahoo.com> writes:

> I wished for something general enough to:

> a) Fix alignment errors, i.e., one would like to be able to run a
> binary with/without alignment checking. [Recall that MIPS could handle
> alignment errors, but needed a recompile to use LWL/LWR, etc.].

> b) Be able to trap unimplemented instructions, i.e., like
> floating-point operations on original MIPS R2000, before the FPU was
> available, or for machines that didn't have one, rather than doing
> coprocessor-unusable traps. Also, one might do not-yet-implemented
> instructions, like sqrt (which was not there in MIPS-I, but added
> later). One might consider doing integer mul/div this way, where some
> designs had them, and some didn't.
> c) Likewise, support for parts of IEEE FP that one didn't want to do in
> hardware.

> A) Managing binary compatibility across a family whose implemented
> features vary. Note that a good mechanism would let you run binaries
> with new instructions on old systems, given the right emulation code.

About a month ago, during a discussion of mul/div on SPARC, someone
here suggested what I thought was a cute technique for doing
this. When the CPU tries to run an illegal instruction
and traps, the kernel backpatches the executable to jump to an
appropriate emulation routine. The compiler is required to always
follow such a not-universally-implemented instruction with enough
no-ops that there is always room for the backpatch. However, if the
binary is targeted only at hardware that has the instruction, the
compiler isn't required to generate the no-ops.

The ABI requires every binary to be linked with an appropriate
emulation library, which gives the kernel targets for the backpatched
jumps. The no-op space overhead might be reduced if the ISA included a
special save-and-jump instruction designed for this purpose.

On old hardware there's no loss in performance beyond the emulation
itself: the kernel takes one trap per unsupported instruction site, not
one per execution of that instruction. On new hardware the
cost is a few extra no-ops, and software targeting only new hardware
doesn't even pay the no-op overhead.
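
A toy model of the trap-once behaviour described above, as a minimal sketch rather than anything a real kernel does. The ToyCPU class, the tuple instruction encoding, and the call_emul opcode are all invented for illustration; a real backpatch would overwrite machine code, and the jump sequence would spill into the no-op padding.

```python
def emulate_mul(a, b):
    """Shift-and-add multiply, standing in for the emulation library routine."""
    negative = (a < 0) != (b < 0)
    a, b = abs(a), abs(b)
    result = 0
    while b:
        if b & 1:
            result += a
        a <<= 1
        b >>= 1
    return -result if negative else result


class ToyCPU:
    """Hypothetical machine: one accumulator, a program of (opcode, arg) slots."""

    def __init__(self, program, has_mul):
        self.program = list(program)  # the "executable", patched in place
        self.has_mul = has_mul        # does this implementation have hardware mul?
        self.acc = 1
        self.traps = 0

    def run(self):
        pc = 0
        while pc < len(self.program):
            op, arg = self.program[pc]
            if op == "mul" and not self.has_mul:
                self._trap_and_backpatch(pc)
                continue              # re-execute the freshly patched slot
            if op == "mul":
                self.acc *= arg
            elif op == "call_emul":
                self.acc = emulate_mul(self.acc, arg)
            elif op != "nop":
                raise ValueError("unknown opcode: " + op)
            pc += 1

    def _trap_and_backpatch(self, pc):
        """Illegal-instruction trap handler: patch the site so it never traps again."""
        self.traps += 1
        # The compiler contract: padding must follow the optional instruction.
        if pc + 1 >= len(self.program) or self.program[pc + 1][0] != "nop":
            raise RuntimeError("no padding after unimplemented instruction")
        _, arg = self.program[pc]
        self.program[pc] = ("call_emul", arg)


old = ToyCPU([("mul", 3), ("nop", None)], has_mul=False)
for _ in range(5):
    old.run()
print(old.acc, old.traps)  # prints: 243 1
```

Five executions of the same site cost one trap in this model; the remaining four go straight to the emulation routine.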

Scott

From: Eliot Miranda on 14 Sep 2005 17:59

John Mashey wrote:
> David Hopwood wrote:
>
>>andrewspencers(a)yahoo.com wrote:
>>
>>>Terje Mathisen wrote:
>
>
>>A slightly different situation is where you have code that in practice
>>always handles integers that fit in a single word, but that can't be
>>statically guaranteed to do so, and the language specification says that
>>bignum arithmetic must be supported -- the obvious example being Smalltalk.
>>There were some attempts to support this in hardware (e.g. "Smalltalk on
>>a RISC"; also something on SPARC that I can't remember the details of),
>>but it turned out to be easier and faster for implementations of Smalltalk
>>and similar languages to use other tricks that don't require hardware support.
>
>
> Yes.
> 1) There was Berkeley SOAR as noted, and SPARC included ADD/SUB Tagged,
> which used the high-30 bits as integers, and the low 2 bits as tags; if
> either low 2-bit field were non-zero, it trapped.
>
> 2) ~1988, while working on MIPS-II, I/we spent a lot of time talking
> with Smalltalk & LISP friends, potential customers, etc, asking:
> "Are there any modest extensions that would help you a lot, and would
> be reasonable to implement?"
>
> Short answer: NO.
>
> Longer answer:
> a) They said either give them a complete, tailored solution [which they
> didn't expect], or just make the CPU run fast, but don't bother with
> minor enhancements. Some said they knew about the SPARC feature, but
> didn't use it.

This would include Peter Deutsch and the design of HPS, his second dynamic
translation (JIT) VM. The tag pattern for immediate integers was
already chosen to be 11, and changing it just for SPARC, when the
performance boost would be below 10% in all but micro-benchmarks, just
wasn't worth it. However, had the SPARC designers allowed the
trap mask to be a variable part of per-thread state or, even better, to
be specified in the instruction itself (eliminating problems combining
different language implementations in one program), then we would have
made use of it (certainly code exists to use it).

The most convenient design would be not a trap but a branch or skip,
something like "add and skip on overflow or if either operand's tag
pattern doesn't match X". With 64-bit implementations one would
also want to specify the width of the tag field (one bit would suit HPS;
its 32-bit and 64-bit implementations use a single bit to tag immediate
integers).

> b) Some said: they were all doing fairly portable versions, had learned
> a lot of good tricks, and minor improvements that required major
> structural changes just weren't worth it.

Hence the need for any such instructions to provide flexibility and not
dictate particular bit patterns...

[snip]

> Anyway, it's pretty clear that relevant mechanisms were being discussed
> ~20 years ago, but nobody seems to have figured out features that
> actually make implementation sense. I'd be delighted to see a
> well-informed proposal that had sensible hardware/software
> implementations and really helped LISP/Smalltalk/ADA and hopefully
> other languages...

We could use a tagged add/sub that skips on overflow or tag mismatch, and
a tagged compare that skips on tag mismatch, where the tag field can be
flexibly specified to suit both 32-bit implementations (tags typically in
the least significant two bits) and 64-bit implementations (tags typically
in the least significant three or four bits).

If 6 bits were dedicated to the tag specification, two would give the size
of the tag field:
  00 -> least significant bit
  01 -> least significant two bits
  10 -> least significant three bits
  11 -> least significant four bits
The remaining four bits would specify the required tag pattern, with any
bits in excess of the tag size being ignored.

The two operands would be interpreted as 2's complement signed integers
in the remaining non-tag bits. The add/sub instructions would skip or
annul the following instruction if either operand's tag pattern didn't
match the tag specification or if the result overflowed. The compare
instructions would skip or annul the following instruction if either
operand's tag pattern didn't match the tag specification.

The value in the result register of the tagged add/sub would have the
same tag pattern as the operands. The result value is undefined on overflow
or tag mismatch (i.e. I don't think one would typically be interested in
the result).

Code sequences for polymorphic add/sub or compare would then look like

        fetch operand one
        fetch operand two
        tagged add/sub
        branch Ldone
        code for non-tagged case (method lookup)
        ...
Ldone:
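
The tagged add semantics and the fast-path/slow-path sequence above can be modelled roughly as follows. This is a sketch under assumptions: the 32-bit word size, the placement of the size code in the top two bits of the 6-bit specification, and the helper names (decode_tag_spec, box, polymorphic_add) are not given in the post and are chosen only for illustration. The undefined result is modelled as None.

```python
WORD_BITS = 32                      # assumed register width
WORD_MASK = (1 << WORD_BITS) - 1


def decode_tag_spec(spec):
    """Split a 6-bit tag spec into (tag_width, tag_pattern).

    Assumed layout: top two bits = size code, low four bits = pattern.
    """
    width = ((spec >> 4) & 0b11) + 1       # 00 -> 1 bit ... 11 -> 4 bits
    pattern = spec & ((1 << width) - 1)    # pattern bits beyond the width are ignored
    return width, pattern


def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a 2's complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def box(n, spec):
    """Build a tagged word holding integer n (hypothetical helper)."""
    width, pattern = decode_tag_spec(spec)
    return ((n << width) | pattern) & WORD_MASK


def tagged_add(a, b, spec):
    """Return (result, skip); skip is True on tag mismatch or overflow."""
    width, pattern = decode_tag_spec(spec)
    tag_mask = (1 << width) - 1
    if (a & tag_mask) != pattern or (b & tag_mask) != pattern:
        return None, True                  # tag mismatch: result undefined
    payload_bits = WORD_BITS - width
    total = to_signed(a >> width, payload_bits) + to_signed(b >> width, payload_bits)
    limit = 1 << (payload_bits - 1)
    if not -limit <= total < limit:
        return None, True                  # overflow: result undefined
    return ((total << width) | pattern) & WORD_MASK, False


def polymorphic_add(a, b, spec, slow_path):
    """Mirror of the sequence: tagged add, branch Ldone, else method lookup."""
    result, skip = tagged_add(a, b, spec)
    if not skip:
        return result                      # "branch Ldone" was executed
    return slow_path(a, b)                 # branch annulled: fall into slow path


SPEC_2BIT_11 = 0b010011                    # two-bit tag field, pattern 11
print(tagged_add(box(3, SPEC_2BIT_11), box(4, SPEC_2BIT_11), SPEC_2BIT_11))
# prints: (31, False)
```

The 31 in the output is the integer 7 shifted left by two with the tag 11 in the low bits.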

Code for compare sequences would depend on whether one needed to take a
conditional branch or produce a result. So one could use an instruction
that would skip the next two instructions on tag mismatch.

If tagged compare skips the next instruction then tagged compare for a
conditional branch might look like
        fetch operand one
        fetch operand two
        tagged compare
        branch Lcond
        code for non-tagged case (method lookup)
        ...
        compare result of non-tagged compare against TRUE value
        branch if equal Ltrue
        compare result of non-tagged compare against FALSE value
        branch if equal Lfalse
        call notBooleanError
Lcond:  branch on equal Ltrue
Lfalse:

If it skips the following two instructions then
        fetch operand one
        fetch operand two
        tagged compare
        branch if equal Ltrue
        branch Lfalse
        code for non-tagged case (method lookup)
        ...
        compare result of non-tagged compare against TRUE value
        branch if equal Ltrue
        compare result of non-tagged compare against FALSE value
        branch if equal Lfalse
        call notBooleanError

which isn't much of a saving...

One could also make use of a tagged add/sub immediate as there's a high
dynamic frequency of var + 1 in most (Smalltalk) programs. The
immediate value would omit the tag pattern and be shifted by the tag
size to increase useful range.
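
A rough model of the tagged add immediate, assuming a 32-bit word and a signed, untagged immediate; the function name and the choice to pass the tag width and pattern as plain parameters are illustrative only. It also shows why the immediate can omit the tag: shifting it by the tag width leaves zeros in the tag bits, so a plain add preserves the operand's tag.

```python
WORD_BITS = 32                      # assumed register width
WORD_MASK = (1 << WORD_BITS) - 1


def tagged_add_immediate(a, imm, tag_width, tag_pattern):
    """Return (result, skip); skip is True on tag mismatch or overflow."""
    tag_mask = (1 << tag_width) - 1
    if (a & tag_mask) != tag_pattern:
        return None, True                      # only the register operand has a tag
    payload_bits = WORD_BITS - tag_width
    payload = a >> tag_width
    if payload >> (payload_bits - 1):          # sign bit set: make it negative
        payload -= 1 << payload_bits
    total = payload + imm                      # imm is an untagged integer
    limit = 1 << (payload_bits - 1)
    if not -limit <= total < limit:
        return None, True
    return ((total << tag_width) | tag_pattern) & WORD_MASK, False


# var + 1 with a single-bit tag whose pattern is assumed to be 1
var = (41 << 1) | 1
result, skip = tagged_add_immediate(var, 1, tag_width=1, tag_pattern=1)
print(result >> 1, skip)                       # prints: 42 False

# Same word as adding the shifted immediate directly to the tagged operand
assert result == (var + (1 << 1)) & WORD_MASK
```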
--
_______________,,,^..^,,,____________________________
Eliot Miranda Smalltalk - Scene not herd

From: John Mashey on 15 Sep 2005 01:00

Eliot Miranda wrote:
> John Mashey wrote:

> > Longer answer:
> > a) They said either give them a complete, tailored solution [which they
> > didn't expect], or just make the CPU run fast, but don't bother with
> > minor enhancements. Some said they knew about the SPARC feature, but
> > didn't use it.
>
> This would include Peter Deutsch and the design of HPS his 2nd dynamic
> translation (JIT) VM.

Lots of good details deleted...

1) The suggestions would probably fit HP PA better than MIPS, as it has
extensive "annul-next-instruction" features.

2) My comment above was indeed a paraphrase of Peter's comments,
although somewhat similar thoughts came from others as well.
