From: Frode Vatvedt Fjeld on
Terje Mathisen <terje.mathisen(a)hda.hydro.com> writes:

> At the ADD site:
>
> add eax,[esi]
> jo local_handler
> ...
> local_handler:
> call global_handler
> jmp return_to_normal_code
>
> Yeah, it seems like this would use 9 bytes of code, vs just 1 for
> INTO.

Right, provided the local_handler can be located within the
signed-8-bit branch range. If not, the INTO savings balloon to 14 bytes.
(Actually, for my particular architecture, the call will be either 3
or 6 bytes (i.e. a register-indirect call with an 8- or 32-bit offset,
and perhaps even a segment-override prefix), yielding the range 2+3+2=7
to 5+7+5=17 bytes vs. the 1-byte INTO.)

BTW: does anyone actually know something about the cost of the x86
INTO (when OF=0), especially relative to a conditional branch?

--
Frode Vatvedt Fjeld
From: John Mashey on
David Hopwood wrote:
> andrewspencers(a)yahoo.com wrote:
> > Terje Mathisen wrote:

> A slightly different situation is where you have code that in practice
> always handles integers that fit in a single word, but that can't be
> statically guaranteed to do so, and the language specification says that
> bignum arithmetic must be supported -- the obvious example being Smalltalk.
> There were some attempts to support this in hardware (e.g. "Smalltalk on
> a RISC"; also something on SPARC that I can't remember the details of),
> but it turned out to be easier and faster for implementations of Smalltalk
> and similar languages to use other tricks that don't require hardware support.

Yes.
1) There was Berkeley SOAR as noted, and SPARC included ADD/SUB Tagged,
which used the high 30 bits as an integer and the low 2 bits as a tag; if
either operand's low 2-bit field was non-zero, it trapped.
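
As a rough illustration, here is a C sketch (not SPARC code) of the data
representation those tagged instructions assumed; the macro and function
names below are purely illustrative:

#include <stdint.h>

/* A 32-bit tagged word: the upper 30 bits hold a small integer, the low
   2 bits are a type tag, with 00 meaning "small integer" (fixnum). */
#define TAG_MASK   0x3
#define FIXNUM_TAG 0x0

static int32_t box_fixnum(int32_t n)   { return (int32_t)((uint32_t)n << 2); }  /* n assumed to fit in 30 bits */
static int32_t unbox_fixnum(int32_t w) { return w >> 2; }   /* arithmetic shift on typical compilers */
static int     is_fixnum(int32_t w)    { return (w & TAG_MASK) == FIXNUM_TAG; }

The trapping tagged add checked both operands' tag bits and the 32-bit
signed overflow of the shifted values in a single instruction, trapping to
the runtime's generic/bignum path if either check failed.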

2) ~1988, while working on MIPS-II, I/we spent a lot of time talking
with Smalltalk & LISP friends, potential customers, etc., asking:
"Are there any modest extensions that would help you a lot, and would
be reasonable to implement?"

Short answer: NO.

Longer answer:
a) They said either give them a complete, tailored solution [which they
didn't expect], or just make the CPU run fast, but don't bother with
minor enhancements. Some said they knew about the SPARC feature, but
didn't use it.

b) Some said: they were all doing fairly portable versions, had learned
a lot of good tricks, and minor improvements that required major
structural changes just weren't worth it.

c) I spent some time with Alan Kay (of XEROX PARC/Smalltalk fame) on
this, i.e., were there features that would be substantially helpful?

The best general idea we could come up with was:
- A general low-overhead user-level trap mechanism (something I've
wished for many times for other reasons).
- Some kind of general mask/check mechanism that could generate such
traps on particular bit combinations, either:
- on completed address generation (maybe)
- on input value to ALU operation (maybe)
- on output value from ALU (bad)
- on output fetched by load instruction (bad)

But we were not able to generate a specific-enough proposal for
something that was sure to be really useful, and could be reasonable to
implement.
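
For contrast with the "low-overhead user-level trap" item above, here is a
minimal sketch, in C with POSIX signals, of the conventional heavyweight
route, in which an arithmetic trap reaches user code only as a signal after
a round trip through the kernel (the handler body is purely illustrative):

#include <signal.h>
#include <string.h>

/* Illustrative handler: a real runtime would decode the faulting
   instruction and fall back to, say, bignum arithmetic. */
static void on_overflow_trap(int sig)
{
    (void)sig;
    /* ... fix-up work would go here ... */
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_overflow_trap;
    /* On ISAs whose signed adds trap on overflow, Unix typically
       delivers the trap to the process as SIGFPE. */
    sigaction(SIGFPE, &sa, NULL);
    return 0;
}

The kernel round trip, plus signal-frame setup and teardown, is exactly the
overhead a low-overhead user-level trap mechanism would avoid.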

In that round we did add the TRAP instructions for MIPS-II, but they
were primarily for Ada, although they could be used elsewhere.

I mention this, because as often happens, people throw around ideas for
features without having much concern for *serious* implementation
issues.

Implementors would not be happy designing a mechanism:
- that seems only useful for a few specific cases
- that is easily handled by normal user code
- that introduces an extra new trap type, that requires especially
efficient handling, because it's expected to be used essentially
in-line, and not as an error indicator.
- and that may well introduce gate delays in critical paths, even if
it's minimal hardware.

As I've posted many a time, traps are *notorious* for causing
implementation bugs in hardware or software, so people do their best
not to introduce new flavors of them unless strong evidence is provided
that they are needed or are really worth it for performance.

There is a great deal of pushback against introducing features that might
add gate delays in awkward places, of which two are:
a) Something only computable on the *output* of an ALU operation
b) The result of a load operation

In many implementations, such paths may be among the critical paths.
Sometimes, the need to get a trap indication from an ALU, FP ALU, or
load/store unit to the instruction-fetch unit may create a long wire
that causes serious angst, or yelling in design meetings.

Integer Overflow is one of a), but it's simple enough that it's not likely
to add gate delays. Nevertheless, it is something determinable only
late in the cycle, so many ISA designers have chosen not to have it be
trappable. The HP PA, MIPS, and Alpha designers all did choose the
minimalist approach, in which:
- There is no OVFL flag in the Condition Code ... because there is no
CC.
- There is a reasonably complete set of ADD / SUB operations, each with
2 flavors: arithmetic/signed and logical/unsigned. The former always
cause traps on overflow, the latter never do. Compilers generate the
latter for C unsigned, and for synthesis of complex addressing
arithmetic. This assumed that you wanted to make the normal case fast,
at the expense of needing multi-instruction sequences to get explicit
tests for overflow without trapping.
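
For illustration, here is a minimal C sketch (not actual compiler output) of
the kind of multi-instruction sequence such an explicit test needs when only
non-trapping adds and no overflow flag are available:

#include <stdint.h>

/* Signed overflow occurred iff the operands have the same sign and the
   sum's sign differs from theirs; all arithmetic here is the non-trapping
   (unsigned, ADDU-style) kind. */
static int add_overflows(int32_t a, int32_t b)
{
    uint32_t ua = (uint32_t)a, ub = (uint32_t)b;
    uint32_t sum = ua + ub;
    return ((~(ua ^ ub) & (ua ^ sum)) >> 31) != 0;
}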

Anyway, this does show that it is possible, across a wide range of
designs, to detect integer Overflow in a timely fashion. Likewise,
it's even less time-constrained (as SPARC does) to trap on bit-tests in
the input values. Finally, a lot of floating-point trap tests can be
done on the input, or use the MIPS trick of examining the inputs and
stalling if it cannot be sure the operation will complete without a trap
[discussed here earlier].

On the other hand, the kinds of features that I described above from the
discussion with Kay are much tougher. To be really useful, you'd want to have
something like:
- a mask register that specified which bits of a value should be
checked
- a compare register
- a flag to say whether to trap if equal or not equal, i.e.,:
if (flag) then
    { if ((value & mask) == (compare & mask)) then trap(); }
else
    { if ((value & mask) != (compare & mask)) then trap(); }

You could do this with (value XOR compare) & mask, but in any case, you
still need a comparator tree somewhere.
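
In software terms, and treating mask, compare, and the flag as plain
variables standing in for the hypothetical registers (trap() is likewise a
stand-in), the check amounts to:

#include <stdint.h>

extern void trap(void);   /* stand-in for the low-overhead user-level trap */

static void check_value(uint32_t value, uint32_t mask, uint32_t compare,
                        int trap_if_equal)
{
    /* ((value ^ compare) & mask) == 0  <=>  (value & mask) == (compare & mask),
       so one XOR, one AND, and a zero-detect (a comparator tree) suffice. */
    int equal = ((value ^ compare) & mask) == 0;
    if (trap_if_equal ? equal : !equal)
        trap();
}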

And then, you'd actually probably want several sets of
mask/compare/flag regs, and you'd need variants of operations that
would enable the checking.

Note, of course, that it *might* be plausible to use this for input
operands, although no one
would appreciate the extra read ports/bus loads, but at least the
checks could go on in parallel with the ALU operation. Nobody would be
very happy about doing this on the output of the ALU or load
instructions.

This is a LOT of mechanism, and so needs serious justification ... and
as for doing counter comparisons, no way.

The closest real designs get to this sort of thing, or to
lower-overhead loop control, are:

a) Special counter registers that help speed up branches, found in some
general ISAs.

b) Zero Overhead Loop (Buffers) found in some DSPs.

Anyway, it's pretty clear that relevant mechanisms were being discussed
~20 years ago, but nobody seems to have figured out features that
actually make implementation sense. I'd be delighted to see a
well-informed proposal that had sensible hardware/software
implementations and really helped LISP/Smalltalk/ADA and hopefully
other languages...

From: Jan Vorbrüggen on
> The best general idea we could come up with was:
> - A general low-overhead user-level trap mechanism (something I've
> wished for many times for other reasons).

Later, you mention the MIPS TRAP instruction(s)...is that along this line?
What about the VAX's CHMx with x=U (which passes an additional constant
parameter to the trap routine)? Couldn't one have a fast user-mode trap,
and then make sure that an address-alignment trap was handled that way?
(That would also go some way towards handling unaligned operands quicker.)

Jan
From: Seongbae Park on
John Mashey <old_systems_guy(a)yahoo.com> wrote:
> David Hopwood wrote:
>> andrewspencers(a)yahoo.com wrote:
>> > Terje Mathisen wrote:
>
>> A slightly different situation is where you have code that in practice
>> always handles integers that fit in a single word, but that can't be
>> statically guaranteed to do so, and the language specification says that
>> bignum arithmetic must be supported -- the obvious example being Smalltalk.
>> There were some attempts to support this in hardware (e.g. "Smalltalk on
>> a RISC"; also something on SPARC that I can't remember the details of),
>> but it turned out to be easier and faster for implementations of Smalltalk
>> and similar languages to use other tricks that don't require hardware support.
>
> Yes.
> 1) There was Berkeley SOAR as noted, and SPARC included ADD/SUB Tagged,
> which used the high-30 bits as integers, and the low 2 bits as tags; if
> either low 2-bit field were non-zero, it trapped.

And taddcctv (Tagged-ADD-and-set-CC-with-Trap-on-oVerflow)
has been deprecated in SPARC v9,
meaning the opcode will continue to work as specified,
but with no performance guarantee (i.e. it may be emulated entirely by software).
V9 specifically suggests replacing taddcctv
with taddcc (the non-trapping version, essentially just an addcc
whose overflow bit also reflects the tag check - hence it's not that expensive to implement)
followed by a branch-on-overflow-set (or by a trap-on-overflow-set).
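
As a rough software analogue of that recommended taddcc-plus-branch sequence
(a C sketch only: smalltalk_bignum_add is a made-up name for the runtime's
slow path, and __builtin_add_overflow is the GCC/Clang builtin):

#include <stdint.h>
#include <stdbool.h>

extern int32_t smalltalk_bignum_add(int32_t a, int32_t b);  /* hypothetical slow path */

static int32_t tagged_add(int32_t a, int32_t b)
{
    int32_t sum;
    bool bad_tag  = ((a | b) & 0x3) != 0;                /* the tag check taddcc folds in */
    bool overflow = __builtin_add_overflow(a, b, &sum);  /* taddcc's overflow bit */
    if (bad_tag || overflow)                             /* branch-on-overflow-set */
        return smalltalk_bignum_add(a, b);
    return sum;
}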

Of course, having been declared "deprecated" in 1992,
this baggage still has to be carried forward -
it's not possible to reclaim the opcode space taken by these instructions yet.
I don't know if SPARC will ever be able to do so.
Probably not until there's a SPARC v10, if that ever happens.

....
> Anyway, it's pretty clear that relevant mechanisms were being discussed
> ~20 years ago, but nobody seems to have figured out features that
> actually make implementation sense.

Probably because there's no such feature,
besides the fast general-purpose user-level trap mechanism you mentioned
in the snipped part of your post.

> I'd be delighted to see a
> well-informed proposal that had sensible hardware/software
> implementations and really helped LISP/Smalltalk/ADA and hopefully
> other languages...

I think the LISP/Smalltalk/ADA market is just too small to justify
adding any significant change to a general-purpose ISA,
unless this yet-to-be-invented mechanism is easy and cheap
to implement, or it serves some other purpose that happens to
help them (like a fast user-level trap).

Having said that, it would be interesting to see any good proposal in this area.
--
#pragma ident "Seongbae Park, compiler, http://blogs.sun.com/seongbae/"
From: andrewspencers on
John Mashey wrote:
> The best general idea we could come up with was:
> - A general low-overhead user-level trap mechanism (something I've
> wished for many times for other reasons).
Isn't a user-level trap mechanism effectively available in the upcoming
Vanderpool/Pacifica processors due at the end of this year/beginning of
next year? (Though I don't know how low-overhead it'll be.)
Since the processors are (supposedly) fully virtualizable, user-level
and supervisor-level code no longer have different views of the
processor. I.e. in contemporary non-virtualizable systems, user-level
code knows that it's running at user-level, and is designed to use only
user-level processor features (and the kernel running at
supervisor-level kills it if it misbehaves), but in a fully
virtualizable system, user-level code can be designed to run at
supervisor-level and use all processor features (including traps)
because it doesn't know (or care) that it's actually running at
user-level, and the kernel virtualizes the supervisor-level processor
features which the user-level code uses.

Although virtual machine monitors such as VMware do this, the critical
difference is that VMware running on standard x86 processors must
verify/interpret supervisor-level code running at user-level, so it's
only useful for coarse-grained virtualization (i.e. an entire virtual
machine running a standard OS), whereas a kernel running on the new
virtualizable x86 processors doesn't have to verify/interpret
supervisor-level code running at user-level since the processor traps
on all sensitive instructions, so it's useful for fine-grained
virtualization--i.e. regular programs, which run at user-level, can be
coded to run at supervisor level, and the virtualization has zero
performance impact except to the extent that those programs actually
use supervisor-level processor features.
VMware obviously is going to be written to take advantage of the new
processors, but my point here is that the host OS kernel could be
designed to take advantage of it too, to let even regular user-level
programs running on the host OS use supervisor-level features--e.g.
traps.