From: BGB / cr88192 on

"Mike Austin" <mike(a)mike-nospam-austin.com> wrote in message
news:GKmdnb3R96WKlqzRnZ2dnUVZ_s-dnZ2d(a)giganews.com...
> BGB / cr88192 wrote:
>> "Jacko" <jackokring(a)gmail.com> wrote in message
>> news:974bdba6-f158-4ab5-85eb-72ef53e745c4(a)k39g2000yqd.googlegroups.com...
>>> Just looked.
>>>
>>> My VM spec's at http://acebforth.googlecode.com
>>>
>>> Just 23 opcodes.
>>>
>
> A Tree-Based Alternative to Java Byte-Codes
> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.26.2124&rep=rep1&type=pdf
>
> 0 opcodes :)
>
> But on a more serious note, a tree can be plenty fast when implemented
> efficiently. It doesn't physically have to be a tree.
>

yep, this is a reasonable option...

actually, my existing RPNIL is relatively high-level (a bit higher-level
than JBC, and on similar ground to MSIL). however, it is still typically
stack-based, and ordering issues have been a problem in the past...

ASTs could work...

actually, this reminds me of having, in the past, designed binary XML
variants, most offering notably better density than WBXML (typically by
combining the XML serialization with an LZP / LZ + Markov-predictor
strategy).

little stops something similar from being used here, except that I hadn't
really considered transporting code in the form of ASTs: this still leaves a
lot of the compiler work on the client, and there is also the matter of
abstracting the AST from the source language (an AST is likely to be largely
specialized to a small number of fairly similar languages).

at a lower level, a little more language neutrality can be gained, but
granted there are other costs.


a stack-based IL is at least fairly fast and easy to interpret (AST
interpreters are usually slower), and the potential ordering issues are
annoying but not show-stoppers: in a compiler, the IL stack need not have
much in common with the real stack, since it may only model a virtual stack
existing within the compiler, rather than the actual underlying CPU
stack... for example, 'dup' and 'exch' are cheap, typically because they
don't actually "do" anything.

this is partly how my existing RPNIL compiler works...


however, the piece of code I am imagining is slightly atypical, as the use
of a table-based opcode system may hint at: an interpreter for this is not
likely to use the big-switch strategy so often used with interpreters...


or such...



From: Robbert Haarman on
On Sun, Jul 04, 2010 at 09:16:04AM -0700, BGB / cr88192 wrote:
>
> "Jacko" <jackokring(a)gmail.com> wrote in message
> news:974bdba6-f158-4ab5-85eb-72ef53e745c4(a)k39g2000yqd.googlegroups.com...
> > Just looked.
> >
> > My VM spec's at http://acebforth.googlecode.com
> >
> > Just 23 opcodes.
> >
>
> some of my main bytecodes tend to include a bit more...
>
> but, OTOH, handling statically typed C-family languages tends to result in a
> lot of opcodes.

Why is that? I have always designed my bytecodes to be either some tree form
or a RISC instruction set. In both cases, they have operated on machine
words, and the operations have been things like ALU operations, branching,
call, return, and goto.

Since these byte codes offer essentially the same operations that instruction
sets for actual CPUs offer, they can be used as compilation targets much
the same as actual, physical machine architectures, and the same techniques
for instruction selection, optimization, etc. apply. As with RISC instruction
sets, there is no notion of types, and languages can be compiled to these
bytecodes regardless of the source language's type system.

Apparently, your bytecode designs differ from mine so that the type system
has an impact on the number of operations offered by the bytecode. Could you
explain how that works, and what advantage it offers over an "everything
is a machine word" bytecode?

> but, yeah, the nifty point would be having a bytecode with a non-fixed
> opcode assignment, mostly so that it can be used with different interpreters
> or JIT machinery, without me having to force them all into using exactly the
> same numbering, ...

What would be the advantage of that? If two interpreters offer the same
operation, why not assign it the same code? The advantage of doing so would
be compatibility of bytecode across interpreters, as long as the interpreter
running the program supports all the used operations. Changing the numbers,
it seems to me, ensures that even the chance of compatibility is thrown out
of the window. I see that as a disadvantage. I fail to see what you gain
by it.

Regards,

Bob



From: Jacko on
Added 2 extra opcodes and altered one slightly. As the opcode set is
possibly going to be 'hardwired' at some stage, it makes sense to set
limits on what an opcode can do.

* Only one stack item can be created or destroyed.
* Only one return stack item can be created or destroyed.
* Only the top two stack items may be modified.
* Only the top return stack item may be modified.
* An implicit subroutine return stack prefetch is always performed,
with a return to threading.
* Other data or return stack items may be fetched.
* The zero valued machine word prefix on all opcodes exits from
threading and runs one opcode.

This keeps the virtual machine complexity down.

Packed binary format is UTF-16 -> UTF-8 (option) -> BWTS -> LZ77/
DEFLATE with start vector 0.

Wider word sizes are by UTF-32 -> UTF-8 mapping higher codes, and
folding the last bit. 111111xx (10xxxxxx) ^ 5 ...

There are seven reserved opcodes 25 to 31 for alternate language
support. All opcodes above 31 are undefined.

Cheers Jacko
From: BGB / cr88192 on

"Robbert Haarman" <comp.lang.misc(a)inglorion.net> wrote in message
news:20100705071043.GA2635(a)yoda.inglorion.net...
> On Sun, Jul 04, 2010 at 09:16:04AM -0700, BGB / cr88192 wrote:
>>
>> "Jacko" <jackokring(a)gmail.com> wrote in message
>> news:974bdba6-f158-4ab5-85eb-72ef53e745c4(a)k39g2000yqd.googlegroups.com...
>> > Just looked.
>> >
>> > My VM spec's at http://acebforth.googlecode.com
>> >
>> > Just 23 opcodes.
>> >
>>
>> some of my main bytecodes tend to include a bit more...
>>
>> but, OTOH, handling statically typed C-family languages tends to result
>> in a
>> lot of opcodes.
>
> Why is that? I have always designed my bytecodes to be either some tree
> form,
> or a RISC instructions set. In both cases, they have operated on machine
> words, and the operations have been things like ALU operations, branching,
> call, return, and goto.
>
> Since these byte codes offer essentially the same operations that
> instruction
> sets for actual CPUs offer, they can be used as compilation targets much
> the same as actual, physical machine architectures, and the same
> techniques
> for instruction selection, optimization, etc. apply. As with RISC
> instruction
> sets, there is no notion of types, and languages can be compiled to these
> bytecodes regardless of the source language's type system.
>
> Apparently, your bytecode designs differ from mine so that the type system
> has an impact on the number of operations offered by the bytecode. Could
> you
> explain how that works, and what advantage it offers over an "everything
> is a machine word" bytecode?
>

mine is a bit higher level, and so what the bytecode sees is not exactly the
same as the world the machine sees.

so, one doesn't have highly general-purpose instructions like mov or lea,
but rather a fair number of specialized instructions.

similarly, the ISA doesn't operate on "machine words"; rather, it operates
on opaque typed references or values, each of which has an undefined size at
the IL level (although there is a 'sizeof' opcode, ...).

in the native ISA, the actual values are managed by the lower-end compiler,
which figures out their size and representation. in an interpreter, it would
probably use references instead (I have yet to entirely decide between using
dynamically-typed references or fat-references, each has their tradeoffs).


examples:
loadindex/setindex: load or set an array index (accepts array, pointer, ...)
loadslot/setslot: load or set a struct/object slot (accepts struct, struct
pointer, object reference, ...)
call/call_s/callmethod/callmethod_s: perform calls to functions or methods,
where the generic form takes the reference on the stack, and the '_s' form
has a hardcoded name;
as well, load/store/loada/... for loading/storing variables to/from the
stack.

then there are opcodes like: inc2_s/dec2_s: add or subtract 2 from a named
variable.
then there are opcodes for various manners of conditional jump (a generic
"jump if true/false", followed by a number of opcodes which perform a
compare and a jump).

as well as the arithmetic opcodes, ...

these all lead to a fair number of opcodes in total...


>> but, yeah, the nifty point would be having a bytecode with a non-fixed
>> opcode assignment, mostly so that it can be used with different
>> interpreters
>> or JIT machinery, without me having to force them all into using exactly
>> the
>> same numbering, ...
>
> What would be the advantage of that? If two interpreters offer the same
> operation, why not assign it the same code? The advantage of doing so
> would
> be compatibility of bytecode across interpreters, as long as the
> interpreter
> running the program supports all the used operations. Changing the
> numbers,
> it seems to me, ensures that even the chance of compatibility is thrown
> out
> of the window. I see that as a disadvantage. I fail to see what you gain
> by it.

not really:
different interpreters can retain compatibility by instead using symbolic
binding, rather than relying on specific opcode numbers (really, it is not
too much different than doing DLL linkage).

in the naive case, this would likely mean either a renumbering table or a
function-pointer table in the interpreter, where each number is looked up
and assigned a spot in the re-index table (in this strategy, such a table
would likely be needed for every module or code-block, which could add a
little cost).

a slightly less naive case would be to remap it to an internal form (an
interpreter-specific form, likely more optimized for the specific
interpreter machinery, or JIT the code to native machine code), or, what I
had considered: pre-decode the opcodes into a linked-list structure, with
each "opcode" holding a function pointer to the code that executes it.


now, the main problem with fixed opcode numbers is that they lead to
much-increased "managerial" effort: if one adds an opcode in one place, they
also have to add it everywhere else (which can be a problem if one's
compiler machinery is getting bulky).


one can share tables between different components (say, the compiler and
interpreter sharing the same internal "opcodes" table), but this is dubious
WRT modularity. (note that "tables" are fairly central in a lot of my stuff,
and a lot of it tends to be partly machine-written as well...).


if a 3rd party were to also plug into this VM, they would either have to
reuse this table or maintain their own clone of it, meaning that
fixed-number opcodes either:
lead to increased dependencies between code, and hinder the creation of 3rd
party implementations;
or force code to stick rigidly to some centrally-assigned list of opcodes,
limiting the possible flexibility and creativity of 3rd parties (they can't
as easily add new features, or adapt the ISA to their own uses, since this
would risk clashing with the original ISA).


for example, one can use something like MSIL or JBC the same way MS or Sun
does, but given they make relatively few provisions for "generally"
extending the formats, there is no real way to flexibly add features without
breaking compatibility for any modules using these features, and there is no
real good way for multiple implementations to "share" features or ideas
(since, for what limited extensibility is offered, there is no real
standardized way to identify the type/nature of these extensions, nor for
one party's extensions to avoid clashing with another's...).

for example, for JBC this would likely mean having some defined coding and
authority for FE and FF extension opcodes, which would bog everything down,
or building a new mechanism and trying to gain support.

even in newer Sun stuff, they themselves have ended up resorting to piles of
hackery WRT extending the JVM's core facilities, since a single ill-conceived
extension basically means their tower can collapse (like in Jenga or
something), which is sub-optimal (not even x86 machine code has it this bad:
one can hack new opcodes into it all they want, so long as they pay some
attention to what Intel or AMD is doing...).


so, the idea is that if most things are defined within the IL (such as the
number and layout of tables, the names and encodings of opcodes, ...) then
little stops a 3rd party from adding or customizing features in a way which
doesn't break existing implementations (which can ignore the new features,
or throw an exception or similar if control reaches an unhandled opcode,
while still being able to otherwise work with the opcodes).

also, it would allow me, in my case, to use the same IL for somewhat
divergent VM parts (some stack-based, some potentially using other forms,
such as TAC/SSA or similar, or potentially even for things like "decompiled"
x86 machine code, or for transcribing JBC or MSIL, allowing additional
shared machinery to be used rather than having to maintain as much machinery
for these various edge cases...).


it also reduces the need for a lot of tools to have "complete" knowledge of
the format (for example, one can disassemble the bytecode while not having
some central opcode table, since the data needed to disassemble is already
in the IL).


or such...


From: Jacko on
> then there are opcodes like: inc2_s/dec2_s: add or subtract 2 from a named
> variable.
> then there are opcodes for various manner of conditional jumps (a generic
> "jump if true/false" followed by a number which perform a compare and a
> jump).

Things like 1+ 2* etc, I have avoided making opcodes for. The codes
above 31 are unused/undefined/always free for system defined
optimization. In that sense they are secondary optimization opcodes.
Only primary computation essentials and structural foundational
opcodes are to be considered for hard assignment. There are only 7
more slots for this purpose.

> >> but, yeah, the nifty point would be having a bytecode with a non-fixed
> >> opcode assignment, mostly so that it can be used with different
> >> interpreters
> >> or JIT machinery, without me having to force them all into using exactly
> >> the
> >> same numbering, ...
>
> > What would be the advantage of that? If two interpreters offer the same
> > operation, why not assign it the same code? The advantage of doing so
> > would
> > be compatibility of bytecode across interpreters, as long as the
> > interpreter
> > running the program supports all the used operations. Changing the
> > numbers,
> > it seems to me, ensures that even the chance of compatibility is thrown
> > out
> > of the window. I see that as a disadvantage. I fail to see what you gain
> > by it.
>
> not really:
> different interpreters can retain compatibility by instead using symbolic
> binding, rather than relying on specific opcode numbers (really, it is not
> too much different than doing DLL linkage).

In some senses a number could be a variable name.

> for example, one can use something like MSIL or JBC the same way MS or Sun
> does, but given they make relatively few provisions for "generally"
> extending the formats, there is no real way to flexibly add features without
> breaking compatibility for any modules using these features, and there is no
> real good way for multiple implementations to "share" features or ideas
> (since, for what limited extensibility is offered, there is no real
> standardized way to identify the type/nature of these extensions, nor for
> one parties' extensions to avoid clashing with others' extensions...).

Yes, the scope for avoiding name-space clashes when the names are numeric is
somewhat limited.

> for example, for JBC this would likely mean having some defined coding and
> authority for FE and FF extension opcodes, which would bog everything down,
> or building a new mechanism and trying to gain support.
>
> even in newer Sun stuff, they themselves have ended up resorting to piles of
> hackery WRT extending JVM's core facilities, since a simple ill-concieved
> extension basically means their tower can collapse (like in Jenga or
> something), which is sub-optimal (not even x86 machine code has it this bad,
> one can hack new opcodes into this all they want so long as they pay some
> caution to what Intel or Sun is doing...).

> it also reduces the need for a lot of tools to have "complete" knowledge of
> the format (for example, one can disassemble the bytecode while not having
> some central opcode table, since the data needed to disassemble is already
> in the IL).

Yes.

Cheers Jacko