From: Maxim S. Shatskih on
> IL is loaded, run through codegen, and then:
> A. textual ASM is produced, and run through an assembler, and linked into
> the running image;
> B. machine code is produced in-place.

So, ASM is the intermediate form of the IL->machine code translator.

And why this form? Maybe there are other, more effective ways?

> A. has the potential of a much cleaner implementation,

Matter of taste.

>... but at the cost that it is not as fast and, on average, there is more
> code (both in the codegen, + the code for the assembler);

Performance loss, added code complexity - all for a matter of taste.

--
Maxim S. Shatskih
Windows DDK MVP
maxim(a)storagecraft.com
http://www.storagecraft.com

From: Rod Pemberton on
"cr88192" <cr88192(a)hotmail.com> wrote in message
news:hofus4$a0r$1(a)news.albasani.net...
> basically, it is the question of whether or not a textual assembler is fast
> enough for use in a JIT context (I believe it is, and that one can benefit
> notably from using textual ASM here).
>

Is TCC when used as TCCBOOT fast enough in a JIT context? ! ? ! ...

We know that interpreters are a bit slower than compilers, and compilers do
take some time too. How fast is fast enough is very relative to 1)
generation of microprocessor, 2) size of files, 3) in-memory or on-disk, 4)
language complexity, etc.

> so, some tests:
> basically, I have tried assembling a chunk of text over and over again (in a
> loop) and figuring out how quickly it was pushing through ASM.

You may just be testing the OS's buffering abilities here...

> initially, I found that my assembler was not performing terribly well, and
> the profiler showed that most of the time was going into zeroing memory. I
> fixed this, partly both by reducing the size of some buffers, and in a few
> cases disabling the 'memset' calls.

Instead of memset()-ing entire strings, you might try just setting the first
char to a nul character: str[0]='\0'; It's not as safe, but if your code
is without errors, it shouldn't be an issue.

Instead of strcmp(), you can try switches on single chars, while progressing
through the chars needed to obtain the required info. Sometimes this works
because you only need one or two characters out of much longer keywords to
distinguish it from other keywords.

Character directed parsing can speed things up too. Determining what the
syntax component is, say integer or keyword, takes time. But, if you put a
character in front that indicates what follows, you don't have
to do that processing to determine if it's an integer or keyword. E.g., an
example from an assembler of mine:

.eax _out $255

"dot" indicates a register follows. "underscore" indicates instruction
follows. "dollar-sign" indicates a decimal integer follows. Each directive
character is passed to switch() which selects the appropriate parsing
operation. The parser doesn't have to determine _what_ "eax" or "out" or
"255" is. It "knows" from the syntax. That's a large part of parsing logic
eliminated. When you program, you know what the directive character is and
can easily insert the correct character. Code generators also "know" too -
since you coded it... It's just an inconvenience to type the extra
characters, if you're doing alot of assembly.

If you use memory instead of file I/O, processing will be faster. Linked
lists, esp. doubly linked, can also speed up in-memory processing.
Allocating memory in a single large block, instead of calling malloc()
repeatedly "as you go" or as needed, can simplify the arrangement of
objects in the allocated memory. It can eliminate pointers, reduce object
size, etc.

> I then noted that most of the time was going into my case-insensitive
> compare function, which is a bit slower than the case-sensitive compare
> function (strcmp).

Decide on one case, such as lowercase. That cuts your processing in half.
Use hash functions. They can eliminate multiple strcmp()'s. Try to only
strcmp() once, to eliminate possible collisions.

> and, as well, I guess the volumes of ASM I assemble are low enough that it
> has not been much of an issue thus far (I tend not to endlessly re-assemble
> all of my libraries, as most loadable modules are in HLL's, and binary
> object-caching tends to be used instead of endless recompilation...).

If it's low use, you can eliminate much code by removing checks. I.e., if
you know your compiler correctly emits registers "eax", "ebx", etc., don't
implement a check for invalid registers. Some people would call such
techniques "unsafe" programming - which is true. But, since the code is
used in a controlled environment and without any "garbage" input, it'll
speed things up if the code does less work such as safety checks.


Rod Pemberton


From: BGB / cr88192 on
[was responding to this earlier, but Windows blue-screened...].


"Rod Pemberton" <do_not_have(a)havenone.cmm> wrote in message
news:hognj6$orq$1(a)speranza.aioe.org...
> "cr88192" <cr88192(a)hotmail.com> wrote in message
> news:hofus4$a0r$1(a)news.albasani.net...
>> basically, it is the question of whether or not a textual assembler is fast
>> enough for use in a JIT context (I believe it is, and that one can benefit
>> notably from using textual ASM here).
>>
>
> Is TCC when used as TCCBOOT fast enough in a JIT context? ! ? ! ...

can't say, I have not used tcc.
I hear it compiles fairly fast though.


> We know that interpreters are a bit slower than compilers, and compilers do
> take some time too. How fast is fast enough is very relative to 1)
> generation of microprocessor, 2) size of files, 3) in-memory or on-disk, 4)
> language complexity, etc.

there is no disk IO here...


>> so, some tests:
>> basically, I have tried assembling a chunk of text over and over again (in a
>> loop) and figuring out how quickly it was pushing through ASM.
>
> You may just be testing the OS's buffering abilities here...
>

no OS involvement, only memory buffers...


>> initially, I found that my assembler was not performing terribly well, and
>> the profiler showed that most of the time was going into zeroing memory. I
>> fixed this, partly both by reducing the size of some buffers, and in a few
>> cases disabling the 'memset' calls.
>
> Instead of memset()-ing entire strings, you might try just setting the first
> char to a nul character: str[0]='\0'; It's not as safe, but if your code
> is without errors, it shouldn't be an issue.
>

the memory zeroing was mostly in my COFF writer, which initially used, and
zeroed, a fairly large temporary buffer.

I since both made the temp buffer smaller and disabled the memset, so this
is no longer an issue.


> Instead of strcmp(), you can try switches on single chars, while progressing
> through the chars needed to obtain the required info. Sometimes this works
> because you only need one or two characters out of much longer keywords to
> distinguish it from other keywords.
>

possible, however switches are in general a fairly awkward way to select
between tokens.


> Character directed parsing can speed things up too. Determining what the
> syntax component is, say integer or keyword, takes time. But, if you put a
> character in front that indicates what follows, you don't have
> to do that processing to determine if it's an integer or keyword. E.g., an
> example from an assembler of mine:
>
> .eax _out $255
>
> "dot" indicates a register follows. "underscore" indicates an instruction
> follows. "dollar-sign" indicates a decimal integer follows. Each directive
> character is passed to switch() which selects the appropriate parsing
> operation. The parser doesn't have to determine _what_ "eax" or "out" or
> "255" is. It "knows" from the syntax. That's a large part of the parsing
> logic eliminated. When you program, you know what the directive character
> is and can easily insert the correct character. Code generators "know"
> too - since you coded it... It's just an inconvenience to type the extra
> characters, if you're doing a lot of assembly.

this, of course, would break NASM syntax compatibility (as well as break a
lot of the code already existing within my codebase).


> If you use memory instead of file I/O, processing will be faster. Linked
> lists, esp. doubly linked, can also speed up in-memory processing.
> Allocating memory in a single large block, instead of calling malloc()
> repeatedly "as you go" or as needed, can simplify the arrangement of
> objects in the allocated memory. It can eliminate pointers, reduce object
> size, etc.

my assembler uses relatively few in-memory objects.
mostly, it is buffer operations...

no file IO is used here, as file-IO is teh-slow, and also makes little sense
for moving data from place-to-place within an app...


>> I then noted that most of the time was going into my case-insensitive
>> compare function, which is a bit slower than the case-sensitive compare
>> function (strcmp).
>
> Decide on one case, such as lowercase. That cuts your processing in half.
> Use hash functions. They can eliminate multiple strcmp()'s. Try to only
> strcmp() once, to eliminate possible collisions.

hashes are already used.
case-insensitive handling is used as my assembler was based some off of
NASM's syntax, and NASM is case-insensitive.

admittedly, there are some differences between them, and adding case
sensitivity would just be one more item on the list...


>> and, as well, I guess the volumes of ASM I assemble are low enough that
>> it has not been much of an issue thus far (I tend not to endlessly
>> re-assemble all of my libraries, as most loadable modules are in HLL's,
>> and binary object-caching tends to be used instead of endless
>> recompilation...).
>
> If it's low use, you can eliminate much code by removing checks. I.e., if
> you know your compiler correctly emits registers "eax", "ebx", etc., don't
> implement a check for invalid registers. Some people would call such
> techniques "unsafe" programming - which is true. But, since the code is
> used in a controlled environment and without any "garbage" input, it'll
> speed things up if the code does less work such as safety checks.


my main codegen is only one of the places which emits ASM.

there are many other things which emit ASM, as it is currently the main
language in use for dynamically-generated code fragments.


no error-checking code shows up significantly on the profiler though, and
most of my optimization effort was profiler-driven...


From: BGB / cr88192 on

"Robbert Haarman" <comp.lang.misc(a)inglorion.net> wrote in message
news:20100325201911.GB3453(a)yoda.inglorion.net...
> On Thu, Mar 25, 2010 at 12:22:02PM -0700, cr88192 wrote:

<snip>

>>
>> > To answer all the questions here, it would probably be a good idea to
>> > first come up with a definition of "fast enough", and then, if you find
>> > your program isn't fast enough by this definition, to profile it to
>> > figure out where it is spending most of its time.
>>
>> well, I know about my programs.
>> the question is, what about everyone else?...
>
> I wouldn't worry about that too much, unless your code generator is the
> most interesting part of what you are making. First, make it work. Then
> you can think about making it better - assuming you don't have more
> interesting things to tackle.
>

my code generator has been working for 3 years now...

the whole point of all of this would be if other people can/should use
textual assemblers rather than raw machine code.


I guess maybe the issue is partly about performance, and maybe 20 or 50MB/s
would be needed for it to be "fast enough"...


>> > Another question is why you would be going through assembly code at all.
>> > What benefit does it provide, compared to, for example, generating
>> > machine code directly? Surely, if speed is a concern, you could benefit
>> > from cutting out the assembler altogether.
>>
>> producing it directly is IMO a much less nice option.
>> it is barely even a workable strategy with typical bytecode formats, and
>> with x86 machine code would probably suck...
>
> I don't really see that. The way I see it, most of the work is in getting
> from what you have (presumably some instruction-set-independent source code
> or intermediate representation) to the instructions of your target platform.
> Once you are there, I think emitting these instructions as binary or as
> text doesn't make too much of a difference.

well, one needs different code to emit each form.

for textual ASM, it is a huge mass of "print" statements.
for raw machine code, likely it would be a mass of "*ct++=0xB8;" or similar.


API-driven assemblers are sort of middle-ground.

fooasm_mov_regreg(fooasm_eax, fooasm_ecx);
...


> I've written code to emit binary instructions for various targets, and,
> in my experience, it's not very hard. Sure, x86's ModRM is a bit tricky,
> but you write that once and then it will just sit there, doing its job.
> In the grand scheme of writing a compiler, this isn't a big deal.

it is not "tricky", it is tedious and it is nasty...

by the time one writes a function to handle ModRM, they will be tempted to
write a function for REX, and maybe for the opcodes, and soon enough they
are well on their way to having an assembler...


> Generating the instructions in binary form right away also makes it very
> easy to know exactly where your code ends up and what its size is, which
> may actually make it _easier_ to patch addresses into your code and
> make decisions about short vs. long jumps.

well, a very simple strategy works well enough: "jmp foo".

and the assembler figures out whether a long or short jump is needed...


>> admittedly, if really needed I could add a binary-ASM API to my assembler
>> (would allow using function calls to generate ASM), but this is likely to
>> be much less nice than using a textual interface, and could not likely
>> optimize jumps (likely short jumps would need to be explicit).
>
> My experience is that how nice the API is depends very much on the
> language you express it in. For example, I've tried to come up with a nice
> API for instruction generation in C, but never got it to the point where
> I was really happy with it. In a language which lets you write out a
> data structure in-line, preferably with automatic memory management and
> namespaces, this is much easier.

C is assumed here...


> It's the difference between, for example:
>
> n += cg_x86_emit_reg32_imm8_instr(code + n,
> sizeof(code) - n,
> CG_X86_OP_OR,
> CG_X86_REG_EBX,
> 42);
>
> and
>
> (emit code '(or (reg ebx) (imm 42)))
>

yep.

my assembler's original binary API wasn't too much different from the
above...

it was so horrible that originally I wrote the parser mostly to wrap these
horrid-looking API calls...


digging around, I eventually found some old code of mine (from jan 2007)
targeting this original API:

ASM_EmitLabel(ctx, "$incref");
ASM_OutOpRegImm(ctx, ASM_OP_MOV, ASM_EAX, (int)(&BS1_GC_IncRef));
ASM_OutOpReg(ctx, ASM_OP_CALL, ASM_EAX);
ASM_OutOpSingle(ctx, ASM_OP_RET);

ASM_EmitLabel(ctx, "$decref");
ASM_OutOpRegImm(ctx, ASM_OP_MOV, ASM_EAX, (int)(&BS1_GC_DecRef));
ASM_OutOpReg(ctx, ASM_OP_CALL, ASM_EAX);
ASM_OutOpSingle(ctx, ASM_OP_RET);

ASM_EmitLabel(ctx, "$incref_eax");
ASM_OutOpReg(ctx, ASM_OP_PUSH, ASM_EAX);
ASM_OutOpRegImm(ctx, ASM_OP_MOV, ASM_EAX, (int)(&BS1_GC_IncRef));
ASM_OutOpReg(ctx, ASM_OP_CALL, ASM_EAX);
ASM_OutOpReg(ctx, ASM_OP_POP, ASM_EAX);
ASM_OutOpSingle(ctx, ASM_OP_RET);

ASM_EmitLabel(ctx, "$decref_eax");
ASM_OutOpReg(ctx, ASM_OP_PUSH, ASM_EAX);
ASM_OutOpRegImm(ctx, ASM_OP_MOV, ASM_EAX, (int)(&BS1_GC_DecRef));
ASM_OutOpReg(ctx, ASM_OP_CALL, ASM_EAX);
ASM_OutOpReg(ctx, ASM_OP_POP, ASM_EAX);
ASM_OutOpSingle(ctx, ASM_OP_RET);



From: Alexei A. Frounze on
On Mar 25, 1:19 pm, Robbert Haarman <comp.lang.m...(a)inglorion.net>
wrote:
...
> It's the difference between, for example:
>
>   n += cg_x86_emit_reg32_imm8_instr(code + n,
>                                     sizeof(code) - n,
>                                     CG_X86_OP_OR,
>                                     CG_X86_REG_EBX,
>                                     42);
>
> and
>
>   (emit code '(or (reg ebx) (imm 42)))

Umm... Looks Lispy! :)

For fun I once implemented an x86 assembler (NASMish, but with much
less functionality) in Perl. It was pretty compact (~50KB of source
code). A C solution would've been much bigger; the performance relationship
would've been the opposite. Which is, nonetheless, to say that
domain-specific or task-oriented languages are a good thing.

Alex