From: Robbert Haarman on
Hi Rod,

On Sun, Mar 28, 2010 at 04:22:48AM -0400, Rod Pemberton wrote:
> "Robbert Haarman" <comp.lang.misc(a)inglorion.net> wrote in message
> news:20100328074138.GA3467(a)yoda.inglorion.net...
> >
> > First, some data from /proc/cpuinfo:
> >
> > model name : AMD Athlon(tm) Dual Core Processor 5050e
> > cpu MHz : 2600.000
> > cache size : 512 KB
> > bogomips : 5210.11
> >
>
> Unrelated FYI, your BogoMips should be twice that for that cpu. I suspect
> you listed it for _one_ core, as /proc/cpuinfo does.

Yes. These lines are taken from /proc/cpuinfo, and are for one of the two
cores. The BogoMIPS rating for both cores taken together is indeed twice
that.

Note that the benchmark I ran uses only a single core. I also performed
my calculations as if there were only a single core. That is, the 24 cycles
per generated instruction are those of the core generating the code;
cycles of the core that is sitting idle are not taken into account.

Alchemist is not currently thread-safe, because of two pieces of global
state: a mode which can be set to 32-bit or 16-bit, and an error variable.
It would not be hard to make this state thread-local, or indeed to change
the interface so that global state is eliminated entirely, but I am no
longer working on Alchemist, and even if I were, this change wouldn't be
very high on my priority list.

Regards,

Bob



From: BGB / cr88192 on

"Rod Pemberton" <do_not_have(a)havenone.cmm> wrote in message
news:homvnn$f7j$1(a)speranza.aioe.org...
> "BGB / cr88192" <cr88192(a)hotmail.com> wrote in message
> news:homqsi$s25$1(a)news.albasani.net...
>> [...]
>> 10MB/s (analogue) can be gained by using a direct binary interface (newly
>> added).
>> in the case of this mode, most of the profile time goes into a few
> predicate
>> functions, and also the function for emitting opcode bytes. somehow, I
> don't
>> think it is likely to be getting that much faster.
>>
>
> A few years ago, I posted the link below for large single file programs
> (talking to you...). I'm not sure if you ever looked their file sizes,
> but
> the largest two were gcc as a single file and an ogg encoder as a single
> file, at 3.2MB and 1.7MB respectively. Those are probably the largest
> single file C programs you'll see. It's possible, even likely, some
> multi-file project, say the Linux kernel etc., is larger. But, 10MB/s
> should still be very good for most uses. But, there's no reason to stop
> there, if you've got the time!
>
> http://people.csail.mit.edu/smcc/projects/single-file-programs/
>

now that you remind me, I do remember them somewhat, but not much...


>> stated another way: 643073 opcodes/second, or about 1.56us/op.
>> calculating from CPU speed, this is around 3604 clock cycles / opcode
>> (CPU
> =
>> 2.31 GHz).
>
> BTW, what brand of cpu, and what number of cores are being used?
>

AMD Athlon 64 X2 4400.
however, all this runs in a single thread, so the number of cores doesn't
affect things much.


internally, it runs at 2.31 GHz I think, and this becomes more notable when
doing some types of benchmarks.

my newer laptop has a Pentium 4M or similar, and outperforms my main
computer for raw computational tasks, but comes with rather lame video HW
(and so still can't really play any games much newer than HL2, which runs
similarly well on my old laptop despite my old laptop being much slower in
general...).



>> to get any faster would likely involve sidestepping the assembler as well
>> (such as using a big switch and emitting bytes), but this is not
>> something
> I
>> am going to test (would make about as much sense as benchmarking it
> against
>> memcpy or similar, since yes, memcpy is faster, but no, it is not an
>> assembler...).
>
> OpenWatcom is (or was) one of the fastest C compilers I've used. It
> skipped
> emitting assembly. Given the speed, I'm sure they did much more than
> that... It might provide a reference point for a speed comparison. I
> haven't used more recent versions (I'm using v1.3). So, I'm assuming the
> speed is still there.
>

well, all this is for my assembler (written in C), but it assembles ASM
code.

note that my struct-array interface doesn't currently implement all the
features of the assembler.



From: BGB / cr88192 on

"Robbert Haarman" <comp.lang.misc(a)inglorion.net> wrote in message
news:20100328074138.GA3467(a)yoda.inglorion.net...
> On Sat, Mar 27, 2010 at 10:53:21PM -0700, BGB / cr88192 wrote:
>>
>> "cr88192" <cr88192(a)hotmail.com> wrote in message
>> news:hofus4$a0r$1(a)news.albasani.net...
>>
>> well, a status update:
>> 1.94 MB/s is the speed which can be gained with "normal" operation
>> (textual
>> interface, preprocessor, jump optimization, ...);
>> 5.28 MB/s can be gained via "fast" mode, which bypasses the preprocessor
>> and
>> forces single-pass assembly.
>>
>>
>> 10MB/s (analogue) can be gained by using a direct binary interface (newly
>> added).
>> in the case of this mode, most of the profile time goes into a few
>> predicate
>> functions, and also the function for emitting opcode bytes. somehow, I
>> don't
>> think it is likely to be getting that much faster.
>>
>> stated another way: 643073 opcodes/second, or about 1.56us/op.
>> calculating from CPU speed, this is around 3604 clock cycles / opcode
>> (CPU =
>> 2.31 GHz).
>
> To provide another data point:
>
> First, some data from /proc/cpuinfo:
>
> model name : AMD Athlon(tm) Dual Core Processor 5050e
> cpu MHz : 2600.000
> cache size : 512 KB
> bogomips : 5210.11
>

well, that is actually a faster processor than the one I am using...


> I did a quick test using the Alchemist code generation library. The
> instruction sequence I generated is:
>
> 00000000 33C0 xor eax,eax
> 00000002 40 inc eax
> 00000003 33DB xor ebx,ebx
> 00000005 83CB2A or ebx,byte +0x2a
> 00000008 CD80 int 0x80
>
> for a total of 10 bytes. Doing this 100000000 (a hundred million) times
> takes about 4.7 seconds.
>

I don't know the byte count of the output; I was measuring bytes of
textual-ASM input: "num_loops * strlen(input);" essentially.

in the structs-array case, I pre-parsed the example, but continued to
measure against this sample (as-if it were still being assembled each time).


> Using the same metrics that you provided, that is:
>
> About 200 MB/s
> About 100 million opcodes generated per second
> About 24 CPU clock cycles per opcode generated
>


yeah, but they are probably doing something differently.

I found an "alchemist code generator", but it is a commercial app which
processes XML and uses an IDE, so maybe not the one you are referencing
(seems unlikely).



my lib is written in C, and as a general rule has not been "micro-tuned
for max performance" or anything like this (and is also built with MSVC,
with debug settings).

I have been generally performance-tuning a lot of the logic, but not
actually changing much of its overall workings (since notable structural
changes would risk breaking the thing).


mine also still goes through most of the internal logic of the assembler,
mostly bypassing the front-end parser and using pre-resolved opcode numbers
and similar.

emitting each byte is still a function call, and may check for things like
the need to expand the buffer, ...
the output is still packaged into COFF objects (though little of the
COFF-related work shows up as notable in the profiler).

similarly, the logic for encoding the actual instructions is still
ASCII-character-driven logic (it loops over a string, using characters to
give commands, such as where the various prefixes go, where REX goes, and
when to place the ModRM bytes, ...). actually, the logic is driven by an
expanded form of the notation from the Intel docs...

there is very little per-instruction logic (such as instruction-specific
emitters), since this is ugly and would have made the thing larger and more
complicated (but, granted, it would have been technically faster).

hence why I say this is a case of the "switch limit", which often causes
problems for interpreters:
most of the top places in the profiler are currently switch statements...

this ASCII-driven-logic is actually the core structure of the assembler, and
so is not really removable. otherwise my tool for writing parts of my
assembler for me would have to be much more complicated (stuff is generated
from the listings, which tell about things like how the instructions are
structured, what registers exist, ...).


actually, a lot of places in my framework are based around ASCII-driven
logic (strings are used, with characters used to drive particular actions in
particular pieces of code, typically via switch statements).

this would include my x86 interpreter, which reached about 1/70th native
speed.


but, hell, people would probably really like my C compiler upper-end, as
this is essentially a huge mass of XML-processing code... (although no XSLT,
instead mostly masses of C code which recognize specific forms and work
with them...).


From: Branimir Maksimovic on
On Sun, 28 Mar 2010 08:25:45 -0700
"BGB / cr88192" <cr88192(a)hotmail.com> wrote:

>
> "Rod Pemberton" <do_not_have(a)havenone.cmm> wrote in message
> news:homvnn$f7j$1(a)speranza.aioe.org...
> > "BGB / cr88192" <cr88192(a)hotmail.com> wrote in message
> > news:homqsi$s25$1(a)news.albasani.net...
> >> [...]
> >> 10MB/s (analogue) can be gained by using a direct binary interface
> >> (newly added).
> >> in the case of this mode, most of the profile time goes into a few
> > predicate
> >> functions, and also the function for emitting opcode bytes.
> >> somehow, I
> > don't
> >> think it is likely to be getting that much faster.
> >>
> >
> > A few years ago, I posted the link below for large single file
> > programs (talking to you...). I'm not sure if you ever looked
> > their file sizes, but
> > the largest two were gcc as a single file and an ogg encoder as a
> > single file, at 3.2MB and 1.7MB respectively. Those are probably
> > the largest single file C programs you'll see. It's possible, even
> > likely, some multi-file project, say the Linux kernel etc., is
> > larger. But, 10MB/s should still be very good for most uses. But,
> > there's no reason to stop there, if you've got the time!
> >
> > http://people.csail.mit.edu/smcc/projects/single-file-programs/
> >
>
> now that I am reminded, I remember them some, but not much...
>
>
> >> stated another way: 643073 opcodes/second, or about 1.56us/op.
> >> calculating from CPU speed, this is around 3604 clock cycles /
> >> opcode (CPU
> > =
> >> 2.31 GHz).
> >
> > BTW, what brand of cpu, and what number of cores are being used?
> >
>
> AMD Athlon 64 X2 4400.
> however, all this runs in a single thread, so the number of cores
> doesn't effect much.
>
>
> internally, it runs at 2.31 GHz I think, and this becomes more
> notable when doing some types of benchmarks.
>
> my newer laptop has an Pentium 4M or similar, and outperforms my main
> computer for raw computational tasks, but comes with rather lame
> video HW (and so still can't really play any games much newer than
> HL2, which runs similarly well on my old laptop despite my old laptop
> being much slower in general...).

Well, I measured a quad-core Xeon against a dual-core Athlon slower than
yours, initializing 256 MB of RAM: four threads on the Xeon, two threads
on the Athlon, same speed.
The point is that it was the same speed with the strongest 3.2 GHz
dual-core Athlon as well.
Intel models with an external memory controller are slower with memory
than the Athlons. You need to overclock to at least a 400 MHz FSB to
compete with the Athlons.

Greets


--
http://maxa.homedns.org/

Sometimes online sometimes not


From: Robbert Haarman on
Hi cr,

On Sun, Mar 28, 2010 at 09:07:05AM -0700, BGB / cr88192 wrote:
>
> "Robbert Haarman" <comp.lang.misc(a)inglorion.net> wrote in message
> news:20100328074138.GA3467(a)yoda.inglorion.net...
> > On Sat, Mar 27, 2010 at 10:53:21PM -0700, BGB / cr88192 wrote:
> >>
> >> "cr88192" <cr88192(a)hotmail.com> wrote in message
> >> news:hofus4$a0r$1(a)news.albasani.net...
> >>
> >> 10MB/s (analogue) can be gained by using a direct binary interface (newly
> >> added).
> >> in the case of this mode, most of the profile time goes into a few
> >> predicate
> >> functions, and also the function for emitting opcode bytes. somehow, I
> >> don't
> >> think it is likely to be getting that much faster.
> >>
> >> stated another way: 643073 opcodes/second, or about 1.56us/op.
> >> calculating from CPU speed, this is around 3604 clock cycles / opcode
> >> (CPU =
> >> 2.31 GHz).
> >
> > To provide another data point:
> >
> > First, some data from /proc/cpuinfo:
> >
> > model name : AMD Athlon(tm) Dual Core Processor 5050e
> > cpu MHz : 2600.000
> > cache size : 512 KB
> > bogomips : 5210.11
> >
>
> well, that is actually a faster processor than I am using...

Yes, it is. That's why I posted it. I am sure the results I got aren't
directly comparable to yours, and the different CPU is one of the reasons.

> > I did a quick test using the Alchemist code generation library. The
> > instruction sequence I generated is:
> >
> > 00000000 33C0 xor eax,eax
> > 00000002 40 inc eax
> > 00000003 33DB xor ebx,ebx
> > 00000005 83CB2A or ebx,byte +0x2a
> > 00000008 CD80 int 0x80
> >
> > for a total of 10 bytes. Doing this 100000000 (a hundred million) times
> > takes about 4.7 seconds.
> >
>
> I don't know the bytes output, I was measuring bytes of textual-ASM input:
> "num_loops * strlen(input);" essentially.

Oh, I see. I misunderstood you there. I thought you would be measuring
bytes of output, because your input likely wouldn't be the same size for
textual input vs. binary input.

Of course, that makes the MB/s figures we got completely incomparable.
I can't produce MB/s of input assembly code for my measurements, because,
in my case, there is no assembly code being used as input.

> in the structs-array case, I pre-parsed the example, but continued to
> measure against this sample (as-if it were still being assembled each time).

Right. I could, of course, come up with some assembly code corresponding to
the instructions that I'm generating, but I don't see much point to that.
First of all, the size would vary based on how you wrote the assembly code,
and, secondly, I'm not actually processing the assembly code at all, so
I don't think the numbers would be meaningful even as an approximation.

> > Using the same metrics that you provided, that is:
> >
> > About 200 MB/s
> > About 100 million opcodes generated per second
> > About 24 CPU clock cycles per opcode generated
> >
>
>
> yeah, but they are probably doing something differently.

Clearly, with the numbers being so different. :-) The point of posting these
numbers wasn't so much to show that the same thing you are doing can be
done in fewer instructions, but rather to give an idea of how much time
the generation of executable code costs using Alchemist. This is basically
the transition from "I know which instruction I want and which operands
I want to pass to it" to "I have the instruction at this address in memory".
In particular, Alchemist does _not_ parse assembly code, perform I/O,
have a concept of labels, or decide what kind of jump instruction you need.

> I found an "alchemist code generator", but it is a commercial app which
> processes XML and uses an IDE, so maybe not the one you are referencing
> (seems unlikely).

Right. The one I am talking about is at http://libalchemist.sourceforge.net/

> my lib is written in C, and as a general rule has not been "micro-turned for
> max performance" or anything like this (and also is built with MSVC, with
> debug settings).

Right, I forgot to mention my compiler settings. The results I posted
are using gcc 4.4.1-4ubuntu9, with -march=native -pipe -Wall -s -O3
-fPIC. So that's with quite a lot of optimization, although the code for
Alchemist hasn't been optimized for performance at all.

> emitting each byte is still a function call, and may check for things like
> the need to expand the buffer, ...

I expect that this may be costly, especially with debug settings enabled.
Alchemist doesn't make a function call for each byte emitted and doesn't
automatically expand the buffer, but it does perform a range check.

> the output is still packaged into COFF objects (though little related to
> COFF is all that notable on the profiler).

Right. Alchemist doesn't know anything about object file formats. It just
gives you the raw machine code.

> there is very little per-instruction logic (such as instruction-specific
> emitters), since this is ugly and would have made the thing larger and more
> complicated (but, granted, it would have been technically faster).

That may be a major difference, too. Alchemist has different functions for
emitting different kinds of instruction. For reference, the code that
emits the "or ebx,byte +0x2a" instruction above looks like this:

/* or ebx, 42 */
n += cg_x86_emit_reg32_imm8_instr(code + n,
sizeof(code) - n,
CG_X86_OP_OR,
CG_X86_REG_EBX,
42);

There are other functions for emitting code, with names like
cg_x86_emit_reg32_reg32_instr, cg_x86_emit_imm8_instr, etc.

Each of these functions contains a switch statement that looks at the
operation (an int) and then calls an instruction-format-specific function,
substituting the actual x86 opcode for the symbolic constant. A similar
scheme is used to translate the symbolic constant for a register name to
an actual x86 register code.

You can take a look at
http://repo.or.cz/w/alchemist.git/blob/143561d2347d492c570cde96481bac725042186c:/x86/lib/x86.c
for all the gory details, if you like.

Cheers,

Bob