From: BGB / cr88192 on

"Branimir Maksimovic" <bmaxa(a)hotmail.com> wrote in message
news:20100328191433.5cf5b2f1(a)maxa...
> On Sun, 28 Mar 2010 08:25:45 -0700
> "BGB / cr88192" <cr88192(a)hotmail.com> wrote:
>

<snip>

>>
>> internally, it runs at 2.31 GHz I think, and this becomes more
>> notable when doing some types of benchmarks.
>>
>> my newer laptop has a Pentium 4M or similar, and outperforms my main
>> computer for raw computational tasks, but comes with rather lame
>> video HW (and so still can't really play any games much newer than
>> HL2, which runs similarly well on my old laptop despite my old laptop
>> being much slower in general...).
>
> Well, I measured a quad-core Xeon against a dual-core Athlon slower
> than yours, initializing 256 MB of RAM: 4 threads on the Xeon, 2 threads
> on the Athlon, same speed.
> The point is that it was the same speed with the strongest 3.2 GHz dual
> Athlon as well.
> Intel models with an external memory controller are slower with memory
> than Athlons. You need to overclock to at least a 400 MHz FSB to compete
> with Athlons.
>

well, whatever the case, my 2009-era laptop with a Pentium 4 outperforms my
2007-era desktop with an Athlon 64 X2, at least for pure CPU tasks.

I haven't really compared them with memory-intensive tasks.

I put DDR2 PC2-6400 RAM in my desktop, but the BIOS regards it as 5400 (as
does memtest86...).
I don't know what the laptop uses.


for games, the main issue is the video HW, as apparently the "Intel Mobile
Video" or whatever isn't exactly good...
my main computer has a "Radeon HD 4850".

....



From: Branimir Maksimovic on
On Sun, 28 Mar 2010 10:49:49 -0700
"BGB / cr88192" <cr88192(a)hotmail.com> wrote:

>
> "Branimir Maksimovic" <bmaxa(a)hotmail.com> wrote in message
> news:20100328191433.5cf5b2f1(a)maxa...
> > On Sun, 28 Mar 2010 08:25:45 -0700
> > "BGB / cr88192" <cr88192(a)hotmail.com> wrote:
> >
>
> <snip>
>
> >>
> >> internally, it runs at 2.31 GHz I think, and this becomes more
> >> notable when doing some types of benchmarks.
> >>
> >> my newer laptop has a Pentium 4M or similar, and outperforms my
> >> main computer for raw computational tasks, but comes with rather
> >> lame video HW (and so still can't really play any games much newer
> >> than HL2, which runs similarly well on my old laptop despite my
> >> old laptop being much slower in general...).
> >
> > Well, I measured a quad-core Xeon against a dual-core Athlon slower
> > than yours, initializing 256 MB of RAM: 4 threads on the Xeon, 2 threads
> > on the Athlon, same speed.
> > The point is that it was the same speed with the strongest 3.2 GHz
> > dual Athlon as well. Intel models with an external memory controller
> > are slower with memory than Athlons. You need to overclock to at
> > least a 400 MHz FSB to compete with Athlons.
> >
>
> well, whatever the case, my 2009-era laptop with a Pentium 4
> outperforms my 2007-era desktop with an Athlon 64 X2, at least for
> pure CPU tasks.
>
Intel Core 2 is much faster than the Athlon for CPU tasks, clock for clock,
when data is in cache, but the Athlon is faster when you have to
write a lot of data at the same time.
That's why Intel has a larger cache, to compensate for that.
The i7 changed that, as it has an internal memory controller.

Greets!

--
http://maxa.homedns.org/

Sometimes online sometimes not


From: BGB / cr88192 on

"Robbert Haarman" <inglorion(a)inglorion.net> wrote in message
news:20100328173713.GD3467(a)yoda.inglorion.net...
> Hi cr,
>
> On Sun, Mar 28, 2010 at 09:07:05AM -0700, BGB / cr88192 wrote:
>>
>> "Robbert Haarman" <comp.lang.misc(a)inglorion.net> wrote in message
>> news:20100328074138.GA3467(a)yoda.inglorion.net...
>> > On Sat, Mar 27, 2010 at 10:53:21PM -0700, BGB / cr88192 wrote:
>> >>
>> >> "cr88192" <cr88192(a)hotmail.com> wrote in message
>> >> news:hofus4$a0r$1(a)news.albasani.net...
>> >>
>> >> 10MB/s (analogue) can be gained by using a direct binary interface
>> >> (newly
>> >> added).
>> >> in the case of this mode, most of the profile time goes into a few
>> >> predicate
>> >> functions, and also the function for emitting opcode bytes. somehow, I
>> >> don't
>> >> think it is likely to be getting that much faster.
>> >>
>> >> stated another way: 643073 opcodes/second, or about 1.56us/op.
>> >> calculating from CPU speed, this is around 3604 clock cycles / opcode
>> >> (CPU =
>> >> 2.31 GHz).
>> >
>> > To provide another data point:
>> >
>> > First, some data from /proc/cpuinfo:
>> >
>> > model name : AMD Athlon(tm) Dual Core Processor 5050e
>> > cpu MHz : 2600.000
>> > cache size : 512 KB
>> > bogomips : 5210.11
>> >
>>
>> well, that is actually a faster processor than I am using...
>
> Yes, it is. That's why I posted it. I am sure the results I got aren't
> directly comparable to yours, and the different CPU is one of the reasons.
>

yep.


>> > I did a quick test using the Alchemist code generation library. The
>> > instruction sequence I generated is:
>> >
>> > 00000000 33C0 xor eax,eax
>> > 00000002 40 inc eax
>> > 00000003 33DB xor ebx,ebx
>> > 00000005 83CB2A or ebx,byte +0x2a
>> > 00000008 CD80 int 0x80
>> >
>> > for a total of 10 bytes. Doing this 100000000 (a hundred million) times
>> > takes about 4.7 seconds.
>> >
>>
>> I don't know the bytes output, I was measuring bytes of textual-ASM
>> input:
>> "num_loops * strlen(input);" essentially.
>
> Oh, I see. I misunderstood you there. I thought you would be measuring
> bytes of output, because your input likely wouldn't be the same size for
> textual input vs. binary input.
>
> Of course, that makes the MB/s figures we got completely incomparable.
> I can't produce MB/s of input assembly code for my measurements, because,
> in my case, there is no assembly code being used as input.
>

yes.

I can't directly produce (meaningful) bytes of output either, since the
output is currently in the form of unlinked COFF objects...

>> in the structs-array case, I pre-parsed the example, but continued to
>> measure against this sample (as-if it were still being assembled each
>> time).
>
> Right. I could, of course, come up with some assembly code corresponding
> to
> the instructions that I'm generating, but I don't see much point to that.
> First of all, the size would vary based on how you wrote the assembly
> code,
> and, secondly, I'm not actually processing the assembly code at all, so
> I don't think the numbers would be meaningful even as an approximation.
>

yep.


>> > Using the same metrics that you provided, that is:
>> >
>> > About 200 MB/s
>> > About 100 million opcodes generated per second
>> > About 24 CPU clock cycles per opcode generated
>> >
>>
>>
>> yeah, but they are probably doing something differently.
>
> Clearly, with the numbers being so different. :-) The point of posting
> these
> numbers wasn't so much to show that the same thing you are doing can be
> done in fewer instructions, but rather to give an idea of how much time
> the generation of executable code costs using Alchemist. This is basically
> the transition from "I know which instruction I want and which operands
> I want to pass to it" to "I have the instruction at this address in
> memory".
> In particular, Alchemist does _not_ parse assembly code, perform I/O,
> have a concept of labels, or decide what kind of jump instruction you
> need.
>

mine does all this apart from the IO.

input and output are passed as buffers, although input can also be fed into
the assembler via "print" statements, which are buffered internally; this is
one of the main ways of using the assembler.

a trivial variant is the "puts" command, which doesn't do any formatting,
and hence is a little faster if the code is pre-formed.
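
the split between a formatting "print" and a raw "puts" entry point could be
sketched roughly as below; the names (asm_print, asm_puts) and the fixed-size
internal buffer are invented for illustration and are not the assembler's
actual API:

```c
/* Sketch of a buffered text interface to an assembler: "print" runs the
 * text through printf-style formatting, "puts" appends pre-formed text
 * directly (and so skips the formatting cost).  All names hypothetical. */
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

static char asm_buf[1 << 16];   /* internal buffer the assembler consumes */
static size_t asm_len;

static void asm_puts(const char *s) {      /* no formatting: faster */
    size_t n = strlen(s);
    if (asm_len + n < sizeof asm_buf) {
        memcpy(asm_buf + asm_len, s, n);
        asm_len += n;
    }
}

static void asm_print(const char *fmt, ...) {  /* printf-style formatting */
    va_list ap;
    va_start(ap, fmt);
    /* note: vsnprintf's return value assumes no truncation here */
    asm_len += vsnprintf(asm_buf + asm_len, sizeof asm_buf - asm_len, fmt, ap);
    va_end(ap);
}
```

usage would look like `asm_print("mov eax, %d\n", 42); asm_puts("ret\n");`,
with the assembler later parsing the accumulated buffer in one go.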


>> I found an "alchemist code generator", but it is a commercial app which
>> processes XML and uses an IDE, so maybe not the one you are referencing
>> (seems unlikely).
>
> Right. The one I am talking about is at
> http://libalchemist.sourceforge.net/
>

ok.


>> my lib is written in C, and as a general rule has not been "micro-turned
>> for
>> max performance" or anything like this (and also is built with MSVC, with
>> debug settings).
>
> Right, I forgot to mention my compiler settings. The results I posted
> are using gcc 4.4.1-4ubuntu9, with -march=native -pipe -Wall -s -O3
> -fPIC. So that's with quite a lot of optimization, although the code for
> Alchemist hasn't been optimized for performance at all.
>

yeah.

MSVC's performance generally falls behind GCC's in my tests anyways...


>> emitting each byte is still a function call, and may check for things
>> like
>> the need to expand the buffer, ...
>
> I expect that this may be costly, especially with debug settings enabled.
> Alchemist doesn't make a function call for each byte emitted and doesn't
> automatically expand the buffer, but it does perform a range check.
>

the range check is used, and realloc is typically used if the buffer needs
to expand.
the default initial buffers are 4 kB and expand by a factor of 1.5; with
the example I am using, this shouldn't be an issue.
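
the buffer scheme described above (4 kB initial size, growth by a factor of
1.5 via realloc, with a range check on every emitted byte) could be sketched
like this; the names (SecBuf, OutByte) are hypothetical, not the real code:

```c
/* Minimal sketch of a growable per-section emit buffer: starts at 4 kB
 * and grows by 1.5x with realloc() whenever an emitted byte would
 * overflow it.  Illustrative only; names are invented. */
#include <stdlib.h>

typedef struct {
    unsigned char *data;
    size_t len;    /* bytes emitted so far */
    size_t cap;    /* current capacity */
} SecBuf;

static int SecBuf_Init(SecBuf *sb) {
    sb->cap = 4096;                     /* default initial buffer: 4 kB */
    sb->len = 0;
    sb->data = malloc(sb->cap);
    return sb->data ? 0 : -1;
}

static int OutByte(SecBuf *sb, unsigned char b) {
    if (sb->len >= sb->cap) {           /* range check on every emit */
        size_t ncap = sb->cap + sb->cap / 2;    /* grow by 1.5x */
        unsigned char *nd = realloc(sb->data, ncap);
        if (!nd) return -1;
        sb->data = nd;
        sb->cap = ncap;
    }
    sb->data[sb->len++] = b;
    return 0;
}
```

the per-byte function call plus range check is exactly the overhead being
discussed here; growing by 1.5x keeps the number of realloc calls
logarithmic in the output size.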


>> the output is still packaged into COFF objects (though little related to
>> COFF is all that notable on the profiler).
>
> Right. Alchemist doesn't know anything about object file formats. It just
> gives you the raw machine code.

yep, and mine produces objects which will be presumably passed to the
dynamic linker (but other common uses include writing them to files, ...).


my tests have typically excluded the dynamic linker, as it doesn't seem to
figure heavily in the benchmarks, would be difficult to benchmark, and also
tends to crash after relinking the same module into the image more than a
few k times in a row (I suspect it is likely using up too much memory or
similar...).


>> there is very little per-instruction logic (such as instruction-specific
>> emitters), since this is ugly and would have made the thing larger and
>> more
>> complicated (but, granted, it would have been technically faster).
>
> That may be a major difference, too. Alchemist has different functions for
> emitting different kinds of instruction. For reference, the code that
> emits the "or ebx,byte +0x2a" instruction above looks like this:
>
> /* or ebx, 42 */
> n += cg_x86_emit_reg32_imm8_instr(code + n,
> sizeof(code) - n,
> CG_X86_OP_OR,
> CG_X86_REG_EBX,
> 42);
>
> There are other functions for emitting code, with names like
> cg_x86_emit_reg32_reg32_instr, cg_x86_emit_imm8_instr, etc.
>
> Each of these functions contains a switch statement that looks at the
> operation (an int) and then calls an instruction-format-specific function,
> substituting the actual x86 opcode for the symbolic constant. A similar
> scheme is used to translate the symbolic constant for a register name to
> an actual x86 register code.
>
> You can take a look at
> http://repo.or.cz/w/alchemist.git/blob/143561d2347d492c570cde96481bac725042186c:/x86/lib/x86.c
> for all the gory details, if you like.
>

mine works somewhat differently, then.


in my case, the opcode number is used, and then the specific form of the
instruction for the given arguments is looked up (typically using
predicate-based matchers), and this results in a string which tells how to
emit the bytes for the opcode.

this string is passed to the "OutBodyBytes" function, which follows the
commands in the string (typically single letters telling where to put
size/addr/REX/... prefixes, apart from XOP and AVX instructions, which are
special and may use several additional characters to define the specific
prefix), and outputs literal bytes (typically represented in the command
string as hex values).

each byte is emitted via "OutByte", which deals with matters of putting the
byte into the correct section, checking if the buffer for that section needs
to expand, ...


or, IOW, it is a more generic assembler...
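
the command-string idea could be sketched as below; this models only hex
digit pairs as literal bytes plus a single invented command letter 'O' for
an operand-size prefix slot, whereas the real command language has many more
commands, so treat every name here as hypothetical:

```c
/* Sketch of a command-string-driven emitter: hex digit pairs in the
 * string become literal opcode bytes, and command letters mark where
 * prefixes are conditionally inserted.  Only 'O' (operand-size prefix)
 * is modeled; names and the command set are invented for illustration. */
#include <ctype.h>
#include <stddef.h>

static int hexval(int c) {
    if (c >= '0' && c <= '9') return c - '0';
    return tolower(c) - 'a' + 10;
}

/* Follows the commands in cmd, writing bytes to out; returns the count. */
static size_t OutBodyBytes(unsigned char *out, const char *cmd, int op16) {
    size_t n = 0;
    while (*cmd) {
        if (isxdigit((unsigned char)cmd[0]) &&
            isxdigit((unsigned char)cmd[1])) {
            /* literal byte, written as two hex digits */
            out[n++] = (unsigned char)(hexval(cmd[0]) * 16 + hexval(cmd[1]));
            cmd += 2;
        } else if (*cmd == 'O') {       /* operand-size prefix slot */
            if (op16) out[n++] = 0x66;
            cmd++;
        } else {
            cmd++;                      /* unknown commands ignored here */
        }
    }
    return n;
}
```

with this sketch, `OutBodyBytes(buf, "O33C0", 0)` emits 33 C0 (xor eax,eax),
while passing op16=1 emits 66 33 C0 (the 16-bit form); the tradeoff is that
one generic interpreter handles every instruction form, at the cost of
per-character dispatch.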



From: Waldek Hebisch on
In comp.lang.misc BGB / cr88192 <cr88192(a)hotmail.com> wrote:
>
> "cr88192" <cr88192(a)hotmail.com> wrote in message
> news:hofus4$a0r$1(a)news.albasani.net...
> > well, this was a recent argument on comp.compilers, but I figured it may
> > make some sense in a "freer" context.
> >
>
> well, a status update:
> 1.94 MB/s is the speed which can be gained with "normal" operation (textual
> interface, preprocessor, jump optimization, ...);
> 5.28 MB/s can be gained via "fast" mode, which bypasses the preprocessor and
> forces single-pass assembly.
>
>
> 10MB/s (analogue) can be gained by using a direct binary interface (newly
> added).
> in the case of this mode, most of the profile time goes into a few predicate
> functions, and also the function for emitting opcode bytes. somehow, I don't
> think it is likely to be getting that much faster.
>
> stated another way: 643073 opcodes/second, or about 1.56us/op.
> calculating from CPU speed, this is around 3604 clock cycles / opcode (CPU =
> 2.31 GHz).
>

For a little comparison: Poplog needs 0.24 s to compile about
20,000 lines of high-level code, generating about 2.4 MB of
image. Only part of the generated image is instructions; the
rest is data and relocation info. A conservative estimate is
about 10 machine instructions per high-level line, which gives
about 200,000 instructions, that is, about 800,000 instructions
per second.

Poplog generates machine code from a binary intermediate form
(slightly higher level than assembler; typically one
intermediate operation generates 1-3 machine instructions).
Code is generated in multiple passes, at least two: in the next
to last pass the code generator computes the size of the code,
then a buffer of the appropriate size is allocated, and in the
final pass the code is emitted to the buffer.

The code generator cannot generate arbitrary x86 instructions,
just the ones needed to express intermediate operations.
Bytes are emitted via function calls; opcodes and modes
are symbolic constants (textual in source, but integers
in compiled form).
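
the size-then-emit scheme can be sketched by running the same emitter twice,
once with a null buffer to measure and once to write; the opcode constants
and function names below are invented for illustration and are not Poplog's
actual interface:

```c
/* Sketch of two-pass code generation: pass 1 computes the total size,
 * then a buffer of exactly that size is allocated, and pass 2 emits
 * into it.  Opcodes are symbolic constants; all names are invented. */
#include <stdlib.h>
#include <stddef.h>

enum { OP_XOR_EAX_EAX, OP_INC_EAX };

/* Emits one operation; with out == NULL it only counts bytes. */
static size_t emit_op(unsigned char *out, int op) {
    switch (op) {
    case OP_XOR_EAX_EAX:
        if (out) { out[0] = 0x33; out[1] = 0xC0; }  /* xor eax,eax */
        return 2;
    case OP_INC_EAX:
        if (out) out[0] = 0x40;                     /* inc eax */
        return 1;
    }
    return 0;
}

static unsigned char *gen_code(const int *ops, size_t nops, size_t *size) {
    size_t i, n = 0;
    unsigned char *buf;
    for (i = 0; i < nops; i++)          /* pass 1: compute total size */
        n += emit_op(NULL, ops[i]);
    buf = malloc(n ? n : 1);            /* exactly-sized buffer */
    if (!buf) return NULL;
    n = 0;
    for (i = 0; i < nops; i++)          /* pass 2: emit into the buffer */
        n += emit_op(buf + n, ops[i]);
    *size = n;
    return buf;
}
```

the design choice here is trading an extra pass for the guarantee that the
buffer never needs to grow mid-emit, which removes the per-byte range check
the single-pass schemes discussed earlier have to pay for.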

My feeling is that trying to use strings as an intermediate form
(or even "character based dispatch") would significantly
slow down the code generator and the whole compiler.

BTW: I tried this on a 2.4 GHz Core 2. The machine is quad
core, but Poplog uses only one. The L2 cache is 4 MB per two cores
(one pair of cores shares one cache on one die; the other pair
of cores is on a second die and has its own cache). IME the Core 2
is significantly faster (about 20-30%) than a similarly clocked
Athlon 64 (I have no comparison with newer AMD processors),
so the results are not directly comparable with yours.

--
Waldek Hebisch
hebisch(a)math.uni.wroc.pl
From: BGB / cr88192 on

"Waldek Hebisch" <hebisch(a)math.uni.wroc.pl> wrote in message
news:hor3rb$jv2$1(a)z-news.wcss.wroc.pl...
> In comp.lang.misc BGB / cr88192 <cr88192(a)hotmail.com> wrote:
>>
>> "cr88192" <cr88192(a)hotmail.com> wrote in message
>> news:hofus4$a0r$1(a)news.albasani.net...
>> > well, this was a recent argument on comp.compilers, but I figured it
>> > may
>> > make some sense in a "freer" context.
>> >
>>
>> well, a status update:
>> 1.94 MB/s is the speed which can be gained with "normal" operation
>> (textual
>> interface, preprocessor, jump optimization, ...);
>> 5.28 MB/s can be gained via "fast" mode, which bypasses the preprocessor
>> and
>> forces single-pass assembly.
>>
>>
>> 10MB/s (analogue) can be gained by using a direct binary interface (newly
>> added).
>> in the case of this mode, most of the profile time goes into a few
>> predicate
>> functions, and also the function for emitting opcode bytes. somehow, I
>> don't
>> think it is likely to be getting that much faster.
>>
>> stated another way: 643073 opcodes/second, or about 1.56us/op.
>> calculating from CPU speed, this is around 3604 clock cycles / opcode
>> (CPU =
>> 2.31 GHz).
>>
>
> For a little comparison: Poplog needs 0.24 s to compile about
> 20,000 lines of high-level code, generating about 2.4 MB of
> image. Only part of the generated image is instructions; the
> rest is data and relocation info. A conservative estimate is
> about 10 machine instructions per high-level line, which gives
> about 200,000 instructions, that is, about 800,000 instructions
> per second.
>

ok.


> Poplog generates machine code from a binary intermediate form
> (slightly higher level than assembler; typically one
> intermediate operation generates 1-3 machine instructions).
> Code is generated in multiple passes, at least two: in the next
> to last pass the code generator computes the size of the code,
> then a buffer of the appropriate size is allocated, and in the
> final pass the code is emitted to the buffer.
>
> The code generator cannot generate arbitrary x86 instructions,
> just the ones needed to express intermediate operations.
> Bytes are emitted via function calls; opcodes and modes
> are symbolic constants (textual in source, but integers
> in compiled form).
>

granted, direct byte-for-byte output is much faster than what I am doing.


> My feeling is that trying to use strings as intermediate form
> (or even "character based dispatch") would significantly
> slow down code generator and the whole compiler.
>

it depends a lot, though, on how much of the overall time would actually go
into this.
text is a lot more expensive in cases where little else is going on, but
matters less in cases where there is a large amount of logic code in the
mix.

in the case of an assembler, though, the amount of internal logic is
comparatively smaller, and so string-processing tasks are overall more
expensive...


but, the bigger question here is not which is faster, but rather which
offers a better set of tradeoffs.

direct binary APIs tend to be far less generic than an assembler; for
example, they will be specialized to a particular code generator, and so are
not as useful for general-purpose tasks (say, multiple code generators using
the same assembler, some input coming from files, ...).


it is much like how XML is not as fast to work with as S-Expressions,
but XML is more flexible, making it more favorable despite its slower
speed.


> BTW: I tried this on 2.4 GHz Core 2. The machine is quad
> core, but Poplog uses only one. L2 cache is 4MB per two cores
> (one pair of cores shares one cache on one die, another pair
> of cores is on second die and has its own cache). IME Core 2
> is significantly (about 20-30% faster than similarly clocked
> Athlon 64 (I have no comparison with newer AMD processors)),
> so the results are not directly comparable with yours.
>

yeah.

I am not so familiar with Poplog either though.



> --
> Waldek Hebisch
> hebisch(a)math.uni.wroc.pl