From: BGB / cr88192 on

"Rod Pemberton" <do_not_have(a)nohavenot.cmm> wrote in message
news:hdcr5j$dis$1(a)aioe.org...
> "BGB / cr88192" <cr88192(a)hotmail.com> wrote in message
> news:hdc5b7$p2m$1(a)news.albasani.net...
>>
>> [POSIX support code]
>>
>> well, what is to say they are compatible with my implementation?...
>>
>
> Nothing... But, you mentioned PDPCLIB somewhere. I don't think it has
> them either, does it?
>

checking the provided links...


checking:
yeah... a lot of this code is almost as old as I am...
also generally pre-ANSI, ...


the readdir code apparently opens directories as files and reads them
directly.
this will not work in my case, as my VFS does not implement this ability.

but maybe a few 'bits and pieces' could still be usable...


>
> Rod Pemberton
>
>


From: Rod Pemberton on
"BGB / cr88192" <cr88192(a)hotmail.com> wrote in message
news:hdd4n6$bcj$1(a)news.albasani.net...
>
> a lot of this code is almost as old as I am...

:-)

The GNU project started in 1983. The Linux kernel started in 1991.

Hmm, I wonder if I can find code from the 1970's in a modern GNU codebase
such as DJGPP? Yes. After five minutes of looking, I found 1976 in GAWK's
io.c.

Copyright dates from the code in the DJGPP (GNU codebase) compiler: 2004,
2003, 2001, 2000, (many 1990's), 1989, 1987, 1986, 1985, 1984, 1983, ...,
1980, 1979, 1976

Admittedly, many of the dates are not of the compiler proper (GCC), but of
other GNU packages, e.g., GAWK, and DJGPP-contributed packages. The bulk of
the dates in DJGPP seems to be from 1987 to 2002, with many around the early
1990's (Linux), mid-to-late 1990's, and early 2000's. DJGPP also uses its
own C library, so my search didn't get dates for GLIBC. I'd suspect some
old code still lingers there.


Rod Pemberton


From: BGB / cr88192 on

"Rod Pemberton" <do_not_have(a)nohavenot.cmm> wrote in message
news:hdemdf$ft7$1(a)aioe.org...
> "BGB / cr88192" <cr88192(a)hotmail.com> wrote in message
> news:hdd4n6$bcj$1(a)news.albasani.net...
>>
>> a lot of this code is almost as old as I am...
>
> :-)
>
> The GNU project started in 1983. The Linux kernel started in 1991.
>

yep...

never mind that the GNU project started shortly before I was born, and I was
not exactly all that old when Linux started...


> Hmm, I wonder if I can find code from the 1970's in a modern GNU codebase
> such as DJGPP? Yes. After five minutes of looking, I found 1976 in GAWK's
> io.c.
>

yes, maybe this is why K&R-style declarations never die...


> Copyright dates from the code in the DJGPP (GNU codebase) compiler: 2004,
> 2003, 2001, 2000, (many 1990's), 1989, 1987, 1986, 1985, 1984, 1983, ...,
> 1980, 1979, 1976
>
> Admittedly, many of the dates are not of the compiler proper (GCC), but of
> other GNU packages, e.g., GAWK, and DJGPP-contributed packages. The bulk of
> the dates in DJGPP seems to be from 1987 to 2002, with many around the
> early 1990's (Linux), mid-to-late 1990's, and early 2000's. DJGPP also
> uses its own C library, so my search didn't get dates for GLIBC. I'd
> suspect some old code still lingers there.
>

yes, ok.


I did go and beat together readdir support now, where a lot more of this
work is going on in the interpreter than in the VM code...

one 'slight' issue that popped up was that the interface I am using (in
native land) has no analogue of seekdir/telldir, but quickly enough I had
the idea that 'readdir' would keep count of how many directory entries it
had seen, and seekdir would basically just do a rewinddir followed by N
readdir calls (meaning there is no need to go and add this concept to the
lower-end APIs...).
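
a minimal sketch of that emulation, with hypothetical names (my_DIR,
vfs_readdir, vfs_rewinddir) standing in for the actual VFS/interpreter
interfaces, would look roughly like this:

/* hypothetical VFS-side calls; the real interface differs */
struct my_dirent { char d_name[260]; };
extern int  vfs_readdir  (void *dirh, struct my_dirent *out);  /* 0 at end */
extern void vfs_rewinddir(void *dirh);

typedef struct {
    void            *dirh;   /* handle handed back by the VFS */
    long             pos;    /* entries handed out so far */
    struct my_dirent ent;    /* buffer returned by readdir */
} my_DIR;

struct my_dirent *my_readdir(my_DIR *d)
{
    if (!vfs_readdir(d->dirh, &d->ent))
        return 0;
    d->pos++;                                /* count entries seen */
    return &d->ent;
}

long my_telldir(my_DIR *d) { return d->pos; }

void my_seekdir(my_DIR *d, long loc)
{
    vfs_rewinddir(d->dirh);                  /* rewind, then re-read N */
    d->pos = 0;
    while (d->pos < loc && my_readdir(d))
        ;
}

the rewind-plus-N-reads loop is obviously O(n) per seek, but it keeps the
seek/tell concept entirely out of the lower-level APIs, which was the point.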

involved in the process was a rather severe rewrite of the native-land
file-IO code (mostly changing the method of interfacing the interpreter with
my VFS), ...


the header I am using (came from MinGW, also PD), also includes _WDIR,
_wreaddir, ... (apparently for unicode versions), and I am not sure if I
should bother implementing them.

technically, in my case the 'ASCII' versions are using UTF-8 (actually, it
is the 'Modified UTF-8' scheme from the JVM, which is sort of the de-facto
charset in my codebase), so the main issue would be to add code for doing
string conversions to this library.
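
for reference, 'Modified UTF-8' differs from plain UTF-8 in only two ways:
U+0000 is written as the overlong pair 0xC0 0x80, and supplementary
characters stay as two separately-encoded UTF-16 surrogates (so there are
never any 4-byte sequences). a rough sketch of the UTF-16 to Modified UTF-8
direction (a hypothetical helper, not the actual library code):

#include <stddef.h>

size_t utf16_to_mutf8(const unsigned short *src, size_t n, unsigned char *dst)
{
    size_t j = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned c = src[i];
        if (c == 0) {                      /* NUL -> overlong 0xC0 0x80 */
            dst[j++] = 0xC0; dst[j++] = 0x80;
        } else if (c < 0x80) {             /* ASCII, 1 byte */
            dst[j++] = (unsigned char)c;
        } else if (c < 0x800) {            /* 2 bytes */
            dst[j++] = (unsigned char)(0xC0 | (c >> 6));
            dst[j++] = (unsigned char)(0x80 | (c & 0x3F));
        } else {                           /* 3 bytes; surrogates pass through */
            dst[j++] = (unsigned char)(0xE0 | (c >> 12));
            dst[j++] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
            dst[j++] = (unsigned char)(0x80 | (c & 0x3F));
        }
    }
    dst[j] = 0;                            /* result is always NUL-terminated */
    return j;
}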

however, if I were to do this, I would likely have to address pdpclib's lack
of 'wstring.h' (AKA: go and find some "nice" way to address adding in C99
functionality), ...

the original author expressed doubts about adding C99 functionality (I guess
because it could interfere with simplicity and usability on more
minimalistic systems), and I would personally like to avoid an unnecessary
fork; however, I would like to avoid needing a separate DLL for this (AKA: I
would like it if all the core C runtime stuff was in the same DLL), ...


oh well, other details:
for some not easily understood reason, using a linear search is currently
faster for address-mapping resolution than a binary search.

my guess is that with (currently) only about 5 mappings (the EXE, the
pdpclib and vxcore DLLs, the stack, and the heap), the overhead of the
binary search is higher than that of just a plain 'for' loop.

actually, I think it is because the linear loop in this case will find the
item in an average of 2.5 steps, whereas the binary search would take 3
steps + a final adjustment.
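
for illustration, the kind of table being searched here is tiny, something
along these lines (the names are made up, not the actual code):

typedef struct {
    unsigned int   base;    /* first guest address covered by the span */
    unsigned int   limit;   /* one past the last guest address */
    unsigned char *mem;     /* host memory backing the span */
} vm_span;

static vm_span vm_spans[64];   /* kept sorted by base, non-overlapping */
static int     vm_nspans;      /* currently only around 5 entries */

with n that small, the cost of each step (one compare and a predictable
branch) matters far more than the number of steps.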


I added partial address randomization (currently for heap and stack), but
was having difficulty with randomizing the EXE and DLL's (I would get a VM
GPF...).

I realize now that I had not verified that the EXE was not
relocation-stripped; the relocations are necessary to properly re-base the
PE/COFF image (I could add a check for this, and in that case maybe look
into telling the linker to produce relocatable EXE's...).

I may also need to look into "minimum alignment" issues (for tests, I was
keeping a 16-byte alignment). I am not sure if any code can "reasonably"
depend on the exact alignment of its load address though ("I'm so totally
going to GPF if my global array is not 4KiB aligned..."), so maybe it is ok
for now.

just checked, yep.

it now verifies that the image has relocs before randomizing the address,
and this makes the thing work. granted, in this case, the EXE is not
randomized, but DLL's seem to work ok...


basically, this is a simplistic version, where it mostly just jitters things
in 16-byte steps, with a max of 256 steps, meaning a 4KiB overhead per
structure (loaded DLL, stack, heap-segment, ...). this should be small
enough to be ignorable (a 0.4% heap overhead, ...).
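
the check-and-jitter step is roughly the following (IMAGE_FILE_RELOCS_STRIPPED
is the standard COFF Characteristics flag; the rest is a sketch with made-up
names, not the actual loader code):

#include <stdlib.h>

#define IMAGE_FILE_RELOCS_STRIPPED 0x0001   /* COFF Characteristics bit */

unsigned int pick_load_base(unsigned int pref_base, unsigned short coff_chars)
{
    unsigned int jitter;

    /* no relocs: the image can only live at its preferred base */
    if (coff_chars & IMAGE_FILE_RELOCS_STRIPPED)
        return pref_base;

    /* 0..255 steps of 16 bytes = at most 4KiB of slack per image */
    jitter = (unsigned int)(rand() & 255) * 16;
    return pref_base + jitter;
}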


or such...



From: Rod Pemberton on
"BGB / cr88192" <cr88192(a)hotmail.com> wrote in message
news:hdf0kv$d18$1(a)news.albasani.net...
>
> I did go and beat together readdir support now

Already? Young dude, you are fast...

> however, if I were to do this, I would likely have to address pdpclib's
> lack of 'wstring.h' (AKA: go and find some "nice" way to address adding in
> C99 functionality), ...

I'm not sure about wstring.h, but I'm sure I posted the link a few times for
Doug Gwyn's "Instant C9x":
http://www.lysator.liu.se/c/q8/index.html

> the original author expressed doubts about adding C99 functionality (I
> guess because it could interfere with simplicity and usability on more
> minimalistic systems),

PDPCLIB? That could be one issue. Another issue might be size. Size is
why Paul Edwards (kerravon) said they are using PDPCLIB for the GCCMVS
project instead of GLIBC. Sigh. That's one more thing I still need to do:
set up an MVS380 environment so I can test some of my code with EBCDIC...

http://gccmvs.sourceforge.net/
http://mvs380.sourceforge.net/
http://sourceforge.net/projects/pdos/

> and I would personally like to avoid an unnecessary
> fork; however, I would like to avoid needing a separate DLL for this (AKA:
> I would like it if all the core C runtime stuff was in the same DLL), ...

MSVCRT... Oops, wrong library. :)

> for some not easily understood reason, using a linear search is currently
> faster for address-mapping resolution than a binary search.
....

While I understand what big-O representation is for, I don't know what
either would be or which should take more time in this case. As the length
of the linear search gets longer, it should take more time. One would think
a binary search was more "compact" or shorter due to searching fewer nodes.
Binary trees can get quite large also. It's possible the binary tree is not
always the optimal method, but the optimal method on average. I.e.,
something at the end of a large linear search won't be optimal, but
something at the beginning will be. In a binary tree, many values will be
"average", i.e., the search will neither be very long, nor be very short.
I'm fairly sure the graph of a linear search is linear, while the graph of
the binary search could be a very chaotic plot or a very smooth curve in any
direction or shape depending on how the nodes are balanced, where they are
inserted, or on what branch decision is used.

What I do know is that a repeat prefix on an x86 string instruction is very
fast. I believe it's fast enough that converting from a rep to a loop or
using a jcc branch, e.g., needed to implement some theoretically more
optimal search solution, can completely kill the performance of the
theoretically optimal solution. Now, it seems you didn't have much data in
your search. What happens if you put in lots of data just for kicks? Is my
belief correct? I.e., that the linear search with rep and string
instructions is faster than a binary search when it shouldn't be in
theory... Well, you're doing this in C, right? Never mind.

> I added partial address randomization (currently for heap and stack), but
> was having difficulty with randomizing the EXE and DLL's (I would get a VM
> GPF...).

In the compiler? I.e., not inside an OS. Interesting. It's probably the
best place to do so. I'm unaware of anyone else doing so.

> I may also need to look into "minimum alignment" issues (for tests, I was
> keeping a 16-byte alignment). I am not sure if any code can "reasonably"
> depend on the exact alignment of its load address though ("I'm so totally
> going to GPF if my global array is not 4KiB aligned..."),

Self re-aligning code? Nah, you'd need the code segment to be rewritable...


Rod Pemberton


From: BGB / cr88192 on

"Rod Pemberton" <do_not_have(a)nohavenot.cmm> wrote in message
news:hdg501$2gf$1(a)aioe.org...
> "BGB / cr88192" <cr88192(a)hotmail.com> wrote in message
> news:hdf0kv$d18$1(a)news.albasani.net...
>>
>> I did go and beat together readdir support now
>
> Already? Young dude, you are fast...
>

it is mostly about going and patching things together...

the low-level VFS already had something analogous to readdir, so I first had
to route it into the interpreter (via a new interface struct), write some
utility wrappers, and add the logic necessary for the syscalls.
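
the 'new interface struct' is basically just a table of function pointers;
something along these lines (hypothetical names and signatures):

typedef struct VFS_DirOps_s {
    void *(*opendir)  (const char *path);
    int   (*readdir)  (void *dirh, char *name, int namesz);  /* 0 at end */
    void  (*rewinddir)(void *dirh);
    int   (*closedir) (void *dirh);
} VFS_DirOps;

/* the VFS fills this in at startup; the syscall handlers only ever call
   through the pointers, so the interpreter never sees the VFS internals */
extern VFS_DirOps *vm_dirops;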


>> however, if I were to do this, I would likely have to address pdpclib's
>> lack of 'wstring.h' (AKA: go and find some "nice" way to address adding
>> in C99 functionality), ...
>
> I'm not sure about wstring.h, but I'm sure I posted the link a few times
> for
> Doug Gwyn's "Instant C9x":
> http://www.lysator.liu.se/c/q8/index.html
>

didn't see this, but will look at it.


>> the original author expressed doubts about adding C99 functionality (I
>> guess because it could interfere with simplicity and usability on more
>> minimalistic systems),
>
> PDPCLIB? That could be one issue. Another issue might be size. Size is
> why Paul Edwards (kerravon) said they are using PDPCLIB for GCCMVS project
> instead of GLIBC. Sigh. That's one more thing I still need to do: setup
> an
> MVS380 environment so I can test some of my code with EBCDIC...
>
> http://gccmvs.sourceforge.net/
> http://mvs380.sourceforge.net/
> http://sourceforge.net/projects/pdos/
>

I don't support EBCDIC, so I presume to live in a world where only UTF-8
exists...

granted, proper C support requires baseline support for codepages/... but oh
well...


>> and I would personally like to avoid an unnecessary
>> fork; however, I would like to avoid needing a separate DLL for this (AKA:
>> I would like it if all the core C runtime stuff was in the same DLL), ...
>
> MSVCRT... Oops, wrong library. :)
>

MSVCRT is not open source, and likely depends on a lot of facilities I don't
provide (core Windows syscalls, ...).


>> for some not easily understood reason, using a linear search is currently
>> faster for address-mapping resolution than a binary search.
> ...
>
> While I understand what big-O representation is for, I don't know what
> either would be or which should take more time in this case. As the
> length
> of the linear search gets longer, it should take more time. One would
> think
> a binary search was more "compact" or shorter due to searching fewer
> nodes.
> Binary trees can get quite large also. It's possible the binary tree is
> not
> always the optimal method, but the optimal method on average. I.e.,
> something at the end of a large linear search won't be optimal, but
> something at the beginning will be. In a binary tree, many values will be
> "average", i.e., the search will neither be very long, nor be very short.
> I'm fairly sure the graph of a linear search is linear, while the graph of
> the binary search could be a very chaotic plot or a very smooth curve in
> any direction or shape depending on how the nodes are balanced, where they
> are inserted, or on what branch decision is used.
>

I was doing the array version of a binary search (no trees, only a sorted
array).

the issue is that I think the average number of steps is larger, since for
small n,
ceil(log2 n) > n/2

this is because the variant of binary search I use takes ceil(log2 n) steps
(I had failed to improve upon this, as an "early out" check cost a lot
more than it saved, ...), whereas on average a linear search will find the
result "somewhere near the middle", i.e. in about n/2 steps:

 n   ceil(log2 n)   n/2
 1        -          -
 2        1          1
 3        2          1.5
 4        2          2
 5        3          2.5
 6        3          3
 7        3          3.5
 8        3          4
 9        4          4.5
10        4          5
11        4          5.5
12        4          6
13        4          6.5
14        4          7
15        4          7.5
16        4          8

so, for n<10 or so, it is really a toss-up in terms of complexity.

for n=5, binary has a slightly higher complexity than linear, as well as
being an otherwise more complicated piece of code...


similarly, I had also made a slight tweak in the linear search to exploit
the sorting (the first 'if' check either results in a 'continue' or, if it
fails, indicates a terminal position in the loop, meaning the item is either
found or absent; see the sketch below).

so, for a binary search to be faster, n would need to be larger...
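
in code, the two variants look roughly like this (hypothetical names, using
the same sort of small sorted span table sketched a few posts back):

typedef struct { unsigned int base, limit; unsigned char *mem; } vm_span;
extern vm_span vm_spans[];    /* sorted by base, non-overlapping */
extern int     vm_nspans;     /* ~5 at present */

/* linear scan exploiting the sort: the first compare either skips ahead,
   or proves the answer lies in (or is missing from) the current slot */
int vm_find_span_lin(unsigned int addr)
{
    int i;
    for (i = 0; i < vm_nspans; i++) {
        if (addr >= vm_spans[i].limit)
            continue;                           /* still past this span */
        return (addr >= vm_spans[i].base) ? i : -1;
    }
    return -1;
}

/* array-style binary search: ceil(log2 n) halvings plus a final check */
int vm_find_span_bin(unsigned int addr)
{
    int lo = 0, hi = vm_nspans - 1;
    while (lo < hi) {
        int mid = (lo + hi) >> 1;
        if (addr >= vm_spans[mid].limit)
            lo = mid + 1;
        else
            hi = mid;
    }
    if (lo >= 0 && lo < vm_nspans &&
        addr >= vm_spans[lo].base && addr < vm_spans[lo].limit)
        return lo;
    return -1;
}

at n=5, the linear version averages roughly 2.5 of those cheap
compare-and-continue steps, versus 3 halvings plus the final range check for
the binary version, which matches the table above.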


> What I do know is that a repeat prefix on an x86 string instruction is
> very
> fast. I believe it's fast enough that converting from a rep to a loop or
> using a jcc branch, e.g., needed to implement some theoretically more
> optimal search solution, can completely kill the performance of the
> theoretically optimal solution. Now, it seems you didn't have much data
> in
> your search. What happens if you put in lots of data just for kicks? Is
> my
> belief correct? I.e., that the linear search with rep and string
> instructions is faster than a binary search when it shouldn't be in
> theory... Well, you're doing this in C, right? Never mind.
>

yep...

this is not for string lookups, this is for mapping addresses to memory
spans.
hence, it is an operation performed pretty much every time an opcode
attempts to access a memory operand...

mov eax, [ebp+8]
add [eax+16], ecx
....

as is, this operation is also significant in the running time, but there
seems to be no good way at present to speed it up (even a hash would not
likely help here, as it would likely add more overhead than it saves...).
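
so every memory operand ends up going through something like the following
(again a sketch, reusing the hypothetical lookup and span table from the
earlier snippet):

#include <stdint.h>

extern void vm_raise_gpf(unsigned int addr);    /* hypothetical fault path */

uint32_t vm_read32(unsigned int addr)
{
    int i = vm_find_span_lin(addr);             /* per-access span lookup */
    if (i < 0) {
        vm_raise_gpf(addr);
        return 0;
    }
    return *(uint32_t *)(vm_spans[i].mem + (addr - vm_spans[i].base));
}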


>> I added partial address randomization (currently for heap and stack), but
>> was having difficulty with randomizing the EXE and DLL's (I would get a
>> VM
>> GPF...).
>
> In the compiler? I.e., not inside an OS. Interesting. It's probably the
> best the place to do so. I'm unaware of anyone else doing so.
>

no, it is in the PE/COFF loader...


>> I may also need to look into "minimum alignment" issues (for tests, I was
>> keeping a 16-byte alignment). I am not sure if any code can "reasonably"
>> depend on the exact alignment of its load address though ("I'm so totally
>> going to GPF if my global array is not 4KiB aligned..."),
>
> Self re-aligning code? Nah, you'd need the code segment to be
> rewritable...
>

I meant, just loading in an EXE and DLL and shifting it by a random multiple
of 16 bytes, and doing the DLL rebase magic...

for now, it seems to work ok, but I am not sure if anything will depend on a
coarser alignment, and thus risk breaking...
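
for what it's worth, the "rebase magic" for a 32-bit PE image amounts to
walking the .reloc blocks and adding the load delta to each HIGHLOW fixup;
roughly like this (structure layout per the PE/COFF spec, everything else
simplified and with made-up names):

#include <stdint.h>

typedef struct {
    uint32_t VirtualAddress;    /* page RVA this block applies to */
    uint32_t SizeOfBlock;       /* block size, including this header */
} BaseRelocBlock;

/* delta = actual load base - preferred ImageBase */
void apply_relocs(uint8_t *image, uint32_t delta,
                  uint8_t *reloc, uint32_t reloc_size)
{
    uint8_t *end = reloc + reloc_size;

    while (reloc + sizeof(BaseRelocBlock) <= end) {
        BaseRelocBlock *blk = (BaseRelocBlock *)reloc;
        uint16_t *ent = (uint16_t *)(blk + 1);
        uint32_t  i, n = (blk->SizeOfBlock - sizeof(*blk)) / 2;

        if (blk->SizeOfBlock == 0)
            break;                              /* malformed block, stop */

        for (i = 0; i < n; i++) {
            uint16_t type = (uint16_t)(ent[i] >> 12);
            uint16_t offs = (uint16_t)(ent[i] & 0x0FFF);
            if (type == 3)  /* IMAGE_REL_BASED_HIGHLOW */
                *(uint32_t *)(image + blk->VirtualAddress + offs) += delta;
            /* type 0 (ABSOLUTE) is just padding; anything else is ignored */
        }
        reloc += blk->SizeOfBlock;
    }
}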


>
> Rod Pemberton
>
>