From: BGB / cr88192 on 2 Nov 2009 18:54

"Rod Pemberton" <do_not_have(a)nohavenot.cmm> wrote in message
news:hcnjhr$72k$1(a)aioe.org...
> "BGB / cr88192" <cr88192(a)hotmail.com> wrote in message
> news:hclak6$4ej$1(a)news.albasani.net...
>>
>> at which point the main switch became the bottleneck...
>
> How so? Too many case values? Too widely dispersed case values which
> prevent optimization? What?
>

it jumped near the top of the list in the profiler...

basically, this is because it is a more or less central location through
which nearly all control flow passes (before branching off into the
deeper internals of the interpreter).

the pattern is: one eliminates a slowdown in one place, and the app gets
faster overall, but in the profiler the load has simply shifted somewhere
else, and one can then optimize that new hot spot for further gains.
sometimes, though, the load shifts to code which can't really be
optimized, and in an interpreter, when the top load shifts to the main
switch statement (i.e., the central dispatch driving the interpreter), my
observation is that one is rapidly approaching the limit of how far the
interpreter can be optimized.

from the POV of further optimization, it is usually better if the running
time is mostly in leaf functions, since that case is usually much easier
to optimize (for example, by changing the callers so that they are called
less often). the main dispatch loop, by contrast, offers only a single
lever: it is O(n) in the number of instructions executed, so the only win
is executing fewer of them.

> Rod Pemberton
From: BGB / cr88192 on 2 Nov 2009 19:03

"Rod Pemberton" <do_not_have(a)nohavenot.cmm> wrote in message
news:hcnjj4$755$1(a)aioe.org...
> "BGB / cr88192" <cr88192(a)hotmail.com> wrote in message
> news:hckpa8$aj2$1(a)news.albasani.net...
>>
>> but, the hash is not used for opcode lookup/decoding, rather it is used
>> for grabbing already decoded instructions from a cache (which is
>> based on memory address).
>>
>
> Your hash generates, what, 64k of possible hash values or memory
> locations? What if you reduce the size to 4k? 4k/sizeof(void *)? Will
> this allow the compiler to simplify the generated assembly?
>
> The randomness doesn't have to come from multiplication. It can come
> from other sources, such as a lookup array of randomized data.
>

I tried both ways, but it does not seem to make much difference between
4k and 64k for the hash.

I used 64k figuring it would scale better, but, thinking about it, a full
hash might end up using an unreasonably large amount of memory (4MB? 16MB?
more?...), whereas a 4k hash is self-limiting, I guess (maybe ~1MB,
assuming around 256 bytes per decode-op...).

actually, I checked: with the current size of the DecodeOp structure, the
current memory limit is around 2MB with a 64k hash, and would be around
128kB with a 4k hash...

note that, generally, a multiplication is cheaper than an array lookup,
while an array lookup is in turn generally cheaper than a division or
modulo...

> Rod Pemberton
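the cache being discussed sounds like a direct-mapped table of decoded
instructions keyed by address. a hedged sketch (the struct layout, field
names, and sizes here are guesses for illustration, not the actual code)
shows where the memory ceiling comes from: the worst case is simply the
slot count times the per-entry size.

```c
#include <stdint.h>
#include <stdlib.h>

/* hypothetical decoded-instruction record; the real DecodeOp is
   evidently ~32 bytes, given the ~2MB figure for a 64k table */
typedef struct DecodeOp {
    uint64_t eip;     /* address this decode belongs to */
    int      opcode;  /* stand-in for the real decoded fields */
} DecodeOp;

#define HASH_BITS 16                   /* 64k slots; 12 for a 4k table */
#define HASH_SIZE (1u << HASH_BITS)

static DecodeOp *cache[HASH_SIZE];

static uint32_t hash_eip(uint64_t eip)
{
    /* multiplicative hash, as in the thread: cheaper than a modulo
       by a prime, and the table size is a power of two so the final
       reduction is a cheap mask */
    return (uint32_t)(((eip * 65521u) >> 16) & (HASH_SIZE - 1));
}

static DecodeOp *lookup_or_decode(uint64_t eip)
{
    uint32_t h = cache ? hash_eip(eip) : 0;
    if (cache[h] && cache[h]->eip == eip)
        return cache[h];               /* hit: reuse the prior decode */
    if (!cache[h])
        cache[h] = malloc(sizeof(DecodeOp));
    cache[h]->eip    = eip;            /* (re)decode into this slot */
    cache[h]->opcode = 0x90;           /* pretend we decoded something */
    return cache[h];
}
```

the self-limiting behavior follows directly: memory never exceeds
HASH_SIZE * sizeof(DecodeOp), since a colliding address just overwrites
its slot rather than allocating a new entry.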
From: BGB / cr88192 on 2 Nov 2009 19:29
"Rod Pemberton" <do_not_have(a)nohavenot.cmm> wrote in message
news:hcnjjs$7af$1(a)aioe.org...
> "BGB / cr88192" <cr88192(a)hotmail.com> wrote in message
> news:hcjarj$vro$1(a)news.albasani.net...
>>
>> hash EIP (currently: "((EIP*65521)>>16)&65535");
>>
>
> The shift truncates the value to the mask size. I.e., &65535 is not
> needed. Yes?
>

errm, that would be true if I were doing these calculations with 32-bit
unsigned arithmetic...

actually, most of my addressing calculations are done with 64-bit signed
arithmetic, mostly since the interpreter may also handle simulated
long-mode, and because 64-bit arithmetic is far less prone to issues
related to overflow behavior... (ctx->eip is usually a 32-bit EIP, but
may also be a 64-bit RIP, FWIW...). with 64-bit values the product does
not wrap at 2^32, so the result of the >>16 can still be wider than 16
bits, and the &65535 is needed.

granted, if I were to compile my code on a 32-bit system, this would
likely hurt performance fairly severely...