From: BGB / cr88192 on 2 Nov 2009 18:54

"Rod Pemberton" <do_not_have(a)nohavenot.cmm> wrote in message
news:hcnjhr$72k$1(a)aioe.org...
> "BGB / cr88192" <cr88192(a)hotmail.com> wrote in message
> news:hclak6$4ej$1(a)news.albasani.net...
>>
>> at which point the main switch became the bottleneck...
>
> How so? Too many case values? Too widely dispersed case values which
> prevent optimization? What?
>

it jumped near the top of the list in the profiler...

basically, this is because it is a more or less central location through
which nearly all control flow passes (before branching off into the
deeper internals of the interpreter).

the pattern is: one eliminates a slowdown in one place, and the app gets
faster overall, but in the profiler the load has simply shifted somewhere
else, and one can then optimize that new hot spot for further gains.
sometimes, though, the load shifts to code which can't really be
optimized, and in an interpreter, when the top load shifts to the main
switch statement (i.e., the central dispatch driving the interpreter), my
observation is that one is rapidly approaching the limit of how far the
interpreter can be optimized.

from the POV of further optimization, it is usually better if the running
time is mostly in leaf functions, since that case is usually much easier
to optimize (for example, by changing the callers so that they are called
less often). the main dispatch loop, by contrast, offers only a single
lever: it is O(n) in the number of instructions executed, so the only win
is executing fewer of them.

> Rod Pemberton
From: BGB / cr88192 on 2 Nov 2009 19:03

"Rod Pemberton" <do_not_have(a)nohavenot.cmm> wrote in message
news:hcnjj4$755$1(a)aioe.org...
> "BGB / cr88192" <cr88192(a)hotmail.com> wrote in message
> news:hckpa8$aj2$1(a)news.albasani.net...
>>
>> but, the hash is not used for opcode lookup/decoding, rather it is used
>> for grabbing already decoded instructions from a cache (which is
>> based on memory address).
>>
>
> Your hash generates, what, 64k of possible hash values or memory
> locations? What if you reduce the size to 4k? 4k/sizeof(void *)? Will
> this allow the compiler to simplify the generated assembly?
>
> The randomness doesn't have to come from multiplication. It can come
> from other sources, such as a lookup array of randomized data.
>

I tried both ways, but it does not seem to make much difference between
4k and 64k for the hash.

I used 64k figuring it would scale better, but, thinking about it, a full
hash might end up using an unreasonably large amount of memory (4MB? 16MB?
more?...), whereas a 4k hash is self-limiting, I guess (maybe ~1MB,
assuming around 256 bytes per decode-op...).

actually, I checked: with the current size of the DecodeOp structure, the
current memory limit is around 2MB with a 64k hash, and would be around
128kB with a 4k hash...

note that, generally, a multiplication is cheaper than an array lookup,
while an array lookup is in turn generally cheaper than a division or
modulo...

> Rod Pemberton
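the cache being discussed sounds like a direct-mapped table of decoded
instructions keyed by address. a hedged sketch (the struct layout, field
names, and sizes here are guesses for illustration, not the actual code)
shows where the memory ceiling comes from: the worst case is simply the
slot count times the per-entry size.

```c
#include <stdint.h>
#include <stdlib.h>

/* hypothetical decoded-instruction record; the real DecodeOp is
   evidently ~32 bytes, given the ~2MB figure for a 64k table */
typedef struct DecodeOp {
    uint64_t eip;     /* address this decode belongs to */
    int      opcode;  /* stand-in for the real decoded fields */
} DecodeOp;

#define HASH_BITS 16                   /* 64k slots; 12 for a 4k table */
#define HASH_SIZE (1u << HASH_BITS)

static DecodeOp *cache[HASH_SIZE];

static uint32_t hash_eip(uint64_t eip)
{
    /* multiplicative hash, as in the thread: cheaper than a modulo
       by a prime, and the table size is a power of two so the final
       reduction is a cheap mask */
    return (uint32_t)(((eip * 65521u) >> 16) & (HASH_SIZE - 1));
}

static DecodeOp *lookup_or_decode(uint64_t eip)
{
    uint32_t h = cache ? hash_eip(eip) : 0;
    if (cache[h] && cache[h]->eip == eip)
        return cache[h];               /* hit: reuse the prior decode */
    if (!cache[h])
        cache[h] = malloc(sizeof(DecodeOp));
    cache[h]->eip    = eip;            /* (re)decode into this slot */
    cache[h]->opcode = 0x90;           /* pretend we decoded something */
    return cache[h];
}
```

the self-limiting behavior follows directly: memory never exceeds
HASH_SIZE * sizeof(DecodeOp), since a colliding address just overwrites
its slot rather than allocating a new entry.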
From: BGB / cr88192 on 2 Nov 2009 19:29
"Rod Pemberton" <do_not_have(a)nohavenot.cmm> wrote in message
news:hcnjjs$7af$1(a)aioe.org...
> "BGB / cr88192" <cr88192(a)hotmail.com> wrote in message
> news:hcjarj$vro$1(a)news.albasani.net...
>>
>> hash EIP (currently: "((EIP*65521)>>16)&65535");
>>
>
> The shift truncates the value to the mask size. I.e., &65535 is not
> needed. Yes?
>

errm, that would be true if I were doing these calculations with 32-bit
unsigned arithmetic...

actually, most of my addressing calculations are done with 64-bit signed
arithmetic, mostly since the interpreter may also handle simulated
long-mode, and because 64-bit arithmetic is far less prone to issues
related to overflow behavior... (ctx->eip is usually a 32-bit EIP, but
may also be a 64-bit RIP, FWIW...). with 64-bit values the product does
not wrap at 2^32, so the result of the >>16 can still be wider than 16
bits, and the &65535 is needed.

granted, if I were to compile my code on a 32-bit system, this would
likely hurt performance fairly severely...