|
From: Skybuck Flying on 23 Apr 2008 07:52 function SimulatedDiv7( Content : longword ) : longword; inline; var vEightDivisions : longword; begin result := 0; // latency 1 while Content > 7 do // latency 1 + latency 1 = latency 2 begin vEightDivisions := (Content shr 3); // latency 1 + latency 1 = latency 2 Content := (Content and 7) + vEightDivisions; // latency 1 + latency 1 = latency 2 result := result + vEightDivisions; // latency 1 end; // single loop latency: 2 + 2 + 2 + 1 = 7 if Content = 7 then // latency 1 + latency 1 = latency 2 begin result := result + 1; // latency 1 end; // best case latency: 1 + 7 + 3 = 11 // worst case latency (max loops 11 for 32 bit longwords): 1 + 7 * 11 + 3 = 81 // very close to div 7, though this routine is probably directpath mostly and not vector path // so maybe it's faster... but then again still some calling overhead.. let's see what happens // when it's inlined ;) // yup inlining works real nice. // time for a benchmark :) // tested on amd x2 3800+ // for content values ranging from say 1000 to even a million this routine appears // to be faster than the div... kinda nice. end;
From: Skybuck Flying on 23 Apr 2008 09:09 I suspected the algorithm was general. To prove it's general, here is a general version which works with any divisor. I didn't bother optimizing it though. Just a proof of concept ;) No documentation is provided for the algorithm by me... that's the fun part for you... try to figure out how it works :) LOL. If you really interested in it, and can't figure it out... let me know I might tell you LOL. // Quickly written and quickly tested, not thoroughly tested. // general version just out of curiosity... function SimulatedDivGeneral( Content : longword; Divisor : longword ) : longword; var vTemp : longword; vDivisions : longword; vBiggerDivisor : longword; vShifter : byte; vAnder : longword; vDivisorDifference : longword; begin result := 0; // latency 1 // search for bigger binary/shift divisor the the current divisor vBiggerDivisor := 1; vShifter := 0; vAnder := 0; vTemp := Divisor; while vTemp > 0 do begin vBiggerDivisor := vBiggerDivisor shl 1; vShifter := vShifter + 1; vAnder := vAnder shl 1 + 1; vTemp := vTemp shr 1; end; vDivisorDifference := vBiggerDivisor - Divisor; // main algorithm while Content > vBiggerDivisor do // latency 1 + latency 1 = latency 2 begin vDivisions := (Content shr vShifter); // latency 1 + latency 1 = latency 2 Content := (Content and vAnder) + vDivisions * vDivisorDifference; // latency 1 + latency 1 = latency 2 result := result + vDivisions; // latency 1 end; // single loop latency: 2 + 2 + 2 + 1 = 7 if Content >= Divisor then // latency 1 + latency 1 = latency 2 begin result := result + 1; // latency 1 end; // best case latency: 1 + 7 + 3 = 11 // worst case latency (max loops 11 for 32 bit longwords): 1 + 7 * 11 + 3 = 81 // very close to div 7, though this routine is probably directpath mostly and not vector path // so maybe it's faster... but then again still some calling overhead.. let's see what happens // when it's inlined ;) // yup inlining works real nice. // time for a benchmark :) // tested on amd x2 3800+ // for content values ranging from say 1000 to even a million this routine appears // to be faster than the div... kinda nice. // general routine is slower... didn't bother to optimize it or so ;) end; Bye, Skybuck.
From: Skybuck Flying on 23 Apr 2008 22:27 Care to share your benchmark program ? :) Such a big difference. Maybe it's the way you benchmarked it ;) Read the notes though: It's only faster for content range 0 to 1 million or so. Bye, Skybuck.
From: Chuck Crayne on 23 Apr 2008 23:59 On Thu, 24 Apr 2008 04:27:50 +0200 "Skybuck Flying" <BloodyShame(a)hotmail.com> wrote: > Care to share your benchmark program ? :) Care to post the generated asm code? Maybe we can speed it up even more. -- Chuck http://www.pacificsites.com/~ccrayne/charles.html
From: Skybuck Flying on 24 Apr 2008 00:54 No problemo: :) Project1.dpr.267: vResult := SimulatedDiv7v3( vIndex ); 0040902D 8BC6 mov eax,esi 0040902F 33C9 xor ecx,ecx 00409031 83F807 cmp eax,$07 00409034 7611 jbe $00409047 00409036 8BD0 mov edx,eax 00409038 C1EA03 shr edx,$03 0040903B 83E007 and eax,$07 0040903E 03C2 add eax,edx 00409040 03CA add ecx,edx 00409042 83F807 cmp eax,$07 00409045 77EF jnbe $00409036 00409047 83F807 cmp eax,$07 0040904A 7501 jnz $0040904d 0040904C 41 inc ecx 0040904D 8BD9 mov ebx,ecx Bye, Skybuck. "Chuck Crayne" <ccrayne(a)crayne.org> wrote in message news:20080423205904.4166fff5(a)thor.crayne.org... > On Thu, 24 Apr 2008 04:27:50 +0200 > "Skybuck Flying" <BloodyShame(a)hotmail.com> wrote: > >> Care to share your benchmark program ? :) > > Care to post the generated asm code? Maybe we can speed it up even more. > > -- > Chuck > http://www.pacificsites.com/~ccrayne/charles.html > >
|
Next
|
Last
Pages: 1 2 3 4 5 6 Prev: delete bytes Next: Skybuck presents alternative division method, to be benchmarked ;) |