|
Prev: Skybuck presents Faster Div 7
Next: begginer question
From: Skybuck Flying on 24 Apr 2008 10:01 The search is easily optimized if leading zero count instruction is available: vBiggerDivisor := 1 shl 32-LeadingZeroCount; vShifter := 1 shl 32-LeadingZeroCount; vAnder := $FFFFFFFF shr LeadingZeroCount; No more while vTemp > 0 loop necessary. Interesting. Now I go look for a leading zero count replacement for my AMD X2 3800+ (it doesn't have the instruction :)) Then I am gonna give it another benchmark try ;) :) Bye, Skybuck. "Skybuck Flying" <BloodyShame(a)hotmail.com> wrote in message news:230c6$480f33a2$541983fa$18551(a)cache4.tilbu1.nb.home.nl... >I suspected the algorithm was general. > > To prove it's general, here is a general version which works with any > divisor. > > I didn't bother optimizing it though. > > Just a proof of concept ;) > > No documentation is provided for the algorithm by me... that's the fun > part for you... try to figure out how it works :) LOL. > > If you really interested in it, and can't figure it out... let me know I > might tell you LOL. > > // Quickly written and quickly tested, not thoroughly tested. > > // general version just out of curiosity... > function SimulatedDivGeneral( Content : longword; Divisor : longword ) : > longword; > var > vTemp : longword; > vDivisions : longword; > vBiggerDivisor : longword; > vShifter : byte; > vAnder : longword; > vDivisorDifference : longword; > begin > result := 0; // latency 1 this piece not necessary anymore and can be replaced by the 4 (?) instructions above. Very nice. > // search for bigger binary/shift divisor the the current divisor > vBiggerDivisor := 1; > vShifter := 0; > vAnder := 0; > > vTemp := Divisor; > > while vTemp > 0 do > begin > vBiggerDivisor := vBiggerDivisor shl 1; > vShifter := vShifter + 1; > vAnder := vAnder shl 1 + 1; > vTemp := vTemp shr 1; > end; > > vDivisorDifference := vBiggerDivisor - Divisor; > > // main algorithm > while Content > vBiggerDivisor do // latency 1 + latency 1 = latency 2 > begin > vDivisions := (Content shr vShifter); // latency 1 + latency 1 = latency > 2 > Content := (Content and vAnder) + vDivisions * vDivisorDifference; // > latency 1 + latency 1 = latency 2 > result := result + vDivisions; // latency 1 > end; > > // single loop latency: 2 + 2 + 2 + 1 = 7 > if Content >= Divisor then // latency 1 + latency 1 = latency 2 > begin > result := result + 1; // latency 1 > end; > > // best case latency: 1 + 7 + 3 = 11 > // worst case latency (max loops 11 for 32 bit longwords): 1 + 7 * 11 + 3 > = 81 > > // very close to div 7, though this routine is probably directpath mostly > and not vector path > // so maybe it's faster... but then again still some calling overhead.. > let's see what happens > // when it's inlined ;) > // yup inlining works real nice. > > // time for a benchmark :) > > // tested on amd x2 3800+ > // for content values ranging from say 1000 to even a million this routine > appears > // to be faster than the div... kinda nice. > > // general routine is slower... didn't bother to optimize it or so ;) > end; Bye, Skybuck.
From: Skybuck Flying on 24 Apr 2008 10:13 "Skybuck Flying" <BloodyShame(a)hotmail.com> wrote in message news:84595$4810914c$541983fa$16889(a)cache4.tilbu1.nb.home.nl... > The search is easily optimized if leading zero count instruction is > available: > > vBiggerDivisor := 1 shl 32-LeadingZeroCount; Hmm nope I think I made a little mistake here: > vShifter := 1 shl 32-LeadingZeroCount; This probably needs to be something like: vShifter := 32-LeadingZeroCount; I'll figure it out. Stay tuned :) Bye, Skybuck.
From: Skybuck Flying on 24 Apr 2008 10:29 Ok, This one slightly faster. However the leading zero count assembler function can't be inlined so this means calling overhead is added. To bad ;) Still it's slightly faster than the previous general function. I am just too lazy to write a pascal version which searches for the first bit set. I am gonna try that next... lol me so lazy :) // not sure if this assembler version will always work if compiler changes // but so far it seems to work ok ;) :) // to bad it can't be inlined... maybe try that later. function LeadingZeroCount( const Value : longword ) : longword; asm push ebx bsr ebx,eax mov eax,32 jz @done sub eax,ebx dec eax @done: pop ebx end; // general version just out of curiosity... function SimulatedDivGeneralV2( Content : longword; Divisor : longword ) : longword; inline; var vLeadingZeroCount : byte; vShifter : byte; vDivisions : longword; vBiggerDivisor : longword; vAnder : longword; vDivisorDifference : longword; begin result := 0; // latency 1 // replacement vLeadingZeroCount := LeadingZeroCount( Divisor ); vShifter := 32 - vLeadingZeroCount; vBiggerDivisor := 1 shl vShifter; vAnder := $FFFFFFFF shr vLeadingZeroCount; vDivisorDifference := vBiggerDivisor - Divisor; // main algorithm while Content > vBiggerDivisor do // latency 1 + latency 1 = latency 2 begin vDivisions := (Content shr vShifter); // latency 1 + latency 1 = latency 2 Content := (Content and vAnder) + vDivisions * vDivisorDifference; // latency 1 + latency 1 = latency 2 result := result + vDivisions; // latency 1 end; // single loop latency: 2 + 2 + 2 + 1 = 7 if Content >= Divisor then // latency 1 + latency 1 = latency 2 begin result := result + 1; // latency 1 end; // best case latency: 1 + 7 + 3 = 11 // worst case latency (max loops 11 for 32 bit longwords): 1 + 7 * 11 + 3 = 81 // very close to div 7, though this routine is probably directpath mostly and not vector path // so maybe it's faster... but then again still some calling overhead.. let's see what happens // when it's inlined ;) // yup inlining works real nice. // time for a benchmark :) // tested on amd x2 3800+ // for content values ranging from say 1000 to even a million this routine appears // to be faster than the div... kinda nice. // general routine is slower... didn't bother to optimize it or so ;) end; Bye, Skybuck.
From: Skybuck Flying on 24 Apr 2008 10:51 Hmm it's interesting to see that the dude was correct. By swapping around the additions it becomes a little bit faster: // faster code: // main algorithm while Content > vBiggerDivisor do // latency 1 + latency 1 = latency 2 begin vDivisions := (Content shr vShifter); // latency 1 + latency 1 = latency 2 result := result + vDivisions; // latency 1 Content := (Content and vAnder) + vDivisions * vDivisorDifference; // latency 1 + latency 1 = latency 2 end; WEIRD ?! Bye, Skybuck ;)
From: Skybuck Flying on 24 Apr 2008 11:09
Hmm kinda strange. The V2 which uses the assembler routine (leading zero count) is even slower then my simple while loop with the shifts when big ranges need to be calculated etc... weird. I have no explanation for it, except... maybe bsr takes longer... when range longer or something... me dont know. Oh well ;) Bye, Skybuck. |