From: Skybuck Flying on
function SimulatedDiv7( Content : longword ) : longword; inline;
var
vEightDivisions : longword;
begin
result := 0; // latency 1

while Content > 7 do // latency 1 + latency 1 = latency 2
begin
vEightDivisions := (Content shr 3); // latency 1 + latency 1 = latency 2
Content := (Content and 7) + vEightDivisions; // latency 1 + latency 1 =
latency 2
result := result + vEightDivisions; // latency 1
end;

// single loop latency: 2 + 2 + 2 + 1 = 7

if Content = 7 then // latency 1 + latency 1 = latency 2
begin
result := result + 1; // latency 1
end;

// best case latency: 1 + 7 + 3 = 11
// worst case latency (max loops 11 for 32 bit longwords): 1 + 7 * 11 + 3 =
81

// very close to div 7, though this routine is probably directpath mostly
and not vector path
// so maybe it's faster... but then again still some calling overhead..
let's see what happens
// when it's inlined ;)
// yup inlining works real nice.

// time for a benchmark :)

// tested on amd x2 3800+
// for content values ranging from say 1000 to even a million this routine
appears
// to be faster than the div... kinda nice.

end;


From: Skybuck Flying on
I suspected the algorithm was general.

To prove it's general, here is a general version which works with any
divisor.

I didn't bother optimizing it though.

Just a proof of concept ;)

No documentation is provided for the algorithm by me... that's the fun part
for you... try to figure out how it works :) LOL.

If you really interested in it, and can't figure it out... let me know I
might tell you LOL.

// Quickly written and quickly tested, not thoroughly tested.

// general version just out of curiosity...
function SimulatedDivGeneral( Content : longword; Divisor : longword ) :
longword;
var
vTemp : longword;
vDivisions : longword;
vBiggerDivisor : longword;
vShifter : byte;
vAnder : longword;
vDivisorDifference : longword;
begin
result := 0; // latency 1

// search for bigger binary/shift divisor the the current divisor
vBiggerDivisor := 1;
vShifter := 0;
vAnder := 0;

vTemp := Divisor;

while vTemp > 0 do
begin
vBiggerDivisor := vBiggerDivisor shl 1;
vShifter := vShifter + 1;
vAnder := vAnder shl 1 + 1;
vTemp := vTemp shr 1;
end;

vDivisorDifference := vBiggerDivisor - Divisor;

// main algorithm
while Content > vBiggerDivisor do // latency 1 + latency 1 = latency 2
begin
vDivisions := (Content shr vShifter); // latency 1 + latency 1 = latency 2
Content := (Content and vAnder) + vDivisions * vDivisorDifference; //
latency 1 + latency 1 = latency 2
result := result + vDivisions; // latency 1
end;

// single loop latency: 2 + 2 + 2 + 1 = 7
if Content >= Divisor then // latency 1 + latency 1 = latency 2
begin
result := result + 1; // latency 1
end;

// best case latency: 1 + 7 + 3 = 11
// worst case latency (max loops 11 for 32 bit longwords): 1 + 7 * 11 + 3 =
81

// very close to div 7, though this routine is probably directpath mostly
and not vector path
// so maybe it's faster... but then again still some calling overhead..
let's see what happens
// when it's inlined ;)
// yup inlining works real nice.

// time for a benchmark :)

// tested on amd x2 3800+
// for content values ranging from say 1000 to even a million this routine
appears
// to be faster than the div... kinda nice.

// general routine is slower... didn't bother to optimize it or so ;)
end;

Bye,
Skybuck.


From: Skybuck Flying on
Care to share your benchmark program ? :)

Such a big difference.

Maybe it's the way you benchmarked it ;)

Read the notes though:

It's only faster for content range 0 to 1 million or so.

Bye,
Skybuck.



From: Chuck Crayne on
On Thu, 24 Apr 2008 04:27:50 +0200
"Skybuck Flying" <BloodyShame(a)hotmail.com> wrote:

> Care to share your benchmark program ? :)

Care to post the generated asm code? Maybe we can speed it up even more.

--
Chuck
http://www.pacificsites.com/~ccrayne/charles.html


From: Skybuck Flying on
No problemo: :)

Project1.dpr.267: vResult := SimulatedDiv7v3( vIndex );
0040902D 8BC6 mov eax,esi
0040902F 33C9 xor ecx,ecx
00409031 83F807 cmp eax,$07
00409034 7611 jbe $00409047
00409036 8BD0 mov edx,eax
00409038 C1EA03 shr edx,$03
0040903B 83E007 and eax,$07
0040903E 03C2 add eax,edx
00409040 03CA add ecx,edx
00409042 83F807 cmp eax,$07
00409045 77EF jnbe $00409036
00409047 83F807 cmp eax,$07
0040904A 7501 jnz $0040904d
0040904C 41 inc ecx
0040904D 8BD9 mov ebx,ecx

Bye,
Skybuck.

"Chuck Crayne" <ccrayne(a)crayne.org> wrote in message
news:20080423205904.4166fff5(a)thor.crayne.org...
> On Thu, 24 Apr 2008 04:27:50 +0200
> "Skybuck Flying" <BloodyShame(a)hotmail.com> wrote:
>
>> Care to share your benchmark program ? :)
>
> Care to post the generated asm code? Maybe we can speed it up even more.
>
> --
> Chuck
> http://www.pacificsites.com/~ccrayne/charles.html
>
>