From: mike3 on
On Apr 20, 1:43 am, "Ivan Vecerina"
<_INVALID_use_webfo...(a)ivan.vecerina.com> wrote:
> "mike3" <mike4...(a)yahoo.com> wrote in message
<snip>
> Assuming no truncation (because of adequate pre-allocation
> of result digits), the same algorithm can be written as:
>
>  for( unsigned aPos = 0 ; aPos<aLength ; ++aPos )
>  {
>     Digit carry = 0;
>     TwoDigit const aDig = a.digits[aPos]; //NB: cast "hidden" here
>     for( unsigned bPos = 0 ; bPos<bLength ; ++bPos )
>     {
>         TwoDigit mul = aDig * b.digits[bPos]
>                      + this->digits[aPos+bPos]
>                      + carry;
>
>         this->digits[aPos+bPos] =   mul &  digitBitMask;
>         carry                   = ( mul >> digitBitCount );
>     }
>     this->digits[aPos+bPos] = carry;
>  }
>
> There are many correct ways to write this, and some are
> probably better is some ways than this example.  But I
> hope that you will find it useful.
> In any case, given the very acceptable complexity of this loop,
> I would not bother breaking it up or adding layers of
> encapsulation.
>

Unfortunately, I noticed this last implementation where the
cast is "hidden" seems to lose performance. My benchmark
returned 17 seconds for this vs 12 for a similar routine where the
only difference is the cast is not "hidden" by assigning the digit
to a TwoDigit64.

I'm wondering if maybe this is because the compiler has done
a 32x32 multiply when using the one-liner with internal cast, and
returns 64-bit result, but when you try to "hide" the cast it tries
to do a 64x32 or even 64x64 multiply since it sees one 64-bit
operand coming in, which wastes time. I guess it depends on the
optimizer. In the one with the cast the optimizer can "see" that
both operands are 32-bit, so it could "know" to only do 32x32->64
instead of 64x32->64 or 64x64->64 (mod mul by 2^64).

Is this a good theory? Should I go with the casts?
From: Ivan Vecerina on
"mike3" <mike4ty4(a)yahoo.com> wrote in message
news:9cbb8777-2dbd-4a3b-8337-a08ba08cbde8(a)b5g2000pri.googlegroups.com...
>Unfortunately, I noticed this last implementation where the
>cast is "hidden" seems to lose performance. My benchmark
>returned 17 seconds for this vs 12 for a similar routine where the
>only difference is the cast is not "hidden" by assigning the digit
>to a TwoDigit64.
>
>I'm wondering if maybe this is because the compiler has done
>a 32x32 multiply when using the one-liner with internal cast, and
>returns 64-bit result, but when you try to "hide" the cast it tries
>to do a 64x32 or even 64x64 multiply since it sees one 64-bit
>operand coming in, which wastes time. I guess it depends on the
>optimizer. In the one with the cast the optimizer can "see" that
>both operands are 32-bit, so it could "know" to only do 32x32->64
>instead of 64x32->64 or 64x64->64 (mod mul by 2^64).
>
>Is this a good theory? Should I go with the casts?

These things are optimizer- and platform-dependent, but
yes, a "widened" multiplication is the likely culprit.
If you care about such low-level optimizations, I would
recommend learning to look at the assembly code (it is not
that difficult to get to a level where you can look and
see what your compiler does / where overheads come from).

Although in general it is always advisable to first look
for better algorithms (as these micro-optimizations can
be platform-specific, and become redundant or even
counter-productive in the future). For instance, a 64-bit
platform might be able to store the 64-bit word into
a local register, and execute the code I posted faster.


--
http://ivan.vecerina.com/contact/?subject=NG_POST <- email contact form
Brainbench MVP for C++ <> http://www.brainbench.com

From: mike3 on
On Apr 20, 2:39 pm, "Ivan Vecerina"
<_INVALID_use_webfo...(a)ivan.vecerina.com> wrote:
> "mike3" <mike4...(a)yahoo.com> wrote in message
>
> news:9cbb8777-2dbd-4a3b-8337-a08ba08cbde8(a)b5g2000pri.googlegroups.com...
>
>
>
>
>
> >Unfortunately, I noticed this last implementation where the
> >cast is "hidden" seems to lose performance. My benchmark
> >returned 17 seconds for this vs 12 for a similar routine where the
> >only difference is the cast is not "hidden" by assigning the digit
> >to a TwoDigit64.
>
> >I'm wondering if maybe this is because the compiler has done
> >a 32x32 multiply when using the one-liner with internal cast, and
> >returns 64-bit result, but when you try to "hide" the cast it tries
> >to do a 64x32 or even 64x64 multiply since it sees one 64-bit
> >operand coming in, which wastes time. I guess it depends on the
> >optimizer. In the one with the cast the optimizer can "see" that
> >both operands are 32-bit, so it could "know" to only do 32x32->64
> >instead of 64x32->64 or 64x64->64 (mod mul by 2^64).
>
> >Is this a good theory? Should I go with the casts?
>
> These things are optimizer- and platform-dependent, but
> yes, a "widened" multiplication is the likely culprit.
> If you care about such low-level optimizations, I would
> recommend learning to look at the assembly code (it is not
> that difficult to get to a level where you can look and
> see what your compiler does / where overheads come from).
>
> Although in general it is always advisable to first look
> for better algorithms (as these micro-optimizations can
> be platform-specific, and become redundant or even
> counter-productive in the future). For instance, a 64-bit
> platform might be able to store the 64-bit word into
> a local register, and execute the code I posted faster.
>

So what would be the best course of action in this case?
From: Ivan Vecerina on
"mike3" <mike4ty4(a)yahoo.com> wrote in message
news:b6849493-dc05-46b0-8778-793524945853(a)f24g2000prh.googlegroups.com...
On Apr 20, 2:39 pm, "Ivan Vecerina"
<_INVALID_use_webfo...(a)ivan.vecerina.com> wrote:
:> Although in general it is always advisable to first look
:> for better algorithms (as these micro-optimizations can
:> be platform-specific, and become redundant or even
:> counter-productive in the future). For instance, a 64-bit
:> platform might be able to store the 64-bit word into
:> a local register, and execute the code I posted faster.
:>
:
:So what would be the best course of action in this case?

If squeezing more performance matters, study literature and
find a faster algorithm. From a vague memory of reading
Knuth's TAOCP (was 20 years ago for me, as a teenager),
I vaguely remember that there is a faster algorithm for
multiplying multi-word integers, whose complexity is
O( n^log2(3) ) = O( n^1.58 ) rather than O(n^2) as you
do now. If your integers are large enough, this could
be considered... ( quick google search -> for example
http://en.wikipedia.org/wiki/Karatsuba_algorithm ).

But large integer multiplications are commonly needed
(math, cryptography, etc), and some have certainly been
optimized in assembly language, or even SIMD-optimized.
I'm not into this lately, but you will certainly find
open source projects that do it. Or it might be possible
to use the "Big Number" library subset from Intel's
Integrated Performance Primitives. It depends...

Good luck -Ivan

--
http://ivan.vecerina.com/contact/?subject=NG_POST <- email contact form
Brainbench MVP for C++ <> http://www.brainbench.com

From: mike3 on
On Apr 21, 7:26 am, "Ivan Vecerina"
<_INVALID_use_webfo...(a)ivan.vecerina.com> wrote:
> "mike3" <mike4...(a)yahoo.com> wrote in message
>
> news:b6849493-dc05-46b0-8778-793524945853(a)f24g2000prh.googlegroups.com...
> On Apr 20, 2:39 pm, "Ivan Vecerina"<_INVALID_use_webfo...(a)ivan.vecerina.com> wrote:
>
> :> Although in general it is always advisable to first look
> :> for better algorithms (as these micro-optimizations can
> :> be platform-specific, and become redundant or even
> :> counter-productive in the future). For instance, a 64-bit
> :> platform might be able to store the 64-bit word into
> :> a local register, and execute the code I posted faster.
> :>
> :
> :So what would be the best course of action in this case?
>
> If squeezing more performance matters, study literature and
> find a faster algorithm. From a vague memory of reading
> Knuth's TAOCP (was 20 years ago for me, as a teenager),
> I vaguely remember that there is a faster algorithm for
> multiplying multi-word integers, whose complexity is
> O( n^log2(3) ) = O( n^1.58 )   rather than O(n^2) as you
> do now.  If your integers are large enough, this could
> be considered... ( quick google search -> for examplehttp://en.wikipedia.org/wiki/Karatsuba_algorithm).
>

Ah, Knuth... the book I've dreamed of having for a long time
but haven't been able to get due to a lack of money and
access to an academic library. I only got one look at it,
and another opportunity has not yet presented itself... :(

Anyway, I've heard of these algorithms before, and even used
it once. There's also FFT (Fast Fourier Transform) as well,
which is even better at O(n log(n)). The problem though is
these fast algorithms do not work so well for the relatively
small number sizes I am considering here as they have more
overhead. Here I only need a few hundred bits. If I need
more, I could implement the faster method and then switch to
it when the numbers exceed a certain size threshold, but
for that small-size case, "grade school" like I'm using may
actually be best.

I've done some before in my attempts to write a program
to calculate digits of pi to a huge amount (something like
a hundred million or so), although I haven't been able to solve
the disk I/O problems in that arena (and lack of source code
to the really disk-efficient programs is not exactly a boon,
you know), where the data to be worked with is so huge we
have to store on disk.

What I've heard though from those dabblings is that the
best algorithm to use goes like this:

1. For little numbers, use Grade School.

2. For bigger numbers, use Karatsuba.

3. For really big numbers, use FFT.

Since I'm in the regime of (1), I have to use (1) and if a
typecast is needed to get my compiler to generate fast code,
I'll do it. Either that or just recode the mult routine in
ASM instead of C++. In ASM I can just get the 32x32->64 in
one instruction. Although the X86's paucity of Registers is
not exactly a nice thing when doing this stuff. I do actually
have a 64-bit X86 processor in my machine which has more
registers, but my development OS (Windows) is only 32 bits
and hence can only make do with 32-bit X86. I do have 64-bit
Solaris 10 UNIX though on my machine as well... but developing
Windows programs under UNIX?! No way, esp. considering I
wouldn't be able to execute the 64-bit code under Windows
anyway...

> But large integer multiplications are commonly needed
> (math, cryptography, etc), and some have certainly been
> optimized in assembly language, or even SIMD-optimized.
> I'm not into this lately, but you will certainly find
> open source projects that do it.  Or it might be possible
> to use the "Big Number" library subset from Intel's
> Integrated Performance Primitives. It depends...
>

Hmm. I'm thinking of taking a look over GMP's ASM soruce code
for X86 to see if maybe there might be some hint in that as
to how to do a fast ASM-based mul.

But for the generic "C++" code I think I'll just stay with the
typecast in there. Although I guess it's not a good idea to use
typecasts all over the place, this is *not* "all over the place",
this is a very tiny low-level core part of the code.