|
Prev: china discount wholesale D&G CA POLO ARMANI t-shirts, airmax87, airmaxTN, airmax LTD shoes
Next: Windows System Programming:Develope unique aspects of Windows
From: mike3 on 20 Apr 2008 16:18 On Apr 20, 1:43 am, "Ivan Vecerina" <_INVALID_use_webfo...(a)ivan.vecerina.com> wrote: > "mike3" <mike4...(a)yahoo.com> wrote in message <snip> > Assuming no truncation (because of adequate pre-allocation > of result digits), the same algorithm can be written as: > > for( unsigned aPos = 0 ; aPos<aLength ; ++aPos ) > { > Digit carry = 0; > TwoDigit const aDig = a.digits[aPos]; //NB: cast "hidden" here > for( unsigned bPos = 0 ; bPos<bLength ; ++bPos ) > { > TwoDigit mul = aDig * b.digits[bPos] > + this->digits[aPos+bPos] > + carry; > > this->digits[aPos+bPos] = mul & digitBitMask; > carry = ( mul >> digitBitCount ); > } > this->digits[aPos+bPos] = carry; > } > > There are many correct ways to write this, and some are > probably better is some ways than this example. But I > hope that you will find it useful. > In any case, given the very acceptable complexity of this loop, > I would not bother breaking it up or adding layers of > encapsulation. > Unfortunately, I noticed this last implementation where the cast is "hidden" seems to lose performance. My benchmark returned 17 seconds for this vs 12 for a similar routine where the only difference is the cast is not "hidden" by assigning the digit to a TwoDigit64. I'm wondering if maybe this is because the compiler has done a 32x32 multiply when using the one-liner with internal cast, and returns 64-bit result, but when you try to "hide" the cast it tries to do a 64x32 or even 64x64 multiply since it sees one 64-bit operand coming in, which wastes time. I guess it depends on the optimizer. In the one with the cast the optimizer can "see" that both operands are 32-bit, so it could "know" to only do 32x32->64 instead of 64x32->64 or 64x64->64 (mod mul by 2^64). Is this a good theory? Should I go with the casts?
From: Ivan Vecerina on 20 Apr 2008 16:39 "mike3" <mike4ty4(a)yahoo.com> wrote in message news:9cbb8777-2dbd-4a3b-8337-a08ba08cbde8(a)b5g2000pri.googlegroups.com... >Unfortunately, I noticed this last implementation where the >cast is "hidden" seems to lose performance. My benchmark >returned 17 seconds for this vs 12 for a similar routine where the >only difference is the cast is not "hidden" by assigning the digit >to a TwoDigit64. > >I'm wondering if maybe this is because the compiler has done >a 32x32 multiply when using the one-liner with internal cast, and >returns 64-bit result, but when you try to "hide" the cast it tries >to do a 64x32 or even 64x64 multiply since it sees one 64-bit >operand coming in, which wastes time. I guess it depends on the >optimizer. In the one with the cast the optimizer can "see" that >both operands are 32-bit, so it could "know" to only do 32x32->64 >instead of 64x32->64 or 64x64->64 (mod mul by 2^64). > >Is this a good theory? Should I go with the casts? These things are optimizer- and platform-dependent, but yes, a "widened" multiplication is the likely culprit. If you care about such low-level optimizations, I would recommend learning to look at the assembly code (it is not that difficult to get to a level where you can look and see what your compiler does / where overheads come from). Although in general it is always advisable to first look for better algorithms (as these micro-optimizations can be platform-specific, and become redundant or even counter-productive in the future). For instance, a 64-bit platform might be able to store the 64-bit word into a local register, and execute the code I posted faster. -- http://ivan.vecerina.com/contact/?subject=NG_POST <- email contact form Brainbench MVP for C++ <> http://www.brainbench.com
From: mike3 on 20 Apr 2008 16:46 On Apr 20, 2:39 pm, "Ivan Vecerina" <_INVALID_use_webfo...(a)ivan.vecerina.com> wrote: > "mike3" <mike4...(a)yahoo.com> wrote in message > > news:9cbb8777-2dbd-4a3b-8337-a08ba08cbde8(a)b5g2000pri.googlegroups.com... > > > > > > >Unfortunately, I noticed this last implementation where the > >cast is "hidden" seems to lose performance. My benchmark > >returned 17 seconds for this vs 12 for a similar routine where the > >only difference is the cast is not "hidden" by assigning the digit > >to a TwoDigit64. > > >I'm wondering if maybe this is because the compiler has done > >a 32x32 multiply when using the one-liner with internal cast, and > >returns 64-bit result, but when you try to "hide" the cast it tries > >to do a 64x32 or even 64x64 multiply since it sees one 64-bit > >operand coming in, which wastes time. I guess it depends on the > >optimizer. In the one with the cast the optimizer can "see" that > >both operands are 32-bit, so it could "know" to only do 32x32->64 > >instead of 64x32->64 or 64x64->64 (mod mul by 2^64). > > >Is this a good theory? Should I go with the casts? > > These things are optimizer- and platform-dependent, but > yes, a "widened" multiplication is the likely culprit. > If you care about such low-level optimizations, I would > recommend learning to look at the assembly code (it is not > that difficult to get to a level where you can look and > see what your compiler does / where overheads come from). > > Although in general it is always advisable to first look > for better algorithms (as these micro-optimizations can > be platform-specific, and become redundant or even > counter-productive in the future). For instance, a 64-bit > platform might be able to store the 64-bit word into > a local register, and execute the code I posted faster. > So what would be the best course of action in this case?
From: Ivan Vecerina on 21 Apr 2008 09:26 "mike3" <mike4ty4(a)yahoo.com> wrote in message news:b6849493-dc05-46b0-8778-793524945853(a)f24g2000prh.googlegroups.com... On Apr 20, 2:39 pm, "Ivan Vecerina" <_INVALID_use_webfo...(a)ivan.vecerina.com> wrote: :> Although in general it is always advisable to first look :> for better algorithms (as these micro-optimizations can :> be platform-specific, and become redundant or even :> counter-productive in the future). For instance, a 64-bit :> platform might be able to store the 64-bit word into :> a local register, and execute the code I posted faster. :> : :So what would be the best course of action in this case? If squeezing more performance matters, study literature and find a faster algorithm. From a vague memory of reading Knuth's TAOCP (was 20 years ago for me, as a teenager), I vaguely remember that there is a faster algorithm for multiplying multi-word integers, whose complexity is O( n^log2(3) ) = O( n^1.58 ) rather than O(n^2) as you do now. If your integers are large enough, this could be considered... ( quick google search -> for example http://en.wikipedia.org/wiki/Karatsuba_algorithm ). But large integer multiplications are commonly needed (math, cryptography, etc), and some have certainly been optimized in assembly language, or even SIMD-optimized. I'm not into this lately, but you will certainly find open source projects that do it. Or it might be possible to use the "Big Number" library subset from Intel's Integrated Performance Primitives. It depends... Good luck -Ivan -- http://ivan.vecerina.com/contact/?subject=NG_POST <- email contact form Brainbench MVP for C++ <> http://www.brainbench.com
From: mike3 on 21 Apr 2008 23:55
On Apr 21, 7:26 am, "Ivan Vecerina" <_INVALID_use_webfo...(a)ivan.vecerina.com> wrote: > "mike3" <mike4...(a)yahoo.com> wrote in message > > news:b6849493-dc05-46b0-8778-793524945853(a)f24g2000prh.googlegroups.com... > On Apr 20, 2:39 pm, "Ivan Vecerina"<_INVALID_use_webfo...(a)ivan.vecerina.com> wrote: > > :> Although in general it is always advisable to first look > :> for better algorithms (as these micro-optimizations can > :> be platform-specific, and become redundant or even > :> counter-productive in the future). For instance, a 64-bit > :> platform might be able to store the 64-bit word into > :> a local register, and execute the code I posted faster. > :> > : > :So what would be the best course of action in this case? > > If squeezing more performance matters, study literature and > find a faster algorithm. From a vague memory of reading > Knuth's TAOCP (was 20 years ago for me, as a teenager), > I vaguely remember that there is a faster algorithm for > multiplying multi-word integers, whose complexity is > O( n^log2(3) ) = O( n^1.58 ) rather than O(n^2) as you > do now. If your integers are large enough, this could > be considered... ( quick google search -> for examplehttp://en.wikipedia.org/wiki/Karatsuba_algorithm). > Ah, Knuth... the book I've dreamed of having for a long time but haven't been able to get due to a lack of money and access to an academic library. I only got one look at it, and another opportunity has not yet presented itself... :( Anyway, I've heard of these algorithms before, and even used it once. There's also FFT (Fast Fourier Transform) as well, which is even better at O(n log(n)). The problem though is these fast algorithms do not work so well for the relatively small number sizes I am considering here as they have more overhead. Here I only need a few hundred bits. If I need more, I could implement the faster method and then switch to it when the numbers exceed a certain size threshold, but for that small-size case, "grade school" like I'm using may actually be best. I've done some before in my attempts to write a program to calculate digits of pi to a huge amount (something like a hundred million or so), although I haven't been able to solve the disk I/O problems in that arena (and lack of source code to the really disk-efficient programs is not exactly a boon, you know), where the data to be worked with is so huge we have to store on disk. What I've heard though from those dabblings is that the best algorithm to use goes like this: 1. For little numbers, use Grade School. 2. For bigger numbers, use Karatsuba. 3. For really big numbers, use FFT. Since I'm in the regime of (1), I have to use (1) and if a typecast is needed to get my compiler to generate fast code, I'll do it. Either that or just recode the mult routine in ASM instead of C++. In ASM I can just get the 32x32->64 in one instruction. Although the X86's paucity of Registers is not exactly a nice thing when doing this stuff. I do actually have a 64-bit X86 processor in my machine which has more registers, but my development OS (Windows) is only 32 bits and hence can only make do with 32-bit X86. I do have 64-bit Solaris 10 UNIX though on my machine as well... but developing Windows programs under UNIX?! No way, esp. considering I wouldn't be able to execute the 64-bit code under Windows anyway... > But large integer multiplications are commonly needed > (math, cryptography, etc), and some have certainly been > optimized in assembly language, or even SIMD-optimized. > I'm not into this lately, but you will certainly find > open source projects that do it. Or it might be possible > to use the "Big Number" library subset from Intel's > Integrated Performance Primitives. It depends... > Hmm. I'm thinking of taking a look over GMP's ASM soruce code for X86 to see if maybe there might be some hint in that as to how to do a fast ASM-based mul. But for the generic "C++" code I think I'll just stay with the typecast in there. Although I guess it's not a good idea to use typecasts all over the place, this is *not* "all over the place", this is a very tiny low-level core part of the code. |