From: "Bill" don't want more on
Another feature I had in my primes GUI was a CListCtrl with two columns
showing all the primes and their index or sequence number. I had disabled
it, though, because I found it took a lot of time to add items to. I am
using the LVS_OWNERDATA style and handling the LVN_GETDISPINFO
notification to provide the display data, so scrolling up and down the
array is no problem. Is there a way, which I haven't been able to find,
that will let me just tell the control how many items it has, rather than
adding them one at a time? Adding them one at a time seems to be time
consuming, like adding one element at a time to a CArray. I really don't
need the control to do anything for me except know the number of items,
handle the scrolling, and ask me for the data it needs to display. (I'm
testing the timing now in debug mode by inserting the 200 million plus
items into the CListCtrl after loading them from disk, and I think it's
going to take longer than calculating the primes. I'm also trying in
release mode, but after several minutes it's still adding and I can't
break into the program to see the progress made so far.)
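
(For reference, the owner-data pattern described above looks roughly like
the sketch below; CListCtrl::SetItemCount is what tells a virtual list how
many rows exist without inserting them one at a time. The names CPrimesDlg,
m_list, m_primes, m_primeCount and IDC_PRIME_LIST are made up for
illustration.)

// Message map entry (wizard-generated in practice):
// ON_NOTIFY(LVN_GETDISPINFO, IDC_PRIME_LIST, &CPrimesDlg::OnGetDispInfoPrimeList)

void CPrimesDlg::ShowPrimes()
{
    // Nothing is inserted into the LVS_OWNERDATA list; the control just
    // remembers the count and asks for rows on demand as you scroll.
    m_list.SetItemCount(static_cast<int>(m_primeCount));
}

// The control asks for text only for the rows it is about to display.
void CPrimesDlg::OnGetDispInfoPrimeList(NMHDR* pNMHDR, LRESULT* pResult)
{
    NMLVDISPINFO* pInfo = reinterpret_cast<NMLVDISPINFO*>(pNMHDR);
    LVITEM& item = pInfo->item;
    if (item.mask & LVIF_TEXT)
    {
        if (item.iSubItem == 0)      // column 0: sequence number
            _stprintf_s(item.pszText, item.cchTextMax, _T("%d"), item.iItem + 1);
        else                         // column 1: the prime itself
            _stprintf_s(item.pszText, item.cchTextMax, _T("%u"), m_primes[item.iItem]);
    }
    *pResult = 0;
}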


"Joseph M. Newcomer" <newcomer(a)flounder.com> wrote in message
news:vro4q5t923qa256ej597fbkl88fg0dot6t(a)4ax.com...
> Sorry, because I can't see clearly, I think I sent a reply by hitting the
> wrong keystroke
> combination....
>
> What I had said was that allocation requires extra overhead. What I was
> about to say was
> that the most efficient representation would be a bit vector of bits that
> were 1 for a
> prime and 0 for a non-prime; this would require 2^32 bits, or 2^29 bytes.
> If you further
> apply RLE (run length encoding) to this bit vector, you get an even more
> compact
> representation which can be optimized by treating the special case of a
> 1-bit surrounded
> by a lot of 0 bits with a separate encoding. I would consider examining
> this over using a
> vector of primes.
> joe
>
> More below...
>
> On Thu, 18 Mar 2010 09:21:28 +0800, "Bill Brehm" <don't want spam> wrote:
>
>>"Someone one complained that they could only add about 5000
>>elements/second
>>to CArray"
>>
>>I am well aware of the growby parameter of SetSize() and have had it in my
>>program from the start. Yet CArray has an odd behavior in my program that
>>doesn't follow what one would expect knowing how the growby parameter is
>>supposed to make CArray work.
>>
>>I did some tests with my program using both debug mode and release mode.
>>Also I varied the growby parameter. Lastly, I can store the primes in a
>>CArray, a CList or a preallocated array of unsigned ints (needs just over
>>200,000,000 to fit the primes up to 2^32). I use the sieve algorithm for
>>all
>>timing results.
>>
>>Results for debug mode:
>>
>>int[] - around 43000 primes per second.
>>CList - around 39000 primes per second. The slightly lower speed is
>>probably
>>due to run time allocation of the memory for new elements.
>>CArray, growby 100,000 - starts at around 35000 pps, then tapers down to
>>16000 pps after 400,000 primes have been calculated. It suddenly drops to
>>8000 pps at around 500,000 primes and tapers down to 4000 pps around
>>1,000,000 primes calculated. One might expect it to slow down as it must
>>reallocate and copy an ever-growing array ten times during this test. I
>>don't understand the
>>sudden drop at 500,000. I know that displaying the results on the GUI
>>takes
>>a lot of time, so I only display every 1000th prime. The timing will
>>include
>>the reallocation time but only once in every 100,000 primes generated and
>>therefore only once in every 100 display updates.
>>CArray, growby 100 - starts at around 40000 pps, then tapers down to 8000
>>pps
>>after 500,000 primes have been calculated then tapers down to 2700 pps
>>around 1,000,000 primes calculated. Not surprising that it's slower with
>>more reallocations.
>>Here is where it gets weird.
>>CArray, growby 10,000,000 - starts at around 400 - 500 pps. It's too slow
>>to
>>wait for 1,000,000 primes to be generated to see the speed there. At
>>245,000
>>primes, the speed is the same. There is a longer delay at the start to
>>allocate the memory.
> ***
> In case the last message didn't make it, what I indicated here was that the
> first element
> involves adding 10,000,000 elements, which is going to badly skew your
> first average. If
> you add "1" or "3" outside the measurement interval to factor that startup
> transient out
> of your measurements you might get a more interesting (and less biased)
> result. It all
> depends on what you are trying to measure.
> ****
>>
>>The sieve size is 1,000,000 odd numbers which generates 100,000 to 150,000
>>primes in the first few passes. The sieve runs very quickly. The time
>>being
>>measured at such low speeds is basically the time to copy data into the
>>CArray.
>>
>>Results for release mode:
>>
>>int[], CList and CArray with growby 10,000,000 - around 57,000 pps from
>>start and still at 1,000,000 primes and even at 3,000,000 primes. CList
>>takes a long time to give back the memory when I remove all the elements
>>from it.
>>CArray with growby 100 - around 57,000 pps to start and tapers down to
>>13,000 at 1,000,000 primes. This is not surprising as there will be 10
>>reallocations for every time the timing update is displayed and so the
>>reallocation time is included in the timing calculation.
>>
>>So I would conclude that in release mode, everything works logically. In
>>debug mode, I might have blamed page faults as a reason to slow down with
>>a
>>large CArray growby, but then why doesn't it affect the int[], which is the
>>same size and allocated in one big chunk like the CArray is? I suspect
>>there
>>is something funny (too much overhead) in CArray in debug mode for large
>>growby sizes.
> ****
> There is a lot of initialization that goes on in debug mode that you don't
> get in release
> mode. This can result in a lot of paging and in a lot of data copies of
> the
> initialization values. A big blip in the radar. Measure the time it
> takes to add the
> first element to that huge growby array and you might want to factor that
> out.
> ****
>>
>>"But tha fundamental failure in thinking is thinking that the extra space
>>actually matters. It may not."
>>
>>It does in my case because I reach the 2GB limit before I reach the end of
>>the primes that I want to calculate up to 2^32. There are 200,000,000+
>>primes in that set and at 4 bytes per prime for CArray or int[], I can fit
>>in 800MB but with CList I would need 2.4GB at 12 bytes or 3.2GB at 16
>>bytes
>>per prime.
> ****
> Plus storage allocator overhead plus allocation quantum roundup, you may
> be allocating
> between 32 and 72 bytes per storage element. Allocation in debug mode is
> much larger.
> ****
>>
>>"It doesn't scale; note that modern encryption might use 1024, 2048 or
>>4096-bit primes"
>>
>>So you were referring to arbitrarily large numbers. When you said 'large
>>integer', I thought you might be referring to LARGE_INTEGER which is only
>>64
>>bits. Using the primes up to 2^32, I could use the sieve to calculate all
>>the primes up to LARGE_INTEGER. There just wouldn't be enough RAM or disk
>>space (on my PC) to hold them because there are somewhere between 10^15
>>and
>>10^16 of them.
> ****
> LARGE_INTEGER, which is only 64 bits, is too small for effective
> encryption. It is
> possible to crack 64-bit encryption in hours to days, making it
> ineffective. 256-bit to
> 4096-bit encryption is considered state-of-the-art for PKI today. Really
> secure
> encryption uses one-time pads generated by truly random processes (such as
> radioactive
> decay).
> ****
>>
>>"No, serialization is intended to replace the contents, so you are seeing
>>the specified
>>behavior. Appending was never considered as a part of the design. In
>>fact,
>>it would have
>>been considered a bug by the designers if serialization appended."
>>
>>In my case it IS appending. If I start out with one, two or three elements
>>in the CList, after I serialize from disk, those original elements are
>>still
>>in there. I don't know why but I think and agree with you that they should
>>not be.
> *****
> One of the reasons I don't trust serialization is that it has too many
> "unspecified"
> features. I can write serialization code in seconds that does what I
> want, and I know the
> specs.
> joe
> *****
>>
>>
>>
>>"Joseph M. Newcomer" <newcomer(a)flounder.com> wrote in message
>>news:fofvp5tejd5101f2pu7974u9upb2qmim9t(a)4ax.com...
>>> See below...
>>> On Tue, 16 Mar 2010 21:38:46 +0800, "Bill Brehm" <don't want spam>
>>> wrote:
>>>
>>>>"Your choice of CArray is probably a Bad Idea."
>>>>
>>>>It IS turning out to be a bad idea. This was one of the things I said I
>>>>would get around to asking. If I give a growby of 10000, the program
>>>>runs
>>>>at
>>>>a reasonable speed. When I set it much higher, like 100,000,000,
>>>>thinking
>>>>that I'll have fewer reallocations, it runs really slowly. I would have
>>>>expected the opposite until 1e8 primes were calculated, then a large
>>>>delay
>>>>during reallocation. But it's slow from the start.
>>> ***
>>> Someone once complained that they could only add about 5000
>>> elements/second
>>> to CArray. I
>>> tried the same experiment, and found that indeed that was the number I
>>> was
>>> getting, in
>>> Debug mode. They were adding 1.5M elements to the CArray. I set the
>>> expansion quantum to
>>> 500,000 and measured in Release mode (which is much faster) and got 15M
>>> elements/second.
>>> CArray without using SetSize to modify the quantum has horrendous
>>> performance,
>>> approximately exponential in time with the number of elements.
>>> std::vector is very close
>>> to linear (I did the measurements, but never published them).
>>> *****
>>>>I did some
>>>>PerformanceQueries and found that adding something to the CArray (when
>>>>it
>>>>grew by 10000) was ten times slower than doing a division as part of the
>>>>prime
>>>>check. I don't know if it's a paging issue or some inefficiency in
>>>>CArray
>>>>itself. I guess it's much worse with a large growby but I don't know why
>>>>and
>>>>I didn't measure to check.
>>> ****
>>> Add 1 element: allocate a new array of size+1; copy size elements from
>>> the
>>> old array, add
>>> the new element to the end. Exponential performance; the time goes up as
>>> approximately the
>>> square of the number of elements (do the arithmetic for the copy
>>> operation). Never
>>> believe ANYTHING you measure in Debug mode; it is irrelevant and will
>>> always be horrible.
>>> This is because there are bounds checks and new-storage-initialization
>>> that take time.
>>> Most paging performance is irrelevant unless your vector gets into the
>>> hundreds of
>>> megabytes in size, at which point you definitely have the wrong
>>> representation.
>>> ****
>>>>
>>>>Would the release version be much faster than the debug version (less
>>>>checking for bounds and other stuff)?
>>> ****
>>> Absolutely! See previous comment; you CANNOT rely on "Debug"
>>> measurements
>>> for true
>>> performance guarantees.
>>> ****
>>>>Is there a way (with task manager
>>>>maybe) to tell if the CArray is staying in memory or being swapped out
>>>>to
>>>>disk? Is there a way to force the CArray to stay in memory and not swap
>>>>out?
>>>>Any guess why it might be so slow?
>>> ****
>>> Watch for increases in paging faults using the perfmon tool. A
>>> substantial increase in
>>> paging frequency is a Bad Sign. Note that making your algorithm
>>> cache-aware can generally
>>> improve performance by 10X, but you need to know the cache parameters
>>> for
>>> each computer
>>> (which you can discover) to write code like this, and it doesn't always
>>> work due to the
>>> nature of the particular algorithm involved.
>>> ****
>>>>
>>>>"you have made a fundaemental failure in thinking here"
>>>>
>>>>Maybe so, but it turns out that the CList is much faster than the
>>>>CArray,
>>>>in
>>>>my program at least. Too bad it takes four times the memory and I run
>>>>out
>>>>before I can reach 2^32. Also, it takes much longer to clear out the
>>>>CList
>>>>than the CArray, which is logical since it's made up of many tiny parts
>>>>instead of a few large parts.
>>> ****
>>> But the fundamental failure in thinking is thinking that the extra space
>>> actually matters.
>>> It may not. Note that while adding to an array with SetSize(..., 0)
>>> (the
>>> default) is
>>> exponential in time, adding to a list is constant time, so burning the
>>> extra space would
>>> not matter. Unless you exceed the working set size, at which point, it
>>> becomes a paging
>>> problem, or if the elements are widely scattered in memory, or the
>>> storage
>>> allocator
>>> becomes the performance bottleneck (which it can). If you anticipate
>>> 1,000,000 elements,
>>> do a SetSize(0, 1000000) when the array is empty; then the array append
>>> is constant time
>>> up to 1,000,000 elements also, and at 1,000,001 there is a blip in the
>>> performance as the
>>> array is reallocated to size 2,000,000 and then time is constant again
>>> to
>>> 2,000,000. IF
>>> you are not doing the SetSize, then your assumptions are all broken and
>>> you
>>> will get
>>> exponential cost. So you cannot
>>> (a) predict performance if you are in Debug mode
>>> (b) predict performance if you have not forced preallocation (SetSize)
>>> which is handled "by magic" in std::vector
>>> You have apparently committed both sins here, so your choices don't have
>>> a
>>> sound basis.
>>> joe
>>> ****
>>>>
>>>>I found another issue with CList which I think I would consider a bug.
>>>>When
>>>>I Serialize back from disk into a CArray, anything in the array is wiped
>>>>out
>>>>and overwritten. When I do that into the CList, the disk data is
>>>>appended
>>>>to
>>>>the end of what is in the list. I didn't see that described in the help
>>>>but
>>>>I also didn't find any mention of this "bug" in a quick Google search.
>>> ****
>>> No, serialization is intended to replace the contents, so you are seeing
>>> the specified
>>> behavior. Appending was never considered as a part of the design. In
>>> fact, it would have
>>> been considered a bug by the designers if serialization appended.
>>> ****
>>>>
>>>>"I wouldn't touch MFC serialization as far as I could throw a
>>>>mainframe."
>>>>
>>>>Ha ha, how far can you throw a mainframe? I am learning to agree with
>>>>you
>>>>on
>>>>this. I don't think I ever used it before. I started to use it this time
>>>>because it is much faster and uses less space than saving and loading in
>>>>text format. But if it's not trustworthy then it's not worth it.
>>> ****
>>> Given that I'm actually physically handicapped, and probably couldn't
>>> throw a mere laptop
>>> very far, throwing a 1-ton mainframe isn't likely. Probably measured in
>>> nanometers. The
>>> real problem is schema evolution, a problem I had to address when I
>>> designed what is a lot
>>> like XML back in 1977 (except our design was far better). Today, I use
>>> XML for my data
>>> representation.
>>> *****
>>>>
>>>>"Note that most prime algorithms are variants of the Seive of
>>>>Eastosthenes"
>>>>
>>>>Yes, I'm using that. My interest in playing with primes was triggered by
>>>>trying to learn about encryption and modular arithmetic used in RSA. But
>>>>my
>>>>first program ever was a prime number generator running on some HP Basic
>>>>computer back in the early 70's, so it's interesting to me to bring that
>>>>up
>>>>to date. My program also can run the old way by dividing by primes until
>>>>the
>>>>sqrt of the number under test. I think my original Basic program divided
>>>>by
>>>>all odd numbers because there wasn't enough memory to store the array of
>>>>primes on that old computer.
>>> ****
>>> Note that RSA wants to use massive prime numbers (which are almost
>>> impossible to factor).
>>> I have been talking with a colleague who is an encryption guy and we are
>>> talking about
>>> developing a course in encryption when we would use RSA-8 (8-bit primes)
>>> so students could
>>> work out the encryption with just a calculator or even pencil and paper
>>> and then show why
>>> short keys are vulnerable by asking them to work out encrypted data
>>> (since
>>> 8-bit primes
>>> would be easy to crack), to give some real feel for the algorithms. He
>>> has taken a course
>>> from Ron Rivest (the R of RSA) and is really into this stuff.
>>> ****
>>>>
>>>>Actually the sieve technique should scale to large integers (if you mean
>>>>the
>>>>64 bit type and not arbitrarily large) once one has an array of primes
>>>>to
>>>>2^32 and if the sieving is done in segments.
>>> ****
>>> It doesn't scale; note that modern encryption might use 1024, 2048 or
>>> 4096-bit primes, and
>>> factoring these with modern computers and modern algorithms takes
>>> intervals
>>> such that the
>>> Sun is likely to expand into a Red Giant and engulf the Earth before the
>>> algorithm
>>> terminates.
>>> joe
>>> ****
>>>>
>>>>"Joseph M. Newcomer" <newcomer(a)flounder.com> wrote in message
>>>>news:b0htp5l4t2bi28po0b6gfqbfkctt27up6a(a)4ax.com...
>>>>> See below...
>>>>> On Mon, 15 Mar 2010 23:06:24 +0800, "Bill Brehm" <don't want spam>
>>>>> wrote:
>>>>>
>>>>>>I am working on a program to calculate primes. It's just for fun, not
>>>>>>for
>>>>>>work or school, if that matters here. I'm trying to optimize it and
>>>>>>I'll
>>>>>>have some more questions about that later.
>>>>>>
>>>>>>Today I changed the CArray used to hold the primes to a CList based on
>>>>>>a
>>>>>>compile time selection, figuring I would save the time of reallocating
>>>>>>and
>>>>>>copying the CArray every time I went over the maximum allocated size.
>>>>>>I'm
>>>>>>aware of the pros and cons of CArray and CList in general.
>>>>> ****
>>>>> Your choice of CArray is probably a Bad Idea. First, you should
>>>>> SetSize
>>>>> and set an
>>>>> explicit reallocation quantum (the second parameter of SetSize) to
>>>>> minimize the amount of
>>>>> reallocation that is done. Or use std::vector which does this
>>>>> implicitly,
>>>>> and has far,
>>>>> far better performance than CArray.
>>>>> ****
>>>>>>A disadvantage is
>>>>>>that the CList of primes takes (about) four times as much memory as the
>>>>>>CArray holding the same number of primes. The primes are 32 bit ints
>>>>>>so
>>>>>>I
>>>>>>don't know why it's four times instead of three times (previous
>>>>>>pointer,
>>>>>>prime and next pointer). What else is in there?
>>>>> ****
>>>>> you have made a fundamental failure in thinking here. You think, for
>>>>> example, that the
>>>>> size of any element matters, which is the wrong way to think about the
>>>>> problem. What
>>>>> matters is the memory footprint, and the induced paging, which is
>>>>> potentially far worse in
>>>>> a CList. Note that most prime algorithms are variants of the Sieve of
>>>>> Eratosthenes, which
>>>>> means they don't scale to large integers (such as are used for
>>>>> encryption)
>>>>> and you can find
>>>>> enumerations of all the primes in the range of 0..2**32-1 on the Web
>>>>> and
>>>>> in reference
>>>>> books. Lists are nearly always less efficient than arrays due to
>>>>> locality-of-reference
>>>>> issues that affect cache behavior and paging.
>>>>> *****
>>>>>>
>>>>>>I tried something that I wasn't sure would work but it did. I
>>>>>>serialized
>>>>>>a
>>>>>>CArray to disk and then later I serialized it from disk to a CList. It
>>>>>>worked. It's not a total surprise, because it wouldn't make sense to
>>>>>>store
>>>>>>the next and previous CList pointers into a serialization archive. My
>>>>>>question
>>>>>>here is, does anyone know if these were made intentionally compatible
>>>>>>and
>>>>>>if
>>>>>>I can count on that working in all cases?
>>>>> ****
>>>>> I wouldn't touch MFC serialization as far as I could throw a
>>>>> mainframe.
>>>>> I
>>>>> would not use
>>>>> it for anything I cared about.
>>>>> joe
>>>>> ****
>>>>>>
>>>>>>Thanks...
>>>>>>
>>>>> Joseph M. Newcomer [MVP]
>>>>> email: newcomer(a)flounder.com
>>>>> Web: http://www.flounder.com
>>>>> MVP Tips: http://www.flounder.com/mvp_tips.htm
>>>>
>>> Joseph M. Newcomer [MVP]
>>> email: newcomer(a)flounder.com
>>> Web: http://www.flounder.com
>>> MVP Tips: http://www.flounder.com/mvp_tips.htm
>>
> Joseph M. Newcomer [MVP]
> email: newcomer(a)flounder.com
> Web: http://www.flounder.com
> MVP Tips: http://www.flounder.com/mvp_tips.htm
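
(To make the SetSize advice quoted above concrete, here is a minimal
sketch; the 1,000,000 growby quantum, the 20,000,000 limit and the
IsPrime() helper are placeholders, not anything from the posts above.)

#include <afxtempl.h>   // CArray

extern bool IsPrime(UINT n);   // stand-in for the real sieve or trial division

void CollectPrimes(CArray<UINT, UINT>& primes)
{
    // Size 0 but a large growby quantum: every reallocation grabs room for
    // 1,000,000 more elements, so Add() stops reallocating per element.
    primes.SetSize(0, 1000000);

    primes.Add(2);
    for (UINT n = 3; n < 20000000; n += 2)
    {
        if (IsPrime(n))
            primes.Add(n);
    }
}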


From: Woody on
Bill wrote:
> There was one more question related to speed that I haven't asked yet. When
> the program is running full speed, the task manager shows it's only using
> 49 - 51 percent of the CPU. The idle process is using most of the rest.
>... My processor does have two CPUs, but the affinity is set to both and
> all the CPU usage numbers add up to 100%, not 200%, so I believe the 50% is
> not due to only running on one of the two processors.

I think you are seeing your calculations being done single-threaded
(i.e., on only one of the cores). You might want to modify your pgm to be
multi-threaded (2 worker threads, for example). It is an interesting
question how to divide the work when there is a single bit vector that
both threads need to access.

As far as the memory, using a bit vector should only require 2^32/8/2
bytes, since you know there are no even primes >2.
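
(As a concrete illustration of the layout Woody describes, one bit per odd
number, which for the full range up to 2^32 is 2^32/8/2 = 256 MB. This is
only a sketch under that assumption; the class and member names are
invented.)

#include <vector>

// One bit per odd number >= 3; even numbers are never stored, so covering
// every 32-bit value takes 2^31 bits = 256 MB.
class OddBitVector
{
public:
    explicit OddBitVector(unsigned long long limit)
        : m_bits(static_cast<size_t>(limit / 2 / 32 + 1), 0) {}

    // n must be odd and >= 3
    void Set(unsigned long long n)
    {
        m_bits[static_cast<size_t>(Index(n) / 32)] |= 1u << (Index(n) % 32);
    }
    bool Test(unsigned long long n) const
    {
        return (m_bits[static_cast<size_t>(Index(n) / 32)] >> (Index(n) % 32)) & 1u;
    }

private:
    static unsigned long long Index(unsigned long long n) { return (n - 3) / 2; }
    std::vector<unsigned int> m_bits;
};
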
From: "Bill" don't want more on
Woody,

"only require 2^32/8/2"

You are right. I thought of that after sending when I was thinking about what to
do next in the project, but didn't think it worth making another post to
correct.

About the 50%: if you are right, then it's already running at maximum. I
should find a quad core and see if it says 25% only. But if you are right,
then why wouldn't all the numbers add up to 200% instead of 100%?

I just tried running two copies. At first they started out sharing the ~50%,
but eventually they settled out sharing ~100%. Four copies settled out to
sharing 100% too - about 25% each. But system idle time went down to zero,
so the total still adds up to 100%, not 100% for each core.

I'm not sure what to conclude.

Bill

"Woody" <ols6000(a)sbcglobal.net> wrote in message
news:5c3c5b21-0bcb-4588-8da9-459ff368e082(a)z35g2000yqd.googlegroups.com...
> Bill wrote:
>> There was one more question related to speed that I haven't asked yet.
>> When
>> the program is running full speed, the task manager shows it's only using
>> 49 - 51 percent of the CPU. The idle process is using most of the rest.
>>... My processor does have two CPUs, but the affinity is set to both and
>> all the CPU usage numbers add up to 100%, not 200%, so I believe the 50%
>> is
>> not due to only running on one of the two processors.
>
> I think you are seeing your calculations being done single-threaded
> (i.e., on only one of the cores). You might want to modify your pgm to be
> multi-threaded (2 worker threads, for example). It is an interesting
> question how to divide the work when there is a single bit vector that
> both threads need to access.
>
> As far as the memory, using a bit vector should only require 2^32/8/2
> bytes, since you know there are no even primes >2.


From: Ismo Salonen on
--snip--

The total system utilization can only be up to 100% regardless of cores.
That's the way the system defines it. On a quad core, one core delivers
25% of the performance.

ismo
From: Joseph M. Newcomer on
There are two ways to handle this:

locking the entire structure (the naive approach)

Use the InterlockedOr operation to set a bit in a single DWORD in a multiprocessor-safe
fashion.

I'd use the latter method because it doesn't require any thread blocking and is handled by
the hardware. So once I figure out the DWORD in the array that needs to be changed (and
it has to be DWORD-aligned) and come up with the bit mask for the bit within that DWORD,
the InterlockedOr will safely allow multiple threads to set bits in the same DWORD without
any problem.
joe
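
(A minimal sketch of that second approach; g_bits, AllocateBits and MarkBit
are invented names, not from the post above.)

#include <windows.h>

// Shared sieve bit vector, stored as 32-bit LONGs so every bit lives in a
// naturally DWORD-aligned word.
static volatile LONG* g_bits = NULL;

void AllocateBits(unsigned int bitCount)
{
    // Allocate and zero once, before the worker threads start.
    g_bits = new LONG[(bitCount + 31) / 32]();
}

// Any worker thread may call this concurrently: InterlockedOr performs the
// read-modify-write of the containing DWORD atomically, so no lock or
// thread blocking is needed.
inline void MarkBit(unsigned int i)
{
    InterlockedOr(&g_bits[i / 32], (LONG)(1u << (i % 32)));
}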

On Fri, 19 Mar 2010 00:25:09 -0700 (PDT), Woody <ols6000(a)sbcglobal.net> wrote:

>Bill wrote:
>> There was one more question related to speed that I haven't asked yet. When
>> the program is running full speed, the task manager shows it's only using
>> 49 - 51 percent of the CPU. The idle process is using most of the rest.
>>... My processor does have two CPUs, but the affinity is set to both and
>> all the CPU usage numbers add up to 100%, not 200%, so I believe the 50% is
>> not due to only running on one of the two processors.
>
>I think you are seeing your calculations being done single-threaded
>(i.e., on only one of the cores). You might want to modify your pgm to be
>multi-threaded (2 worker threads, for example). It is an interesting
>question how to divide the work when there is a single bit vector that
>both threads need to access.
>
>As far as the memory, using a bit vector should only require 2^32/8/2
>bytes, since you know there are no even primes >2.
Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm