From: Hector Santos on
Attached "version 2" of Testpeter2t.cpp, with command line help and
more options to play with different scenarios without recompiling.

testpeter2t /?

testpeter2t [options]

/t   - start 2 threads to test
/t:n - start N threads to test
/s:# - # of DWORDs in array; the default creates a ~1.4 GB block
/r:# - repeat the memory-reader loop # times (default 10)

Running with no switches starts a single main-thread process test.
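
A sketch of how switches like these might be parsed (the option names
match the help above, but this parsing code is illustrative, not the
actual Testpeter2t source):

    #include <windows.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    DWORD size    = 357913941;  // # of DWORDs, ~1.4 GB default
    int   repeat  = 10;         // reader-loop repeat count
    int   threads = 0;          // 0 = single main-thread test

    void ParseSwitches(int argc, char *argv[])
    {
        for (int i = 1; i < argc; i++) {
            if (_strnicmp(argv[i], "/t", 2) == 0)
                threads = (argv[i][2] == ':') ? atoi(argv[i] + 3) : 2;
            else if (_strnicmp(argv[i], "/s:", 3) == 0)
                size = (DWORD)atol(argv[i] + 3);
            else if (_strnicmp(argv[i], "/r:", 3) == 0)
                repeat = atoi(argv[i] + 3);
        }
    }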

Example: start 8 threads with a ~390 MB array

Testpeter2t /t:8 /s:100000000

Example result on a DUAL CORE 2GB Windows XP box:

- size : 100000000
- memory : 400000000 (390625K)
- repeat : 10
* Starting threads
- Creating thread 0
- Creating thread 1
- Creating thread 2
- Creating thread 3
- Creating thread 4
- Creating thread 5
- Creating thread 6
- Creating thread 7
* Resuming threads
- Resuming thread# 0 [000007DC] in 41 msecs.
- Resuming thread# 1 [000007F4] in 467 msecs.
- Resuming thread# 2 [000007D8] in 334 msecs.
- Resuming thread# 3 [000007D4] in 500 msecs.
- Resuming thread# 4 [000007D0] in 169 msecs.
- Resuming thread# 5 [000007CC] in 724 msecs.
- Resuming thread# 6 [000007C8] in 478 msecs.
- Resuming thread# 7 [000007C4] in 358 msecs.
* Wait For Thread Completion
* Done
---------------------------------------
0 | Time: 10687 | Elapsed: 0 | Len: 0
1 | Time: 11157 | Elapsed: 0 | Len: 0
2 | Time: 11922 | Elapsed: 0 | Len: 0
3 | Time: 11984 | Elapsed: 0 | Len: 0
4 | Time: 12125 | Elapsed: 0 | Len: 0
5 | Time: 12000 | Elapsed: 0 | Len: 0
6 | Time: 11438 | Elapsed: 0 | Len: 0
7 | Time: 11313 | Elapsed: 0 | Len: 0
---------------------------------------
Total Time: 92626


--
HLS


Hector Santos wrote:

> Peter Olcott wrote:
>
>> I have an application that uses enormous amounts of RAM in a
>> very memory bandwidth intensive way. I recently upgraded my
>> hardware to a machine with 600% faster RAM and 32-fold more
>> L3 cache. This L3 cache is also twice as fast as the prior
>> machine's cache. When I benchmarked my application across the
>> two machines, I gained an 800% improvement in wall clock
>> time. The new machine's CPU is only 11% faster than the prior
>> machine. Both processes were tested on a single CPU.
>>
>> I am thinking that all of the above would tend to show that
>> my process is very memory bandwidth intensive, and thus
>> could not benefit from multiple threads on the same machine
>> because the bottleneck is memory bandwidth rather than CPU
>> cycles. Is this analysis correct?
>
> As stated numerous times, your thinking is wrong. I don't fault you,
> because you don't have the experience here, but you should not be
> ignoring what EXPERTS are telling you - especially if you have never
> written multi-threaded applications.
>
> The attached C/C++ simulation (testpeter2t.cpp) illustrates how your
> single main-thread process, with its HUGE redundant memory access
> requirement, is not optimized for a multi-core/processor machine or
> for any kind of scalability and performance efficiency.
>
> Compile the attached application.
>
> TestPeter2T.CPP will allow you to test:
>
> Test #1 - a single main-thread process
> Test #2 - a multi-threaded (2 threads) process
>
> To run the single thread process, just run the EXE with no switches:
>
> Here is TEST #1
>
> V:\wc5beta> testpeter2t
>
> - size : 357913941
> - memory : 1431655764 (1398101K)
> - repeat : 10
> ---------------------------------------
> Time: 12297 | Elapsed: 0 | Len: 0
> ---------------------------------------
> Total Client Time: 12297
>
> The source code is set to allocate a DWORD array with a total memory
> block of 1.4 GB. I have a 2GB XP Dual Core Intel box. It should show
> 50% CPU (one of the two cores fully busy).
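>
> (For scale: 357,913,941 DWORDs x 4 bytes/DWORD = 1,431,655,764 bytes,
> the "memory" figure in the banner, i.e. ~1.4 GB. A sketch of the
> allocation, assuming a plain array; the actual source may differ:)
>
>     DWORD *data = new DWORD[size];  // size = 357913941 DWORDs
>     // commits size * sizeof(DWORD) = ~1.4 GB of virtual memory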
>
> Now this single-process test provides the natural quantum scenario
> with a ProcessData() function:
>
> void ProcessData()
> {
>     KIND num;                             // KIND is a typedef in the source
>     for (int r = 0; r < repeat; r++)      // repeat the full pass
>         for (DWORD i = 0; i < size; i++)  // touch every DWORD
>             num = data[i];                // pure memory read, no other work
> }
>
> By natural quantum, I mean there are NO "man-made" interrupts, sleeps,
> or yields. The OS preempts the loop naturally, every quantum.
>
> If you run TWO single-process instances like so:
>
> start testpeter2T
> start testpeter2T
>
> On my machine it seriously degraded BOTH processes because of the HUGE
> virtual memory and paging requirements. The page faults were really
> HIGH and it just never completed; I didn't wish to wait, because it
> was TOO obviously not optimized for multiple instances. The memory
> load requirement was too high here.
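>
> The arithmetic explains the thrashing: each instance commits its own
> copy of the array, so two instances demand 2 x 1.4 GB = 2.8 GB on a
> 2 GB box, and the pager has to make up the difference. The threaded
> run below shares a single 1.4 GB block, which fits.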
>
> Now comes test #2 with threads. Run the EXE with the /t switch; this
> will start TWO threads. Here are the results:
>
> - size : 357913941
> - memory : 1431655764 (1398101K)
> - repeat : 10
> * Starting threads
> - Creating thread 0
> - Creating thread 1
> * Resuming threads
> - Resuming thread# 0 [000007DC] in 41 msecs.
> - Resuming thread# 1 [000007F4] in 467 msecs.
> * Wait For Thread Completion
> * Done
> ---------------------------------------
> 0 | Time: 13500 | Elapsed: 0 | Len: 0
> 1 | Time: 13016 | Elapsed: 0 | Len: 0
> ---------------------------------------
> Total Time: 26516
>
> BEHOLD!! Scalability using a SHARED MEMORY ACCESS threaded design.
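>
> (The "Creating/Resuming" lines come from a pattern like the following:
> the threads are created suspended, then each is released after a
> random delay. A minimal sketch reusing the ProcessData() above; the
> actual testpeter2t.cpp details may differ:)
>
>     #include <windows.h>
>     #include <stdio.h>
>     #include <stdlib.h>
>
>     #define NUM_THREADS 2                 // # of threads
>
>     DWORD WINAPI ThreadProc(LPVOID p)
>     {
>         DWORD t1 = GetTickCount();
>         ProcessData();                    // read the shared array
>         printf("%d | Time: %lu\n",
>                (int)(INT_PTR)p, GetTickCount() - t1);
>         return 0;
>     }
>
>     void RunThreads()
>     {
>         HANDLE h[NUM_THREADS];
>         for (int i = 0; i < NUM_THREADS; i++) {
>             printf("- Creating thread %d\n", i);
>             h[i] = CreateThread(NULL, 0, ThreadProc,
>                                 (LPVOID)(INT_PTR)i,
>                                 CREATE_SUSPENDED, NULL);
>         }
>         for (int i = 0; i < NUM_THREADS; i++) {
>             int ms = rand() % 1000;       // stagger the starts
>             printf("- Resuming thread# %d in %d msecs.\n", i, ms);
>             Sleep(ms);
>             ResumeThread(h[i]);
>         }
>         WaitForMultipleObjects(NUM_THREADS, h, TRUE, INFINITE);
>     }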
>
> I am going to recompile the code for 4 threads by changing:
>
> #define NUM_THREADS 4 // # of threads
>
> Let's try it:
>
> V:\wc5beta>testpeter2t /t
> - size : 357913941
> - memory : 1431655764 (1398101K)
> - repeat : 10
> * Starting threads
> - Creating thread 0
> - Creating thread 1
> - Creating thread 2
> - Creating thread 3
> * Resuming threads
> - Resuming thread# 0 [000007DC] in 41 msecs.
> - Resuming thread# 1 [000007F4] in 467 msecs.
> - Resuming thread# 2 [000007D8] in 334 msecs.
> - Resuming thread# 3 [000007D4] in 500 msecs.
> * Wait For Thread Completion
> * Done
> ---------------------------------------
> 0 | Time: 26078 | Elapsed: 0 | Len: 0
> 1 | Time: 25250 | Elapsed: 0 | Len: 0
> 2 | Time: 25250 | Elapsed: 0 | Len: 0
> 3 | Time: 24906 | Elapsed: 0 | Len: 0
> ---------------------------------------
> Total Time: 101484
>
> So the summary so far:
>
> 1 thread  - ~12 sec
> 2 threads - ~13 sec each
> 4 threads - ~25 sec each
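>
> One way to read those numbers: the shared array is loaded once, so on
> a dual-core box the per-thread wall time scales roughly as
> max(1, threads/cores) x the single-thread time: 2 threads = ~1 x 12
> sec (measured ~13), 4 threads = ~2 x 12 sec (measured ~25). The cost
> of sharing the data itself is nearly zero; contrast the two-instance
> run above that never finished.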
>
> This is where you begin to look at various designs to improve things.
> There are many ideas, but it requires a look at your actual work load.
> We didn't use a MEMORY MAP FILE, and that MIGHT help. I should try
> that, but first let's try a 3-thread run:
>
> #define NUM_THREADS 3 // # of threads
>
> and recompile, then run testpeter2t /t:
>
> - size : 357913941
> - memory : 1431655764 (1398101K)
> - repeat : 10
> * Starting threads
> - Creating thread 0
> - Creating thread 1
> - Creating thread 2
> * Resuming threads
> - Resuming thread# 0 [000007DC] in 41 msecs.
> - Resuming thread# 1 [000007F4] in 467 msecs.
> - Resuming thread# 2 [000007D8] in 334 msecs.
> * Wait For Thread Completion
> * Done
> ---------------------------------------
> 0 | Time: 19453 | Elapsed: 0 | Len: 0
> 1 | Time: 13890 | Elapsed: 0 | Len: 0
> 2 | Time: 18688 | Elapsed: 0 | Len: 0
> ---------------------------------------
> Total Time: 52031
>
> How interesting!! One thread got a near best-case result.
>
> You can actually normalize all this and probably come up with a
> formula to guesstimate what the performance will be as requests grow.
> But this is where WORKER POOLS and IOCP come into play, and if you are
> using NUMA, the Windows NUMA API will help there too!
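>
> (A bare-bones sketch of the IOCP worker-pool idea: a completion port
> used as a work queue, with the pool sized to the core count. This is
> generic Win32, not anything from testpeter2t, and the request payload
> is a placeholder:)
>
>     #include <windows.h>
>     #include <stdio.h>
>
>     HANDLE hIOCP;
>
>     DWORD WINAPI Worker(LPVOID)
>     {
>         DWORD nBytes; ULONG_PTR key; LPOVERLAPPED pov;
>         // Block until a request is posted; key 0 is the quit signal.
>         while (GetQueuedCompletionStatus(hIOCP, &nBytes, &key, &pov,
>                                          INFINITE) && key != 0) {
>             // ... process request #key against the shared data ...
>             printf("worker %lu handled request %lu\n",
>                    GetCurrentThreadId(), (DWORD)key);
>         }
>         return 0;
>     }
>
>     int main()
>     {
>         SYSTEM_INFO si; GetSystemInfo(&si);
>         int nWorkers = (int)si.dwNumberOfProcessors; // pool = cores
>         hIOCP = CreateIoCompletionPort(INVALID_HANDLE_VALUE, NULL,
>                                        0, nWorkers);
>         HANDLE h[64];                     // assumes <= 64 cores
>         for (int i = 0; i < nWorkers; i++)
>             h[i] = CreateThread(NULL, 0, Worker, NULL, 0, NULL);
>         for (ULONG_PTR req = 1; req <= 100; req++) // queue 100 jobs
>             PostQueuedCompletionStatus(hIOCP, 0, req, NULL);
>         for (int i = 0; i < nWorkers; i++)         // one quit each
>             PostQueuedCompletionStatus(hIOCP, 0, 0, NULL);
>         WaitForMultipleObjects(nWorkers, h, TRUE, INFINITE);
>         CloseHandle(hIOCP);
>         return 0;
>     }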
>
> All in all, Peter, this proves how multithreading using shared memory
> is FAR superior to your misconceived idea that your application cannot
> be redesigned for a multi-core/processor machine.
>
> I am willing to bet this simulator is far more stressful than your own
> DFA/OCR application in its work load. ProcessData() here does NO WORK
> at all other than accessing memory. You will not be doing that, so the
> ODDS are very high you will run much more efficiently than this
> simulator.
>
> I want to hear you say "Oh My!" <g>
>


From: Peter Olcott on

"Joseph M. Newcomer" <newcomer(a)flounder.com> wrote in
message news:vmvcq55tuhj1lunc6qcdi9uejup4jg1i4e(a)4ax.com...
> Note in the i7 architecture the L3 cache is shared across all CPUs, so
> you are less likely to be hit by raw memory bandwidth (which compared
> to a CPU is dead-slow), and the answer as to whether multiple threads
> will work effectively can only be determined by measurement of a
> multithreaded app.
>
> Because your logic seems to indicate that raw memory speed is the
> limiting factor, and you have not accounted for the effects of a
> shared L3 cache, any opinion you offer on what is going to happen is
> meaningless. In fact, any opinion about performance is by definition
> meaningless; only actual measurements represent facts ("If you can't
> express it in numbers, it ain't science, it's opinion" -- Robert A.
> Heinlein)

(1) Machine A performs process B in X minutes.
(2) Machine C performs process B in X/8 minutes (800% faster).
(3) The only difference between machine A and machine C is that
machine C has much faster access to RAM (by whatever means).
(4) Therefore process B is memory bandwidth bound.

> More below...
> On Sun, 21 Mar 2010 13:19:34 -0500, "Peter Olcott"
> <NoSpam(a)OCR4Screen.com> wrote:
>
>>I have an application that uses enormous amounts of RAM in a very
>>memory bandwidth intensive way. I recently upgraded my hardware to a
>>machine with 600% faster RAM and 32-fold more L3 cache. This L3 cache
>>is also twice as fast as the prior machine's cache. When I benchmarked
>>my application across the two machines, I gained an 800% improvement
>>in wall clock time. The new machine's CPU is only 11% faster than the
>>prior machine. Both processes were tested on a single CPU.
> ***
> The question is whether you are measuring multiple threads in a single
> executable image across multiple cores, or multiple executable images
> on a single core. Not sure how you know that both processes were
> tested on a single CPU, since you don't mention how you accomplished
> this (there are several techniques, but it is important to know which
> one you used, since each has its own implications for predicting
> overall behavior of a system).
> ****
>>
>>I am thinking that all of the above would tend to show that my process
>>is very memory bandwidth intensive, and thus could not benefit from
>>multiple threads on the same machine because the bottleneck is memory
>>bandwidth rather than CPU cycles. Is this analysis correct?
> ****
> Nonsense! You have no idea what is going on here! The shared L3 cache
> could completely wipe out the memory performance issue, reducing your
> problem to a cache-performance issue. Since you have not conducted the
> experiment in multiple threading, you have no data to indicate one way
> or the other what is going on, and it is the particular memory access
> patterns of YOUR app that matter, and therefore, nobody can offer a
> meaningful estimate based on your L1/L2/L3 cache accesses, whatever
> they may be.
> joe
> ****
>>
> Joseph M. Newcomer [MVP]
> email: newcomer(a)flounder.com
> Web: http://www.flounder.com
> MVP Tips: http://www.flounder.com/mvp_tips.htm


From: Hector Santos on
Peter Olcott wrote:

>
> (1) Machine A performs process B in X minutes.
> (2) Machine C performs process B in X/8 minutes (800% faster).
> (3) The only difference between machine A and machine C is that
> machine C has much faster access to RAM (by whatever means).
> (4) Therefore process B is memory bandwidth bound.
>

Forget that. I just spent a few hours proving to you, with the posted
testpeter2t.cpp, how a multi-threaded huge-shared-data process is
superior to running multiple process instances with redundant huge
data loading.

Have you tested it yourself?

--
HLS
From: Peter Olcott on

"Hector Santos" <sant9442(a)nospam.gmail.com> wrote in message
news:esxrhHXyKHA.5776(a)TK2MSFTNGP06.phx.gbl...
> Peter Olcott wrote:
>
>>
>> (1) Machine A performs process B in X minutes.
>> (2) Machine C performs process B in X/8 minutes (800% faster).
>> (3) The only difference between machine A and machine C is that
>> machine C has much faster access to RAM (by whatever means).
>> (4) Therefore process B is memory bandwidth bound.
>>
>
> Forget that. I just spent a few hours proving to you, with the posted
> testpeter2t.cpp, how a multi-threaded huge-shared-data process is
> superior to running multiple process instances with redundant huge
> data loading.
>
> Have you tested it yourself?
>
> --
> HLS

If you can provide a valid counterexample to my reasoning above,
please do so; otherwise I will have to simply assume that you are
stubbornly wrong.


From: Hector Santos on
Here is the result using a 1.5GB read-only memory-mapped file. I
started with a single process thread, then switched to 2 threads, then
4, 6, 8, 10 and 12 threads. Notice how the processing time for the
earlier threads started high but decreased for the later threads. This
is the caching effect of the read-only memory file. Also note the
Global Memory Status *MEMORY LOAD* percentage. For my machine, it is
at 19% at steady state, but as expected it shoots up when dealing with
this large memory-mapped file. I probably could fine-tune the map
views better, but they are set as read-only. Well, I'll leave the OP
to figure out memory-map coding for his patented DFA meta file process.
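
(For anyone following along, the read-only mapping boils down to
something like the sketch below. The file name and view handling are
illustrative - testpeter3t may well carve the file into multiple views
- and the Memory Load figure is GlobalMemoryStatusEx()'s dwMemoryLoad
field:)

    #include <windows.h>
    #include <stdio.h>

    // Map an existing file read-only and return a pointer to its
    // bytes. A sketch only; error handling is minimal.
    const DWORD* MapDataFile(const char *path, DWORD *pSize)
    {
        HANDLE hFile = CreateFileA(path, GENERIC_READ, FILE_SHARE_READ,
                                   NULL, OPEN_EXISTING,
                                   FILE_FLAG_SEQUENTIAL_SCAN, NULL);
        if (hFile == INVALID_HANDLE_VALUE) return NULL;
        *pSize = GetFileSize(hFile, NULL);
        HANDLE hMap = CreateFileMappingA(hFile, NULL, PAGE_READONLY,
                                         0, 0, NULL);
        CloseHandle(hFile);            // mapping keeps the file alive
        if (!hMap) return NULL;
        const void *p = MapViewOfFile(hMap, FILE_MAP_READ, 0, 0, 0);
        CloseHandle(hMap);             // view keeps the mapping alive
        return (const DWORD*)p;        // UnmapViewOfFile() when done
    }

    void PrintMemoryLoad()
    {
        MEMORYSTATUSEX ms = { sizeof(ms) };
        GlobalMemoryStatusEx(&ms);
        printf("- Memory Load : %lu%%\n", ms.dwMemoryLoad);
    }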

V:\wc5beta>testpeter3t /s:3000000 /r:1
- size : 3000000
- memory : 1536000000 (1500000K)
- repeat : 1
- Memory Load : 25%
- Allocating Data .... 0
---------------------------------------
Time: 2984 | Elapsed: 0
---------------------------------------
Total Client Time: 2984

V:\wc5beta>testpeter3t /s:3000000 /t:2 /r:1
- size : 3000000
- memory : 1536000000 (1500000K)
- repeat : 1
- Memory Load : 25%
- Allocating Data .... 0
* Starting threads
- Creating thread 0
- Creating thread 1
* Resuming threads
- Resuming thread# 0 in 41 msecs.
- Resuming thread# 1 in 467 msecs.
* Wait For Thread Completion
- Memory Load: 96%
* Done
---------------------------------------
0 | Time: 5407 | Elapsed: 0
1 | Time: 4938 | Elapsed: 0
---------------------------------------
Total Time: 10345

V:\wc5beta>testpeter3t /s:3000000 /r:1 /t:4
- size : 3000000
- memory : 1536000000 (1500000K)
- repeat : 1
- Memory Load : 25%
- Allocating Data .... 0
* Starting threads
- Creating thread 0
- Creating thread 1
- Creating thread 2
- Creating thread 3
* Resuming threads
- Resuming thread# 0 in 41 msecs.
- Resuming thread# 1 in 467 msecs.
- Resuming thread# 2 in 334 msecs.
- Resuming thread# 3 in 500 msecs.
* Wait For Thread Completion
- Memory Load: 97%
* Done
---------------------------------------
0 | Time: 6313 | Elapsed: 0
1 | Time: 5844 | Elapsed: 0
2 | Time: 5500 | Elapsed: 0
3 | Time: 5000 | Elapsed: 0
---------------------------------------
Total Time: 22657

V:\wc5beta>testpeter3t /s:3000000 /r:1 /t:6
- size : 3000000
- memory : 1536000000 (1500000K)
- repeat : 1
- Memory Load : 25%
- Allocating Data .... 0
* Starting threads
- Creating thread 0
- Creating thread 1
- Creating thread 2
- Creating thread 3
- Creating thread 4
- Creating thread 5
* Resuming threads
- Resuming thread# 0 in 41 msecs.
- Resuming thread# 1 in 467 msecs.
- Resuming thread# 2 in 334 msecs.
- Resuming thread# 3 in 500 msecs.
- Resuming thread# 4 in 169 msecs.
- Resuming thread# 5 in 724 msecs.
* Wait For Thread Completion
- Memory Load: 97%
* Done
---------------------------------------
0 | Time: 6359 | Elapsed: 0
1 | Time: 5891 | Elapsed: 0
2 | Time: 5547 | Elapsed: 0
3 | Time: 5047 | Elapsed: 0
4 | Time: 4875 | Elapsed: 0
5 | Time: 4141 | Elapsed: 0
---------------------------------------
Total Time: 31860

V:\wc5beta>testpeter3t /s:3000000 /r:1 /t:8
- size : 3000000
- memory : 1536000000 (1500000K)
- repeat : 1
- Memory Load : 25%
- Allocating Data .... 16
* Starting threads
- Creating thread 0
- Creating thread 1
- Creating thread 2
- Creating thread 3
- Creating thread 4
- Creating thread 5
- Creating thread 6
- Creating thread 7
* Resuming threads
- Resuming thread# 0 in 41 msecs.
- Resuming thread# 1 in 467 msecs.
- Resuming thread# 2 in 334 msecs.
- Resuming thread# 3 in 500 msecs.
- Resuming thread# 4 in 169 msecs.
- Resuming thread# 5 in 724 msecs.
- Resuming thread# 6 in 478 msecs.
- Resuming thread# 7 in 358 msecs.
* Wait For Thread Completion
- Memory Load: 96%
* Done
---------------------------------------
0 | Time: 6203 | Elapsed: 0
1 | Time: 5734 | Elapsed: 0
2 | Time: 5391 | Elapsed: 0
3 | Time: 4891 | Elapsed: 0
4 | Time: 4719 | Elapsed: 0
5 | Time: 3984 | Elapsed: 0
6 | Time: 3500 | Elapsed: 0
7 | Time: 3125 | Elapsed: 0
---------------------------------------
Total Time: 37547

V:\wc5beta>testpeter3t /s:3000000 /r:1 /t:10
- size : 3000000
- memory : 1536000000 (1500000K)
- repeat : 1
- Memory Load : 25%
- Allocating Data .... 0
* Starting threads
- Creating thread 0
- Creating thread 1
- Creating thread 2
- Creating thread 3
- Creating thread 4
- Creating thread 5
- Creating thread 6
- Creating thread 7
- Creating thread 8
- Creating thread 9
* Resuming threads
- Resuming thread# 0 in 41 msecs.
- Resuming thread# 1 in 467 msecs.
- Resuming thread# 2 in 334 msecs.
- Resuming thread# 3 in 500 msecs.
- Resuming thread# 4 in 169 msecs.
- Resuming thread# 5 in 724 msecs.
- Resuming thread# 6 in 478 msecs.
- Resuming thread# 7 in 358 msecs.
- Resuming thread# 8 in 962 msecs.
- Resuming thread# 9 in 464 msecs.
* Wait For Thread Completion
- Memory Load: 97%
* Done
---------------------------------------
0 | Time: 7234 | Elapsed: 0
1 | Time: 6766 | Elapsed: 0
2 | Time: 6422 | Elapsed: 0
3 | Time: 5922 | Elapsed: 0
4 | Time: 5750 | Elapsed: 0
5 | Time: 5016 | Elapsed: 0
6 | Time: 4531 | Elapsed: 0
7 | Time: 4125 | Elapsed: 0
8 | Time: 3203 | Elapsed: 0
9 | Time: 2703 | Elapsed: 0
---------------------------------------
Total Time: 51672

V:\wc5beta>testpeter3t /s:3000000 /r:1 /t:12
- size : 3000000
- memory : 1536000000 (1500000K)
- repeat : 1
- Memory Load : 25%
- Allocating Data .... 16
* Starting threads
- Creating thread 0
- Creating thread 1
- Creating thread 2
- Creating thread 3
- Creating thread 4
- Creating thread 5
- Creating thread 6
- Creating thread 7
- Creating thread 8
- Creating thread 9
- Creating thread 10
- Creating thread 11
* Resuming threads
- Resuming thread# 0 in 41 msecs.
- Resuming thread# 1 in 467 msecs.
- Resuming thread# 2 in 334 msecs.
- Resuming thread# 3 in 500 msecs.
- Resuming thread# 4 in 169 msecs.
- Resuming thread# 5 in 724 msecs.
- Resuming thread# 6 in 478 msecs.
- Resuming thread# 7 in 358 msecs.
- Resuming thread# 8 in 962 msecs.
- Resuming thread# 9 in 464 msecs.
- Resuming thread# 10 in 705 msecs.
- Resuming thread# 11 in 145 msecs.
* Wait For Thread Completion
- Memory Load: 97%
* Done
---------------------------------------
0 | Time: 7984 | Elapsed: 0
1 | Time: 7515 | Elapsed: 0
2 | Time: 7188 | Elapsed: 0
3 | Time: 6672 | Elapsed: 0
4 | Time: 6500 | Elapsed: 0
5 | Time: 5781 | Elapsed: 0
6 | Time: 5250 | Elapsed: 0
7 | Time: 4953 | Elapsed: 0
8 | Time: 3953 | Elapsed: 0
9 | Time: 3484 | Elapsed: 0
10 | Time: 2750 | Elapsed: 0
11 | Time: 2547 | Elapsed: 0
---------------------------------------
Total Time: 64577


--
HLS