Can extra processing threads help in this case? [MFC]


From: Peter Olcott on 23 Mar 2010 00:33

The code below apparently proves that you were right all
along.
I ran it as two separate processes and it took about 16.5
seconds for one instance and 16.55 seconds for two
instances.

"Hector Santos" <sant9442(a)nospam.gmail.com> wrote in message
news:uwC5O9jyKHA.3884(a)TK2MSFTNGP06.phx.gbl...
> Peter Olcott wrote:
>
>> Try running your process again using a
>> std::vector<unsigned int>
>> Make sure that you initialize all of this to the
>> subscript of the init loop.
>> Make sure that the process monitor shows that the amount
>> of memory you are allocating is the same amount that
>> total memory is reduced by.
>> Make sure that you only use 1/2 of total memory or less.
>> Make a note of the page fault behavior.
>
> >
>
>> I will try the same thing.
>
>
> You better! :)
>
> I'll BE BACK!
>
> --
> HLS

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <vector>

typedef unsigned int uint32;

const uint32 repeat = 100;
const uint32 size = 524288000 / 4;  // 500 MB of 4-byte elements
std::vector<uint32> Data;

void Process() {
    clock_t start = clock();
    uint32 num = 0;  // was uninitialized; start the chain at index 0
    // Each load depends on the previous one: the loop follows a chain of indices.
    for (uint32 r = 0; r < repeat; r++)
        for (uint32 i = 0; i < size; i++)
            num = Data[num];
    clock_t finish = clock();
    double duration = (double)(finish - start) / CLOCKS_PER_SEC;
    // Printing num keeps the compiler from discarding the loop as dead code.
    printf("%4.2f Seconds (last index %u)\n", duration, num);
}

int main() {
    printf("Size in bytes--->%u\n", size * 4);
    Data.reserve(size);
    // Note: where RAND_MAX is 32767, rand() % size only yields indices below 32768.
    for (uint32 N = 0; N < size; N++)
        Data.push_back(rand() % size);

    char N;
    printf("Hit any key to Continue:");
    scanf("%c", &N);

    Process();

    return 0;
}

From: Peter Olcott on 23 Mar 2010 00:43

"Joseph M. Newcomer" <newcomer(a)flounder.com> wrote in
message news:2jfgq5tbgqtq046op34i7eitkotid2dgkn(a)4ax.com...
> See below...
> On Mon, 22 Mar 2010 21:07:35 -0500, "Peter Olcott"
> <NoSpam(a)OCR4Screen.com> wrote:
>
>>
>>"Joseph M. Newcomer" <newcomer(a)flounder.com> wrote in
>>message news:l45gq55hlc3sn35e2q6vq1ur6dbvsqvqr5(a)4ax.com...
>>> See below...
>>>
>>> On Mon, 22 Mar 2010 16:59:34 -0500, "Peter Olcott"
>>> <NoSpam(a)OCR4Screen.com> wrote:
>>>
>>>>
>>>>"Hector Santos" <sant9442(a)nospam.gmail.com> wrote in
>>>>message
>>>>news:%23F2oLmgyKHA.5360(a)TK2MSFTNGP06.phx.gbl...
>>>>> Peter Olcott wrote:
>>>>>
>>>>>> Joe kept insisting and continues to insist that my
>>>>>> data
>>>>>> is not resident in memory.
>>>>>
>>>>>
>>>>> If you have a 32 bit Windows OS, you are limited to
>>>>> just
>>>>> 2GB RAW ACCESS and 4GB of VIRTUAL MEMORY.
>>>>
>>>>Yes, and that is another thing. I kept saying that I
>>>>have
>>>>a
>>>>64bit OS, and Joe kept forming his replies in terms of a
>>>>32-bit OS.
>>> ****
>>> And how long did I keep saying "Unless you are running a
>>> Win32 process in Win64" but you
>>> did not clarify that you were running on Win64. So in
>>> the
>>> absence of any explicit
>>> statement I had to assume you were running in Win32.
>>> ****
>>>>
>>>>>
>>>>> If your process is loading 4GB, you are using virtual
>>>>> memory.
>>>>>
>>>>>> After loading my data and waiting twelve hours the
>>>>>> process monitor reports zero page faults, when I
>>>>>> execute
>>>>>> my process and run it to completion.
>>>>>
>>>>>
>>>>> You're lying, you told me you have PAGE FAULTS but it
>>>>> settle down to zero, which is NORMAL. But start a 2nd
>>>>> process and you will get page faults.
>>>>
>>>>I only get the page faults until the data is loaded.
>>>>After
>>>>the data is loaded I get essentially no more page
>>>>faults,
>>>>even after waiting twelve hours before running my
>>>>process
>>>>to
>>>>completion. After proving that my data is resident in
>>>>RAM
>>>>Joe continues to chide me for claiming that my data is
>>>>resident in RAM.
>>> ****
>>> If you used a memory-mapped file correctly, you would
>>> have
>>> very low-cost page faults
>>> because you would be mapping to existing pages. But you
>>> seem to not want to hear that
>>> memory-mapped files will improve performance,
>>> particularly
>>> in a multiple-process
>>> environment.
>>> joe
>>> ****
>>
>>I don't want to hear about memory mapped files because I
>>don't want to hear about optimizing virtual memory usage
>>because I don't want to hear about virtual memory until it
>>is proven beyond all possible doubt that my process does
>>not
>>(and can not be made to be) resident in actual RAM all the
>>time.
> ****
> "I don't want to hear about the best way to optimize my
> performance because I am clueless
> about how virtual memory works and have my own belief
> about it, and I don't even want to
> hear that memory-mapped files have the same performance
> characteristics as ordinary pages
> and will be memory resident if there is nothing that
> forces them out, because I don't want
> to listen to any suggestion that might actually work"

It looks like I may have been right about virtual memory. I
just rewrote Hector's test to make it closer to my
process, and was surprised to find that one process instance
takes only slightly less time than two process instances
running concurrently. It turns out that the part of the
process that I was benchmarking was almost entirely Windows
generating character glyphs, and not my system recognizing
them.

Neither process paged any memory at all the whole time that
they executed. I will try another overnight run.

>
>
> Since you don't understand virtual memory, and you
> CERTAINLY don't understand how
> memory-mapped files work, your rationale of why you don't
> want to hear about them is, to
> put it mildly, completely silly.
> ****
>>
>>Since a test showed that my process did remain in actual
>>RAM
>>for at least twelve hours, this is sufficient evidence to
>>show that all of these lines of reason have at least for
>>the
>>moment become completely moot. The only thing that could
>>make them less than completely moot would be proof that my
>>process can not remain resident in RAM all the time.
> ***
> And doesn't this suggest that trying the multithreaded
> experiment is worthwhile? And why
> do you think memory-mapped files will not exhibit the SAME
> behavior? OH, never mind, you
> don't want to know that there are alternative solutions
> that might be more effective than
> what you are currently using, even one that can improve
> multiprocess behavior. So you are
> saying "don't tell me the world can be made better, I
> don't want to make it better"
> joe
>
> ****
>>
>>>>
>>>>You guys just playing head games with me?
>>> ****
>>> We are trying to help you, in spite of your best efforts
>>> to tell us we are wrong. You
>>> insist that simplistic experiments which gave you a
>>> single
>>> data point give you a basis for
>>> extrapolating an entire family of performance
>>> information,
>>> and we are saying "You don't
>>> KNOW until you've MEASURED" and you insist that
>>> measurement is not relevant because you
>>> MUST be right. All I'm saying is that you MIGHT be
>>> right,
>>> and once you do the
>>> measurements, you might find out that you are completely
>>> WRONG, which works to your
>>> advantage. So run the damn experiment, already!
>>> joe
>>>
>>> ****
>>>>
>>>>>
>>>>> I also asked, now 5 times, to provide the MEMORY LOAD
>>>>> percentage which I even provided with a simple C
>>>>> program
>>>>> that you can compile, and you did not:
>>>>>
>>>>> // File: V:\bin\memload.cpp
>>>>>
>>>>> #include <stdio.h>
>>>>> #include <windows.h>
>>>>>
>>>>> int main()
>>>>> {
>>>>> MEMORYSTATUS ms;
>>>>> ms.dwLength = sizeof(ms);
>>>>> GlobalMemoryStatus(&ms);
>>>>> printf("Memory Load: %lu%%\n", ms.dwMemoryLoad);
>>>>> return 0;
>>>>> }
>>>>>
>>>>> Why can't you even do that?
>>>>>
>>>>>> How does this not prove Joe is wrong (At least in the
>>>>>> specific instance of one execution of my process)?
>>>>>> (1) The process monitor is lying.
>>>>>> (2) Page faults do not measure virtual memory usage.
>>>>>
>>>>> There are now, what, 4-5 participants in the thread who
>>>>> are
>>>>> telling you your thinking is wrong and that you lack an
>>>>> understanding
>>>>> of
>>>>> the Windows and Intel hardware.
>>>>>
>>>>> Let's get a few more like this guy with a somewhat
>>>>> layman
>>>>> description:
>>>>>
>>>>> http://blogs.sepago.de/helge/2008/01/09/windows-x64-all-the-same-yet-very-different-part-1/
>>>>>
>>>>> and the #1 guy at Microsoft today!
>>>>>
>>>>> http://blogs.technet.com/markrussinovich/archive/2008/07/21/3092070.aspx
>>>>>
>>>>> If you DEFY what Mark Russinovich is saying here, you
>>>>> are
>>>>> CRAZY!
>>>>>
>>>>> --
>>>>> HLS
>>>>
>>> Joseph M. Newcomer [MVP]
>>> email: newcomer(a)flounder.com
>>> Web: http://www.flounder.com
>>> MVP Tips: http://www.flounder.com/mvp_tips.htm
>>
> Joseph M. Newcomer [MVP]
> email: newcomer(a)flounder.com
> Web: http://www.flounder.com
> MVP Tips: http://www.flounder.com/mvp_tips.htm

From: Joseph M. Newcomer on 23 Mar 2010 00:44

See below...
On Mon, 22 Mar 2010 20:28:40 -0500, "Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote:

>
>"Hector Santos" <sant9442(a)nospam.gmail.com> wrote in message
>news:OTB08HiyKHA.5360(a)TK2MSFTNGP06.phx.gbl...
>> Peter Olcott wrote:
>>
>>>> It says nothing about one execution; it says that under
>>>> certain conditions, paging is not
>>>> an issue. It does not say anything about using multiple
>>>> threads on multiple cores within
>>>> a single process.
>>>
>>> This is the earlier issue where you claimed that my
>>> thinking that I needed to have my data resident in RAM
>>> was absurd and based on ignorance. I do need to have my
>>> data resident in RAM and indeed my data is resident in
>>> RAM for extended periods, and there is no ignorance
>>> associated with this thinking.
>>
>>
>> But it is ignorance because your PROCESS MEMORY is VIRTUAL
>> MEMORY!
>
>Virtual memory is essentially disk pretending to be RAM,
***
Clueless. This is only ONE aspect of virtual memory. It is an implementation issue of
how VM is realized. You seem to think that paging is what defines virtual memory; you are
completely WRONG in this regard. VM means that the addresses which are seen in the
process are NOT physical addresses, but mapped addresses. Period. That mapping has an
interesting property when physical memory is oversubscribed, which is that the operating system
marks pages as being "not in memory" and provides support for bringing them in on demand;
this is "paged virtual memory". But note that you can run Windows on a machine without a
disk drive (boot from read-only flash memory) and it STILL runs with virtual memory, but
without paging. You have unfortunately coupled these ideas into thinking that paged
memory is the ONLY kind of "virtual memory", which is a serious logical error. Your model
is not shared by anyone else who writes or understands modern operating systems.
****
>when disk is not used (no page faults) then it is no longer
>disk pretending to be RAM. Even though some of the VM
>infrastructure remains in place and still operates,
>(requiring a tiny bit of overhead) the part that most
>significantly impacts performance is not functioning. Thus
>from a performance point of view VM is essentially not
>functioning.
****
Go find out about the TLB and its purpose, and how a virtual address is converted to a
physical address (two additional memory fetches required in the worst case), before you
start talking about "tiny bit[s] of overhead", or about what is or is not functioning. VM is
ALWAYS functioning. And it ALWAYS has a cost. And that cost is nonlinear, and not
subject to static analysis.
****
>
>If you want to get nit picky and refrain from boiling things
>down to their essence you can say that VM is still
>operating. For all practical purposes from a pure
>performance point of view, VM is impacting performance
>negligibly, and thus can be construed as if it was not
>functioning. That is one example of the extraneous nit picky
>details that always boiling everything to its bare essence
>strips from further consideration.
****
Gee, I guess I missed how address translation worked when I read the Intel manuals...try
chapter 3 of Volume 3A of the Intel architecture manual (at least that's where it is in
the latest download I did). Then tell me if VM is free, even if there is no paging
traffic. This is why running multiple threads even on the same core will not necessarily
have a linear slowdown, because the TLB will smooth off some of the rough edges.

These little details matter to the level of 10x-20x performance, no small amount. But go
ahead, tell me they are fiddling little details that don't matter. Practical experience
suggests that they DO matter, a LOT.
joe

****
>
>>
>> Look, for people who have PCs with 1GB or 2GB of RAM, the
>> PROCESS STILL GETS 4GB.
>>
>> Where is it "GHOST RAM" coming from?
>>
>>
>> --
>> HLS
>
Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm

From: Joseph M. Newcomer on 23 Mar 2010 00:48

See below...
On Mon, 22 Mar 2010 18:50:33 -0500, "Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote:

>
>"Hector Santos" <sant9442(a)nospam.gmail.com> wrote in message
>news:ufEXU5gyKHA.5940(a)TK2MSFTNGP02.phx.gbl...
>>
>> Pete Delgado wrote:
>>
>>> "Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote in message
>>
>>
>>>>> He has NO CLUE as to what a "memory-mapped file"
>>>>> actually is. This last comment indicates
>>>> http://en.wikipedia.org/wiki/Memory-mapped_file
>>>> Apparently I do.
>>>
>>> I think you would be far better served by looking at
>>> Windows specific information on memory mapped files such
>>> as that which Joe suggested to you some time ago:
>>> Richter's Programming Applications for Microsoft Windows
>>> 4th.
>>
>> And I suggested waaaaaaaay back in the beginning of this
>> thread. :) I even gave him a link for a sweet
>> CMemoryMapFile class at MSDN!
>>
>> --
>> HLS
>
>I have proven that this is moot, and this proof continues to
>be ignored. I know you guys must be just messing with me
>because there are guys that are not just messing with me on
>several other groups. They can prove that they know what
>they are talking about by explaining how the underlying
>details fit together.
>
****
Yes, we enjoy torturing small animals, too. We are not "messing" with you, we are trying
to help you, but you don't want to listen, and keep telling us that our ideas are
worthless. Never mind that we've actually DONE this stuff before. I spent 15 years of my
life doing a LOT of performance evaluation and optimization. I know a little bit about
it. And I've spent some time studying how FORTRAN compilers optimize array access code by
cache optimization. We worried a lot about this sort of thing in the 1970s.
joe
****
Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm

From: Hector Santos on 23 Mar 2010 00:55

Peter Olcott wrote:

> Try running your process again using a std::vector<unsigned
> int>
> Make sure that you initialize all of this to the subscript
> of the init loop.
> Make sure that the process monitor shows that the amount of
> memory you are allocating is the same amount that total
> memory is reduced by.
> Make sure that you only use 1/2 of total memory or less.
> Make a note of the page fault behavior.
> I will try the same thing.

Like I said, you better! I'm done!

Now, background. To emulate your machine, I only have a 2GB XP DUAL,
so the allocation is 1GB in this case.

This is a base line which is the SINGLE MAIN THREAD PROCESS:

V:\wc5beta>TestPeter4T.exe /r:1 /s:250000000
- size : 250000000
- memory : 1000000000 (976562K)
- repeat : 1
- Memory Load : 17%
- Allocating Data:ram .... 1734
---------------------------------------
Time: 4437 | Elapsed: 0
---------------------------------------
Total Client Time: 4437

What I note here is that it took 1.7 seconds for the
std::vector<DWORD> allocation. So there is OVERHEAD associated with
this std c/c++ collection class. I have comments about this later.

Now is the test with TWO THREADS

V:\wc5beta>TestPeter4T.exe /r:1 /s:250000000 /t
- size : 250000000
- memory : 1000000000 (976562K)
- repeat : 1
- Memory Load : 17%
- Allocating Data:ram .... 1735
* Starting threads
- Creating thread 0
- Creating thread 1
* Resuming threads
- Resuming thread# 0 in 175 msecs.
- Resuming thread# 1 in 188 msecs.
* Wait For Thread Completion
- Memory Load: 64%
* Done
---------------------------------------
0 | Time: 4469 | Elapsed: 0
1 | Time: 4469 | Elapsed: 0
---------------------------------------
Total Time: 8938

VOILA! Hardly any different with two threads. Let's try FOUR:

V:\wc5beta>TestPeter4T.exe /r:1 /s:250000000 /t:4
- size : 250000000
- memory : 1000000000 (976562K)
- repeat : 1
- Memory Load : 23%
- Allocating Data:ram .... 1734
* Starting threads
- Creating thread 0
- Creating thread 1
- Creating thread 2
- Creating thread 3
* Resuming threads
- Resuming thread# 0 in 756 msecs.
- Resuming thread# 1 in 806 msecs.
- Resuming thread# 2 in 224 msecs.
- Resuming thread# 3 in 19 msecs.
* Wait For Thread Completion
- Memory Load: 70%
* Done
---------------------------------------
0 | Time: 7953 | Elapsed: 0
1 | Time: 8485 | Elapsed: 0
2 | Time: 7984 | Elapsed: 0
3 | Time: 8359 | Elapsed: 0
---------------------------------------

So it averaged double time with 4 threads.

Remember, this is a std::vector, which in my view is not a very good
idea for this purpose.

What functionality do you get from this vector class? If you are just
looking for an index array, you would be better off using a
straightforward C array or, if you want to use a class, try CArray.

You are not getting any benefit from using std::vector. But
honestly, I have limited internal experience with std C/C++ collection
classes. The fact that it took a long time to allocate tells me it's not
very optimal.

Let me try with CArray.

V:\wc5beta>TestPeter4T.exe /r:1 /s:250000000
- size : 250000000
- memory : 1000000000 (976562K)
- repeat : 1
- Memory Load : 23%
- Allocating Data:ram .... 1250
---------------------------------------
Time: 3875 | Elapsed: 0
---------------------------------------
Total Client Time: 3875

V:\wc5beta>TestPeter4T.exe /r:1 /s:250000000 /t
- size : 250000000
- memory : 1000000000 (976562K)
- repeat : 1
- Memory Load : 23%
- Allocating Data:ram .... 1250
* Starting threads
- Creating thread 0
- Creating thread 1
* Resuming threads
- Resuming thread# 0 in 5 msecs.
- Resuming thread# 1 in 825 msecs.
* Wait For Thread Completion
- Memory Load: 70%
* Done
---------------------------------------
0 | Time: 3922 | Elapsed: 0
1 | Time: 3922 | Elapsed: 0
---------------------------------------
Total Time: 7844

V:\wc5beta>TestPeter4T.exe /r:1 /s:250000000 /t:4
- size : 250000000
- memory : 1000000000 (976562K)
- repeat : 1
- Memory Load : 23%
- Allocating Data:ram .... 1234
* Starting threads
- Creating thread 0
- Creating thread 1
- Creating thread 2
- Creating thread 3
* Resuming threads
- Resuming thread# 0 in 682 msecs.
- Resuming thread# 1 in 16 msecs.
- Resuming thread# 2 in 157 msecs.
- Resuming thread# 3 in 406 msecs.
* Wait For Thread Completion
- Memory Load: 70%
* Done
---------------------------------------
0 | Time: 7735 | Elapsed: 0
1 | Time: 7390 | Elapsed: 0
2 | Time: 7594 | Elapsed: 0
3 | Time: 7312 | Elapsed: 0
---------------------------------------
Total Time: 30031

As you can see, the MFC collection class was slightly faster!

But you can't beat going with a pure C array:

V:\wc5beta>TestPeter4T.exe /r:1 /s:250000000
- size : 250000000
- memory : 1000000000 (976562K)
- repeat : 1
- Memory Load : 23%
- Allocating Data:ram .... 0
---------------------------------------
Time: 1938 | Elapsed: 0
---------------------------------------
Total Client Time: 1938

V:\wc5beta>TestPeter4T.exe /r:1 /s:250000000 /t
- size : 250000000
- memory : 1000000000 (976562K)
- repeat : 1
- Memory Load : 23%
- Allocating Data:ram .... 0
* Starting threads
- Creating thread 0
- Creating thread 1
* Resuming threads
- Resuming thread# 0 in 20 msecs.
- Resuming thread# 1 in 298 msecs.
* Wait For Thread Completion
- Memory Load: 69%
* Done
---------------------------------------
0 | Time: 2094 | Elapsed: 0
1 | Time: 1781 | Elapsed: 0
---------------------------------------
Total Time: 3875

V:\wc5beta>TestPeter4T.exe /r:1 /s:250000000 /t:4
- size : 250000000
- memory : 1000000000 (976562K)
- repeat : 1
- Memory Load : 23%
- Allocating Data:ram .... 0
* Starting threads
- Creating thread 0
- Creating thread 1
- Creating thread 2
- Creating thread 3
* Resuming threads
- Resuming thread# 0 in 991 msecs.
- Resuming thread# 1 in 689 msecs.
- Resuming thread# 2 in 490 msecs.
- Resuming thread# 3 in 212 msecs.
* Wait For Thread Completion
- Memory Load: 70%
* Done
---------------------------------------
0 | Time: 2781 | Elapsed: 0
1 | Time: 2437 | Elapsed: 0
2 | Time: 2250 | Elapsed: 0
3 | Time: 2281 | Elapsed: 0
---------------------------------------
Total Time: 9749

SWEET!

Now, what I didn't bring up before is that you can take even better
control by using a better memory manager with your own HEAP manager.

I read you said that you load a bunch of files into your std::vector.

You can definitely do better when you have a bunch of files.

--
HLS
