From: Peter Olcott on

"Hector Santos" <sant9442(a)nospam.gmail.com> wrote in message
news:O0kgYUkyKHA.404(a)TK2MSFTNGP02.phx.gbl...
> Peter Olcott wrote:
>
>> Try running your process again using a
>> std::vector<unsigned int>
>> Make sure that you initialize all of this to the
>> subscript of the init loop.
>> Make sure that the process monitor shows that the amount
>> of memory you are allocating is the same amount that
>> total memory is reduced by.
>> Make sure that you only use 1/2 of total memory or less.
>> Make a note of the page fault behavior.
>> I will try the same thing.
>
>
> Like I said, you better! I'm done!

I posted my code and my results. I ran it as a single
process and two separate processes, concurrently.
One process took 16.5 seconds of wall clock time.
Two concurrent processes took 16.55 seconds of wall clock
time.
This proves that you were right all along.
Which means that my process will scale much better than I
expected.

>
> Now, background. To emulate your machine, I only have a
> 2GB XP DUAL, so the allocation is 1GB in this case.
>
> This is a base line which is the SINGLE MAIN THREAD
> PROCESS:
>
> V:\wc5beta>TestPeter4T.exe /r:1 /s:250000000
> - size : 250000000
> - memory : 1000000000 (976562K)
> - repeat : 1
> - Memory Load : 17%
> - Allocating Data:ram .... 1734
> ---------------------------------------
> Time: 4437 | Elapsed: 0
> ---------------------------------------
> Total Client Time: 4437
>
> What I note here is that it took 1.7 seconds for the
> std::vector<DWORD> allocation. So there is OVERHEAD
> associated with this std c/c++ collection class. I have
> comments about this later.
>
> Now is the test with TWO THREADS
>
> V:\wc5beta>TestPeter4T.exe /r:1 /s:250000000 /t
> - size : 250000000
> - memory : 1000000000 (976562K)
> - repeat : 1
> - Memory Load : 17%
> - Allocating Data:ram .... 1735
> * Starting threads
> - Creating thread 0
> - Creating thread 1
> * Resuming threads
> - Resuming thread# 0 in 175 msecs.
> - Resuming thread# 1 in 188 msecs.
> * Wait For Thread Completion
> - Memory Load: 64%
> * Done
> ---------------------------------------
> 0 | Time: 4469 | Elapsed: 0
> 1 | Time: 4469 | Elapsed: 0
> ---------------------------------------
> Total Time: 8938
>
> VOILA! Hardly any difference with two threads. Let's try
> FOUR:
>
> V:\wc5beta>TestPeter4T.exe /r:1 /s:250000000 /t:4
> - size : 250000000
> - memory : 1000000000 (976562K)
> - repeat : 1
> - Memory Load : 23%
> - Allocating Data:ram .... 1734
> * Starting threads
> - Creating thread 0
> - Creating thread 1
> - Creating thread 2
> - Creating thread 3
> * Resuming threads
> - Resuming thread# 0 in 756 msecs.
> - Resuming thread# 1 in 806 msecs.
> - Resuming thread# 2 in 224 msecs.
> - Resuming thread# 3 in 19 msecs.
> * Wait For Thread Completion
> - Memory Load: 70%
> * Done
> ---------------------------------------
> 0 | Time: 7953 | Elapsed: 0
> 1 | Time: 8485 | Elapsed: 0
> 2 | Time: 7984 | Elapsed: 0
> 3 | Time: 8359 | Elapsed: 0
> ---------------------------------------
>
> So it averaged double time with 4 threads.
>
> Remember, this is a std::vector, which in my view is not a
> very good idea for this purpose.
>
> What functionality do you get from this vector class? If
> you are just looking for an indexed array, you would be
> better off using a straightforward C array, or if you want
> to use a class, try CArray.
>
> You are not getting any benefit from using std::vector.
> But honestly, I have limited internal experience with the
> std C/C++ collection classes. The fact that it took a long
> time to allocate tells me it's not very optimal.
>
> Let me try with CArray.
>
> V:\wc5beta>TestPeter4T.exe /r:1 /s:250000000
> - size : 250000000
> - memory : 1000000000 (976562K)
> - repeat : 1
> - Memory Load : 23%
> - Allocating Data:ram .... 1250
> ---------------------------------------
> Time: 3875 | Elapsed: 0
> ---------------------------------------
> Total Client Time: 3875
>
> V:\wc5beta>TestPeter4T.exe /r:1 /s:250000000 /t
> - size : 250000000
> - memory : 1000000000 (976562K)
> - repeat : 1
> - Memory Load : 23%
> - Allocating Data:ram .... 1250
> * Starting threads
> - Creating thread 0
> - Creating thread 1
> * Resuming threads
> - Resuming thread# 0 in 5 msecs.
> - Resuming thread# 1 in 825 msecs.
> * Wait For Thread Completion
> - Memory Load: 70%
> * Done
> ---------------------------------------
> 0 | Time: 3922 | Elapsed: 0
> 1 | Time: 3922 | Elapsed: 0
> ---------------------------------------
> Total Time: 7844
>
> V:\wc5beta>TestPeter4T.exe /r:1 /s:250000000 /t:4
> - size : 250000000
> - memory : 1000000000 (976562K)
> - repeat : 1
> - Memory Load : 23%
> - Allocating Data:ram .... 1234
> * Starting threads
> - Creating thread 0
> - Creating thread 1
> - Creating thread 2
> - Creating thread 3
> * Resuming threads
> - Resuming thread# 0 in 682 msecs.
> - Resuming thread# 1 in 16 msecs.
> - Resuming thread# 2 in 157 msecs.
> - Resuming thread# 3 in 406 msecs.
> * Wait For Thread Completion
> - Memory Load: 70%
> * Done
> ---------------------------------------
> 0 | Time: 7735 | Elapsed: 0
> 1 | Time: 7390 | Elapsed: 0
> 2 | Time: 7594 | Elapsed: 0
> 3 | Time: 7312 | Elapsed: 0
> ---------------------------------------
> Total Time: 30031
>
> As you can see, the MFC collection class was
> slightly faster!
>
> But you can't beat going with a pure C array:
>
> V:\wc5beta>TestPeter4T.exe /r:1 /s:250000000
> - size : 250000000
> - memory : 1000000000 (976562K)
> - repeat : 1
> - Memory Load : 23%
> - Allocating Data:ram .... 0
> ---------------------------------------
> Time: 1938 | Elapsed: 0
> ---------------------------------------
> Total Client Time: 1938
>
> V:\wc5beta>TestPeter4T.exe /r:1 /s:250000000 /t
> - size : 250000000
> - memory : 1000000000 (976562K)
> - repeat : 1
> - Memory Load : 23%
> - Allocating Data:ram .... 0
> * Starting threads
> - Creating thread 0
> - Creating thread 1
> * Resuming threads
> - Resuming thread# 0 in 20 msecs.
> - Resuming thread# 1 in 298 msecs.
> * Wait For Thread Completion
> - Memory Load: 69%
> * Done
> ---------------------------------------
> 0 | Time: 2094 | Elapsed: 0
> 1 | Time: 1781 | Elapsed: 0
> ---------------------------------------
> Total Time: 3875
>
> V:\wc5beta>TestPeter4T.exe /r:1 /s:250000000 /t:4
> - size : 250000000
> - memory : 1000000000 (976562K)
> - repeat : 1
> - Memory Load : 23%
> - Allocating Data:ram .... 0
> * Starting threads
> - Creating thread 0
> - Creating thread 1
> - Creating thread 2
> - Creating thread 3
> * Resuming threads
> - Resuming thread# 0 in 991 msecs.
> - Resuming thread# 1 in 689 msecs.
> - Resuming thread# 2 in 490 msecs.
> - Resuming thread# 3 in 212 msecs.
> * Wait For Thread Completion
> - Memory Load: 70%
> * Done
> ---------------------------------------
> 0 | Time: 2781 | Elapsed: 0
> 1 | Time: 2437 | Elapsed: 0
> 2 | Time: 2250 | Elapsed: 0
> 3 | Time: 2281 | Elapsed: 0
> ---------------------------------------
> Total Time: 9749
>
> SWEET!
>
> Now, what I didn't bring up before is that you can take
> even better control of memory by writing your own HEAP
> manager.
>
> I read you said that you load a bunch of files into your
> std::vector.
>
> You can definitely do better when you have a bunch of
> files.
>
>
>
> --
> HLS


From: Hector Santos on
Hmmmmm, you mean two threads in one process?

What is this:

num = Data[num]

Do you mean:

num = Data[i];

Take the posted code I gave you and change this part:


#define USE_STD_VECTOR
#include <vector>

//------------------------------------------------------
// Parameters to play with
//------------------------------------------------------

#define KIND DWORD              // array element type
#define MAX_THREADS 64          // max # of threads
DWORD nRepeat = 10;             // data access repeats
DWORD nTotalThreads = 2;        // # of threads
DWORD size = MAXLONG/6;         // ~1.4GB
#ifdef USE_STD_VECTOR
std::vector<KIND> *data = NULL;
#else
KIND *data = NULL;
#endif

//------------------------------------------------------
// Functions to simulate application work load
// The process data function simply reads the
// memory.
//------------------------------------------------------

BOOL AllocateData()
{
    DWORD t1 = GetTickCount();
    _cprintf("- Allocating Data:ram .... ");
#ifdef USE_STD_VECTOR
    data = new std::vector<KIND>(size);
#else
    data = new KIND[size];
#endif
    _cprintf("%d\n",GetTickCount()-t1);
    return TRUE;
}

void DeallocateData()
{
    if (bUseFileMap) {
        fmdata.Close();
    } else {
#ifdef USE_STD_VECTOR
        delete data;        // single std::vector object
#else
        delete[] data;      // array allocated with new[]
#endif
    }
}

#pragma optimize("",off)
void ProcessData()
{
    KIND num;
    for(DWORD r = 0; r < nRepeat; r++) {
        for (DWORD i=0; i < size; i++) {
            DWORD j = i;
#ifdef USE_STD_VECTOR
            num = (*data)[j];
#else
            num = data[j];
#endif
        }
    }
}
#pragma optimize("",on)

And run it with no switches and then /t:2 and /t:4.

WATCH it perform far better!

I would also explore it with USE_STD_VECTOR commented out.

--


Peter Olcott wrote:

> The code below apparently proves that you were right all
> along.
> I ran it as two separate processes and it took about 16.5
> seconds for one instance and 16.55 seconds for two
> instances.
>
> "Hector Santos" <sant9442(a)nospam.gmail.com> wrote in message
> news:uwC5O9jyKHA.3884(a)TK2MSFTNGP06.phx.gbl...
>> Peter Olcott wrote:
>>
>>> Try running your process again using a
>>> std::vector<unsigned int>
>>> Make sure that you initialize all of this to the
>>> subscript of the init loop.
>>> Make sure that the process monitor shows that the amount
>>> of memory you are allocating is the same amount that
>>> total memory is reduced by.
>>> Make sure that you only use 1/2 of total memory or less.
>>> Make a note of the page fault behavior.
>>> I will try the same thing.
>>
>> You better! :)
>>
>> I'll BE BACK!
>>
>> --
>> HLS
>
> #include <stdio.h>
> #include <stdlib.h>
> #include <vector>
> #include <time.h>
>
> #define uint32 unsigned int
>
> const uint32 repeat = 100;
> const uint32 size = 524288000 / 4;
> std::vector<uint32> Data;
>
> // Follows the chain of random indexes stored in Data.
> void Process() {
>     clock_t finish;
>     clock_t start = clock();
>     double duration;
>     uint32 num = 0;  // start index for the chain
>     for (uint32 r = 0; r < repeat; r++)
>         for (uint32 i = 0; i < size; i++)
>             num = Data[num];
>     finish = clock();
>     duration = (double)(finish - start) / CLOCKS_PER_SEC;
>     printf("%4.2f Seconds\n", duration);
> }
>
> int main() {
>     printf("Size in bytes--->%u\n", size * 4);
>     Data.reserve(size);
>     for (uint32 N = 0; N < size; N++)
>         Data.push_back(rand() % size);
>
>     char N;
>     printf("Hit any key to Continue:");
>     scanf("%c", &N);
>
>     Process();
>
>     return 0;
> }
>
>
>
>



--
HLS
From: Peter Olcott on

"Hector Santos" <sant9442(a)nospam.gmail.com> wrote in message
news:O0kgYUkyKHA.404(a)TK2MSFTNGP02.phx.gbl...
> Peter Olcott wrote:
>
>> Try running your process again using a
>> std::vector<unsigned int>
>> Make sure that you initialize all of this to the
>> subscript of the init loop.
>> Make sure that the process monitor shows that the amount
>> of memory you are allocating is the same amount that
>> total memory is reduced by.
>> Make sure that you only use 1/2 of total memory or less.
>> Make a note of the page fault behavior.
>> I will try the same thing.
>
>
> Like I said, you better! I'm done!

Even when I nearly max out my RAM with four concurrent
processes using nearly 2.0 GB each, the total time of four
concurrent processes is only 1.33-fold more than the time
for a single process.



From: Peter Olcott on

"Hector Santos" <sant9442(a)nospam.gmail.com> wrote in message
news:etOekekyKHA.5036(a)TK2MSFTNGP02.phx.gbl...
> Hmmmmm, you mean two threads in one process?
>
> What is this:
>
> num = Data[num]
>
> Do you mean:
>
> num = Data[i];

No, I mean it just like it is. I init all of memory with
random numbers and then access the memory locations
referenced by these random numbers in a tight loop. This
attempts to force memory bandwidth use to its limit. Even
with four cores I do not reach the limit.

What are the heuristics for making a process thread safe?
(1) Keep all data in locals to the best extent possible.
(2) Eliminate the need for global data that must be written
to, if possible.
(3) Global data that is only read from is OK.
(4) Only use thread-safe libraries.

I think if I can follow all those rules, then the much more
complex rules aren't even needed.

Did I miss anything?



From: Hector Santos on
Peter Olcott wrote:

> "Hector Santos" <sant9442(a)nospam.gmail.com> wrote in message
> news:O0kgYUkyKHA.404(a)TK2MSFTNGP02.phx.gbl...
>> Peter Olcott wrote:
>>
>>> Try running your process again using a
>>> std::vector<unsigned int>
>>> Make sure that you initialize all of this to the
>>> subscript of the init loop.
>>> Make sure that the process monitor shows that the amount
>>> of memory you are allocating is the same amount that
>>> total memory is reduced by.
>>> Make sure that you only use 1/2 of total memory or less.
>>> Make a note of the page fault behavior.
>>> I will try the same thing.
>>
>> Like I said, you better! I'm done!
>
> I posted my code and my results. I ran it as a single
> process and two separate processes, concurrently.
> One process took 16.5 seconds of wall clock time.
> Two concurrent processes took 16.55 seconds of wall clock
> time.
> This proves that you were right all along.
> Which means that my process will scale much better than I
> expected.

I hear what you said, and that's good that you think I am right, but
you didn't make it threaded, right?

The whole point of this exercise was to show that with separate
processes you are duplicating the std::vector allocation in each one,
which puts pressure on the system. By doing it in one process with X
number of threads, you will see different results.

Also, you had a line:

num = data[num]

you were not referencing the entire spectrum of your memory
allocation, just one element.

I bet if you fix that to num=data[i], then you will see your problem
again with two instances.

If you want to change your code to make it threaded, change it to this
and play around with MAX_THREADS. Try 1, 2, 4, etc. and pay attention to:

- Memory Load displayed

and in Task Manager:

- Working Set (memory usage)
- Page Faults
- Page Faults Delta
- VM Size

These are all columns you can set via VIEW | Select Columns.

-------------- CUT HERE -----------------
#include <stdio.h>
#include <windows.h>
#include <stdlib.h>
#include <vector>
#include <time.h>
#include <conio.h>

const DWORD MAX_THREADS = 2;

#define uint32 unsigned int

const uint32 repeat = 1;
const uint32 size = 524288000 / 4;
std::vector<uint32> Data;

typedef struct _tagTThreadData {
    DWORD index;
    double duration;
} TThreadData;

TThreadData ThreadData[MAX_THREADS] = {0};

// Thread entry point: uses the signature CreateThread expects,
// so no function-pointer cast is needed.
DWORD WINAPI Process(LPVOID param)
{
    TThreadData *data = (TThreadData *)param;
    clock_t finish;
    clock_t start = clock();
    volatile uint32 num = 0;   // volatile keeps the reads from being optimized away
    for (uint32 r = 0; r < repeat; r++)
        for (uint32 i = 0; i < size; i++)
            num = Data[i];
    finish = clock();
    data->duration = (double)(finish - start) / CLOCKS_PER_SEC;
    return 0;
}

int main() {
    printf("Size in bytes--->%u\n", size * 4);
    Data.reserve(size);
    for (uint32 N = 0; N < size; N++)
        Data.push_back(rand() % size);

    char N;
    printf("Hit any key to Continue:");
    scanf("%c", &N);

    HANDLE hThreads[MAX_THREADS] = {0};
    DWORD tid;
    DWORD i;

    printf("* Starting threads\n");
    for (i = 0; i < MAX_THREADS; i++) {
        ThreadData[i].index = i;   // remember which thread owns this slot
        hThreads[i] = CreateThread(
            NULL,
            0,
            Process,
            &ThreadData[i],
            0,
            &tid);
    }

    printf("* Wait For Thread Completion\n");

    // Poll every 100 ms and show the system memory load; ESC stops waiting.
    while (WaitForMultipleObjects(MAX_THREADS, hThreads, TRUE, 100)
           == WAIT_TIMEOUT) {
        if (_kbhit() && _getch() == 27) break;
        MEMORYSTATUSEX ms;
        ms.dwLength = sizeof(ms);
        GlobalMemoryStatusEx(&ms);
        printf("- Memory Load: %lu%%\r", ms.dwMemoryLoad);
    }
    printf("\n");

    for (i = 0; i < MAX_THREADS; i++) {
        TThreadData dt = ThreadData[i];
        printf("%-3lu | Time: %4.7f\n", dt.index, dt.duration);
    }

    printf("* Done\n");

    return 0;
}
-------------- CUT HERE -----------------




--
HLS