From: Hector Santos on
Joseph M. Newcomer wrote:

>> (1) People in a more specialized group are coming to the
>> same conclusions that I have derived.
> ****
> How? I have no idea how to predict L3 cache performance on an i7 system, and I don't
> believe they do, either. No theoretical model exists that is going to predict actual
> behavior, short of a detailed simulation, and I talked to Intel and they are not releasing
> performance statistics, period, so there is no way short of running the experiment to
> obtain a meaningful result.
> ****


Have you seen the posted C/C++ simulator and proof that shows how
using multiple threads and shared data trumps his single-main-thread
process theory?
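
For anyone who missed that thread, the shape of the simulator is
roughly this minimal sketch (the names and sizes here are mine for
illustration, not the posted code): every worker thread reads the same
in-process array, so there is exactly one copy of the data no matter
how many threads run.

#include <windows.h>
#include <process.h>
#include <stdio.h>

#define NTHREADS 4
#define NITEMS   (1000 * 1000)

static int g_data[NITEMS];          /* one copy, shared by every thread */

unsigned __stdcall Worker(void *arg)
{
    long sum = 0;
    for (DWORD i = 0; i < NITEMS; i++)
        sum += g_data[i];           /* read-only access: no locks needed */
    printf("thread %u done, sum=%ld\n", (unsigned)(UINT_PTR)arg, sum);
    return 0;
}

int main(void)
{
    HANDLE h[NTHREADS];
    for (int t = 0; t < NTHREADS; t++)
        h[t] = (HANDLE)_beginthreadex(NULL, 0, Worker,
                                      (void*)(UINT_PTR)t, 0, NULL);
    WaitForMultipleObjects(NTHREADS, h, TRUE, INFINITE);
    for (int t = 0; t < NTHREADS; t++)
        CloseHandle(h[t]);
    return 0;
}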


>> (2) When a process requires essentially random (mostly
>> unpredictable) access to far more memory than can possibly
>> fit into the largest cache, then actual memory access time
>> becomes a much more significant factor in determining actual
>> response time.
> ****
> What is your cache collision ratio, actually? Do you really understand the L3 cache
> replacement algorithm? (I can't find out anything about it on the Intel site! So I'm
> surprised you have this information, which Intel considers Corporate Confidential)
> ****


Well, the thing is, Joe, that this chip cache is something he will
be using. His application will use the cache the OS maintains.

He is worrying about things he shouldn't be worrying about. He
thinks his CODE deals directly with the chip caches.

--
HLS
From: Joseph M. Newcomer on
See below...

On Mon, 22 Mar 2010 10:31:17 -0500, "Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote:

>
>"Hector Santos" <sant9442(a)nospam.gmail.com> wrote in message
>news:%23Q4$1KdyKHA.404(a)TK2MSFTNGP02.phx.gbl...
>> Joseph M. Newcomer wrote:
>>
>>
>>> Note also if you use a memory-mapped file and two
>>> processes share the same mapping object
>>> there is only one copy of the data in memory! This has
>>> not previously come up in
>>> discussions, but could be critical to your performance of
>>> multiple processes.
>>> joe
>>
>>
>> He has been told that MMF can help him.
>>
>> --
>> HLS
>
>Since my process (currently) requires unpredictable access
>to far more memory than can fit into the largest cache, I
>see no possible way that adding 1000-fold slower disk access
>could possibly speed things up. This seems absurd to me.
****
He has NO CLUE as to what a "memory-mapped file" actually is. This last comment indicates
total and complete cluelessness, plus a startling inability to understand that we are
making USEFUL suggestions because WE KNOW what is going on and he has no idea.

Like you, I'm giving up. There is only so long you can beat someone over the head with
good ideas which they reject because they have no idea what you are talking about, but
won't expend any energy to learn about, or ask questions about. Since he doesn't
understand what shared sections are, or what they buy, and that a MMF is the way to get
shared sections, I'm dropping out of this discussion. He has found a set of "experts" who
agree with him (your example apparently doesn't convey the problem correctly), thinks
memory-mapped files limit access to disk speed (not even understanding they are FASTER
than ReadFile!), and has failed utterly to understand even the most basic concepts of an
operating system. He thinks it is like an automatic transmission, where you can use it
without knowing or caring about how it works, when what he is really doing is trying to
build a competition racing machine and saying "all that stuff about the engine is
irrelevant", whereas anyone who does competition racing (like my next-door neighbor did
for years) knows why all this stuff is critical. If he were a racer, and we told him
about power-shifting (shifting a manual transmission without involving the clutch), he'd
tell us he didn't need to understand that.

Sad, really.
joe
***
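
For the record, a minimal sketch of the shared-section idea (the
section name and size are illustrative only, not from this thread):
two processes that create/open the same named file mapping see a
single copy of the pages in RAM.

#include <windows.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    const DWORD cb = 64 * 1024;     /* 64 KB, for illustration only */

    /* The first process to run creates the section; later processes
       open the same one. Windows keeps one copy of the pages in RAM. */
    HANDLE hMap = CreateFileMappingA(
        INVALID_HANDLE_VALUE,       /* backed by the pagefile, not a disk file */
        NULL, PAGE_READWRITE,
        0, cb, "Local\\SharedSectionDemo");
    if (hMap == NULL) return 1;

    BOOL creator = (GetLastError() != ERROR_ALREADY_EXISTS);

    char *view = (char*)MapViewOfFile(hMap, FILE_MAP_ALL_ACCESS, 0, 0, cb);
    if (view == NULL) { CloseHandle(hMap); return 1; }

    if (creator)
        strcpy(view, "written once, visible to every process that maps it");
    else
        printf("shared view says: %s\n", view);

    UnmapViewOfFile(view);
    CloseHandle(hMap);
    return 0;
}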
>
Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
From: Hector Santos on
Joseph M. Newcomer wrote:

>>>
>>> He has been told that MMF can help him.
>>>
>>> --
>>> HLS
>> Since my process (currently) requires unpredictable access
>> to far more memory than can fit into the largest cache, I
>> see no possible way that adding 1000-fold slower disk access
>> could possibly speed things up. This seems absurd to me.
> ****
> He has NO CLUE as to what a "memory-mapped file" actually is. This last comment indicates
> total and complete cluelessness, plus a startling inability to understand that we are
> making USEFUL suggestions because WE KNOW what is going on and he has no idea.


What he doesn't realize is that his 4GB loading is already
virtualized. He believes that all of that is in pure RAM. The page
faults prove that point, but he doesn't understand what that means.

He doesn't realize that his PC is technically a VIRTUAL MACHINE! He
doesn't understand the INTEL memory segmentation framework. Maybe he
thinks it's DOS? That is why I said if he wants PURE RAM operations,
he might be better off with a 16-bit DPMI DOS program or moving over
to a MOTOROLA chip that offers a linear memory model - if that is
still true today.
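
That point is easy to observe. A minimal sketch (mine, not part of
the simulator) that asks Windows for the current process's fault
count and working set; run it before and after touching a large
buffer and watch the numbers climb:

#include <windows.h>
#include <psapi.h>
#include <stdio.h>

/* Link with psapi.lib. Prints the page-fault count and working-set
   size of the current process. */
void ShowMemoryCounters(void)
{
    PROCESS_MEMORY_COUNTERS pmc;
    pmc.cb = sizeof(pmc);
    if (GetProcessMemoryInfo(GetCurrentProcess(), &pmc, sizeof(pmc))) {
        printf("page faults : %lu\n", pmc.PageFaultCount);
        printf("working set : %lu KB\n",
               (unsigned long)(pmc.WorkingSetSize / 1024));
    }
}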

> Like you, I'm giving up.


There are two parts:

First, I'm actually exploring scaling methods with the simulator I
wrote for him. I have a version where I am exploring NUMA that will
leverage 2003+ Windows technology. I am going to pencil in getting a
test computer with an Intel XEON that offers NUMA.
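
Something like this sketch is where I'd start (sizes and the node
choice are placeholders; GetNumaHighestNodeNumber is Server 2003+,
but VirtualAllocExNuma itself requires Vista/Server 2008 or later):

#include <windows.h>
#include <stdio.h>

/* Sketch: discover the NUMA layout, then allocate a buffer with a
   preferred node. */
int main(void)
{
    ULONG highest = 0;
    if (!GetNumaHighestNodeNumber(&highest)) return 1;
    printf("NUMA nodes: %lu\n", highest + 1);

    SIZE_T cb = 16 * 1024 * 1024;   /* 16 MB, preferred on node 0 */
    void *p = VirtualAllocExNuma(GetCurrentProcess(), NULL, cb,
                                 MEM_RESERVE | MEM_COMMIT,
                                 PAGE_READWRITE, 0 /* node */);
    if (p == NULL) return 1;

    /* ... threads affinitized to node 0 would touch this memory ... */

    VirtualFree(p, 0, MEM_RELEASE);
    return 0;
}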

Second, I might get some goodwill out of this if I can convince this
guy that he needs to change his application to perform better, or at
least understand that his old memory-usage paradigm for processes
does not apply under Windows. The only reason I can suspect for his
ignorance is that he is not a programmer, or at the very least has a
very primitive programming knowledge. A real Windows programmer would
understand these basic principles or at least explore what the
experts are saying. He is not even exploring anything!

> I'm dropping out of this discussion.


I should too.

--
HLS
From: Hector Santos on
Hector Santos wrote:

>
> Well, the thing is, Joe, that this chip cache is something he will
> be using.


I meant "is NOT something..."


--
HLS
From: Peter Olcott on
Perhaps you did not understand what I said. The essential process
inherently requires unpredictable access to memory, such that spatial
or temporal locality of reference in the cache rarely occurs.

"Hector Santos" <sant9442(a)gmail.com> wrote in message
news:e2aedb82-c9ad-44b3-8513-defe82cd876c(a)c16g2000yqd.googlegroups.com...
On Mar 22, 11:02 am, "Peter Olcott" <NoS...(a)OCR4Screen.com>
wrote:

> (2) When a process requires essentially random (mostly
> unpredictable) access to far more memory than can possibly
> fit into the largest cache, then actual memory access time
> becomes a much more significant factor in determining
> actual
> response time.

As a follow-up, here is the simulator's ProcessData() function:

void ProcessData()
{
    // KIND, nRepeat, size, data, and fmdata are globals defined
    // elsewhere in the simulator.
    KIND num;
    for (DWORD r = 0; r < nRepeat; r++) {
        Sleep(1);
        for (DWORD i = 0; i < size; i++) {
            //num = data[i];    // array
            num = fmdata[i];    // file mapping array view
        }
    }
}

This is serialized access to the data; it's not random. When you have
multiple threads, you approach an empirical boundary condition where
multiple accessors are requesting the same memory. So on one hand, the
Peter viewpoint, you have contention issues and hence slowdowns. On
the other hand, you have a CACHING effect, where the reading done by
one thread benefits all the others.

Now, we can alter this ProcessData() by adding random access logic:

void ProcessData()
{
    KIND num;
    for (DWORD r = 0; r < nRepeat; r++) {
        Sleep(1);
        for (DWORD i = 0; i < size; i++) {
            // Note: rand() returns 0..RAND_MAX, and RAND_MAX is only
            // 32767 in MSVC, so j actually covers just the first
            // 32768 of the 3,000,000 elements.
            DWORD j = (rand() % size);
            //num = data[j];    // array
            num = fmdata[j];    // file mapping array view
        }
    }
}
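
(Aside: a single rand() call cannot reach all 3,000,000 indices. If
full coverage mattered, one common workaround, my suggestion and not
part of the posted run, is to combine two calls:

    // two 15-bit values give a 30-bit range (0..2^30-1), enough
    // to span all 3,000,000 elements
    DWORD j = ((DWORD)rand() * ((DWORD)RAND_MAX + 1) + (DWORD)rand()) % size;

)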

One would suspect higher pressure to move virtual memory into the
process working set in random fashion. But in reality, that randomness
may not be as over-pressuring as you expect.

Let's test this randomness.

First, a test with serialized access, with two threads using a 1.5GB
file map.

V:\wc5beta>testpeter3t /r:2 /s:3000000 /t:2
- size : 3000000
- memory : 1536000000 (1500000K)
- repeat : 2
- Memory Load : 22%
- Allocating Data .... 0
* Starting threads
- Creating thread 0
- Creating thread 1
* Resuming threads
- Resuming thread# 0 in 743 msecs.
- Resuming thread# 1 in 868 msecs.
* Wait For Thread Completion
- Memory Load: 95%
* Done
---------------------------------------
0 | Time: 5734 | Elapsed: 0
1 | Time: 4906 | Elapsed: 0
---------------------------------------
Total Time: 10640

Notice the MEMORY LOAD climbed to 95%; that's because the entire
spectrum of the data was read in.

Now let's try unpredictable random access. I added a /j switch to
enable the random indexing.

V:\wc5beta>testpeter3t /r:2 /s:3000000 /t:2 /j
- size : 3000000
- memory : 1536000000 (1500000K)
- repeat : 2
- Memory Load : 22%
- Allocating Data .... 0
* Starting threads
- Creating thread 0
- Creating thread 1
* Resuming threads
- Resuming thread# 0 in 116 msecs.
- Resuming thread# 1 in 522 msecs.
* Wait For Thread Completion
- Memory Load: 23%
* Done
---------------------------------------
0 | Time: 4250 | Elapsed: 0
1 | Time: 4078 | Elapsed: 0
---------------------------------------
Total Time: 8328

BEHOLD, it is even faster because of the randomness. The memory load
didn't climb because it didn't need to virtually load the entire
1.5GB into the process working set.

So once again, your engineering philosophy (and lack thereof) is
completely off base. You are underutilizing the power of your machine.

--
HLS