From: Peter Olcott on

"Hector Santos" <sant9442(a)nospam.gmail.com> wrote in message
news:uqklKcfyKHA.2552(a)TK2MSFTNGP04.phx.gbl...
> Peter Olcott wrote:
>
>> You keep bringing up memory mapped files. Although this
>> may very well be a very good way to use disk as RAM, or
>> to load RAM from disk, I do not see any possible
>> reasoning that could ever possibly show that a hybrid
>> combination of disk and RAM could ever exceed the speed
>> of pure RAM alone.
>
>
> The reason is that you are not using pure RAM for the
> entire load of data. Windows will virtualize everything.
> In basic terms, there are two kinds of memory maps: the
> ones Windows uses internally, which is how you get system
> pages, and the ones that applications create themselves.

It loads my data and then the process monitor tells me that
there are no page faults even when the process is invoked 12
hours later.
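
(The same counter can also be read programmatically; this is
only a minimal sketch using the documented PSAPI call, with
error handling omitted:)

#include <windows.h>
#include <psapi.h>    // link with psapi.lib
#include <stdio.h>

// Print the cumulative page-fault count for this process.
void ShowPageFaults()
{
    PROCESS_MEMORY_COUNTERS pmc = { sizeof(pmc) };
    if (GetProcessMemoryInfo(GetCurrentProcess(), &pmc, sizeof(pmc)))
        printf("page faults so far: %lu\n", pmc.PageFaultCount);
}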

>
> What you think you are getting is "uninterrupted work,"
> but it is interrupted - that's called a Preemptive
> Operating System, so your application is never
> continuously in an active running state. You perceive
> that it is, but it is not.
>
> Think of it as video picture frames. You perceive an
> uninterrupted live animation or motion - the reality is
> that these are picture snapshots (frames) displayed very
> rapidly, and there are time gaps between frames! For a
> PC, it's called context switching, and these gaps allow
> other things to run.
>
> The same with MEMORY - it is virtualized, even if you have
> 8GB!
>
> Unless you tell Windows:
>
> Please do not CACHE this memory
>
> then it is CACHED MEMORY.
>
> So you have to make an EXPLICIT instruction via your CODE
> to tell Windows not to CACHE your memory.
>
> Your application, because you don't know how to do this, is
> using BUFFERED I/O, CACHED, VIRTUALIZED MEMORY - by default.
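>
> For example (a sketch only - the file name is illustrative,
> and with FILE_FLAG_NO_BUFFERING the buffer address, file
> offset, and read size must all be multiples of the volume
> sector size):
>
>    HANDLE h = CreateFile("data.bin", GENERIC_READ, 0, NULL,
>        OPEN_EXISTING,
>        FILE_FLAG_NO_BUFFERING,   // bypass the system file cache
>        NULL);
>    // VirtualAlloc returns page-aligned memory, which satisfies
>    // the sector-alignment requirement.
>    BYTE *buf = (BYTE*)VirtualAlloc(NULL, 65536, MEM_COMMIT,
>        PAGE_READWRITE);
>    DWORD got;
>    ReadFile(h, buf, 65536, &got, NULL);  // uncached read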
>
>> Reasoning is the ONLY source of truth that I trust, all
>> other sources of truth are subject to errors.
>
>
> Reasoning comes first by understanding the technology. If
> you don't understand it, then you have no right to judge
> experts, or anything else, or to presume any conclusions
> about it that ignorantly contradict the realities -
> realities understood by experts and by those who understand
> the technology at a very practical level.
>
> --
> HLS


From: Peter Olcott on

"Hector Santos" <sant9442(a)nospam.gmail.com> wrote in message
news:ehG4qdfyKHA.2552(a)TK2MSFTNGP04.phx.gbl...
> Yet again, the PROOF is not enough for you.
>
> What you don't understand is that YOU, YOUR APPLICATION
> will never deal with chip caching.
>
> Your application deals with working sets and VIRTUAL
> MEMORY.
>
> You proved that when you indicated the PAGE FAULTS - those
> are VIRTUAL MEMORY OPERATIONS.
>
> It has nothing to do with the L1, L2, L3 CHIP CACHING.

No, not quite nothing: both chip caching and page faults have
to do with memory access.

L1, L2, L3 chip caching is also very dependent upon spatial
and/or temporal locality of reference. Maybe you know a way
that chip caching can work without either temporal or
spatial locality of reference; I do not.

As far as I can tell, the only general approach to chip
caching that could possibly work without depending upon
locality of reference would be for the chip to somehow
comprehend enough of the underlying algorithm to predict
memory access patterns.
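
(As a crude illustration of that dependence - a sketch only,
with arbitrary sizes - timing the same number of reads done
sequentially versus at scattered offsets over an array much
larger than any cache:)

#include <cstdio>
#include <vector>
#include <windows.h>   // GetTickCount()

int main()
{
    const size_t N = 64 * 1024 * 1024;   // 256MB of ints
    std::vector<int> data(N, 1);
    long sum = 0;

    DWORD t0 = GetTickCount();
    for (size_t i = 0; i < N; i++)       // sequential: cache-friendly
        sum += data[i];
    DWORD t1 = GetTickCount();
    for (size_t i = 0; i < N; i++)       // scattered: defeats locality
        sum += data[(i * 2654435761u) % N];
    DWORD t2 = GetTickCount();

    printf("sequential %lu ms, scattered %lu ms (sum %ld)\n",
           t1 - t0, t2 - t1, sum);
    return 0;
}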

>
> --
> HLS
>
> Peter Olcott wrote:
>
>> Perhaps you did not understand what I said. The essential
>> process inherently requires unpredictable access to
>> memory such that cache spatial or temporal locality of
>> reference rarely occurs.
>>
>> "Hector Santos" <sant9442(a)gmail.com> wrote in message
>> news:e2aedb82-c9ad-44b3-8513-defe82cd876c(a)c16g2000yqd.googlegroups.com...
>> On Mar 22, 11:02 am, "Peter Olcott"
>> <NoS...(a)OCR4Screen.com> wrote:
>>
>>> (2) When a process requires essentially random (mostly
>>> unpredictable) access to far more memory than can possibly
>>> fit into the largest cache, then actual memory access time
>>> becomes a much more significant factor in determining
>>> actual response time.
>>
>> As a follow-up, in the simulator ProcessData() function:
>>
>> void ProcessData()
>> {
>>     KIND num;
>>     for (DWORD r = 0; r < nRepeat; r++) {
>>         Sleep(1);
>>         for (DWORD i = 0; i < size; i++) {
>>             //num = data[i];      // array
>>             num = fmdata[i];      // file mapping array view
>>         }
>>     }
>> }
>>
>> This is serialized access to the data. It's not random.
>> When you have multiple threads, you approach an empirical
>> boundary condition where multiple accessors are requesting
>> the same memory. So on one hand, the Peter viewpoint, you
>> have contention issues, hence slowdowns. On the other
>> hand, you have a CACHING effect, where the reading done by
>> one thread benefits all the others.
>>
>> Now, we can alter this ProcessData() by adding random
>> access logic:
>>
>> void ProcessData()
>> {
>>     KIND num;
>>     for (DWORD r = 0; r < nRepeat; r++) {
>>         Sleep(1);
>>         for (DWORD i = 0; i < size; i++) {
>>             DWORD j = (rand() % size);
>>             //num = data[j];      // array
>>             num = fmdata[j];      // file mapping array view
>>         }
>>     }
>> }
>>
>> One would suspect higher pressure to move virtual memory
>> into the process working set in random fashion. But in
>> reality, that randomness may not be as overpressuring as
>> you expect.
>>
>> Let's test this randomness.
>>
>> First, a test with serialized access with two threads
>> using a 1.5GB file map.
>>
>> V:\wc5beta>testpeter3t /r:2 /s:3000000 /t:2
>> - size : 3000000
>> - memory : 1536000000 (1500000K)
>> - repeat : 2
>> - Memory Load : 22%
>> - Allocating Data .... 0
>> * Starting threads
>> - Creating thread 0
>> - Creating thread 1
>> * Resuming threads
>> - Resuming thread# 0 in 743 msecs.
>> - Resuming thread# 1 in 868 msecs.
>> * Wait For Thread Completion
>> - Memory Load: 95%
>> * Done
>> ---------------------------------------
>> 0 | Time: 5734 | Elapsed: 0
>> 1 | Time: 4906 | Elapsed: 0
>> ---------------------------------------
>> Total Time: 10640
>>
>> Notice the MEMORY LOAD climbed to 95%; that's because the
>> entire spectrum of the data was read in.
>>
>> Now let's try unpredictable random access. I added a /j
>> switch to enable the random indexing.
>>
>> V:\wc5beta>testpeter3t /r:2 /s:3000000 /t:2 /j
>> - size : 3000000
>> - memory : 1536000000 (1500000K)
>> - repeat : 2
>> - Memory Load : 22%
>> - Allocating Data .... 0
>> * Starting threads
>> - Creating thread 0
>> - Creating thread 1
>> * Resuming threads
>> - Resuming thread# 0 in 116 msecs.
>> - Resuming thread# 1 in 522 msecs.
>> * Wait For Thread Completion
>> - Memory Load: 23%
>> * Done
>> ---------------------------------------
>> 0 | Time: 4250 | Elapsed: 0
>> 1 | Time: 4078 | Elapsed: 0
>> ---------------------------------------
>> Total Time: 8328
>>
>> BEHOLD, it is even faster because of the randomness. The
>> memory loading didn't climb because it didn't need to
>> virtually load the entire 1.5GB into the process working
>> set.
>>
>> So once again, your engineering (and lack thereof)
>> philosophy is completely off base. You are underutilizing
>> the power of your machine.
>>
>> --
>> HLS
>


From: Joseph M. Newcomer on

On Mon, 22 Mar 2010 11:14:27 -0400, Hector Santos <sant9442(a)nospam.gmail.com> wrote:

>Peter Olcott wrote:
>
>
>> A group with a more specialized focus is coming to the same
>> conclusions that I have derived.
>
>Oh Peter, you're fibbing! The simulator I provided is a classic
>example of an expert on the subject in action. If you wanted to learn
>anything here, you should study it.
****
Actually, I believe he is telling the truth. He has fallen into a group that is largely
clueless also, but they look like experts because they are agreeing with him.
****
>
>The process handler emulates your MEMORY ACCESS claims to the fullest
>extent, with a minimum of opcodes beyond the work itself. Any engineer
>(and by the way, I am a trained Chemical Engineer) with process
>control and simulation experience can easily see that the work I
>showed is proof invalidating your understanding, and it shows how
>multiple threads with shared memory are superior to your single main
>thread process idea.
***
In science, it takes only ONE counterexample to sink any theory. You have the
counterexample, Peter has the theory. Q.E.D.
****
>
>If you can't see that in the code, then quite honestly, you don't know
>how to program or understand the concept of programming.
Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
From: Joseph M. Newcomer on
See below...

On Mon, 22 Mar 2010 14:30:46 -0400, Hector Santos <sant9442(a)nospam.gmail.com> wrote:

>Joseph M. Newcomer wrote:
>
>>> (1) People in a more specialized group are coming to the
>>> same conclusions that I have derived.
>> ****
>> How? I have no idea how to predict L3 cache performance on an i7 system, and I don't
>> believe they do, either. No theoretical model exists that is going to predict actual
>> behavior, short of a detailed simulation, and I talked to Intel and they are not releasing
>> performance statistics, period, so there is no way short of running the experiment to
>> obtain a meaningful result.
>> ****
>
>
>Have you seen the posted C/C++ simulator and proof that shows how
>using multiple threads and shared data trumps his single main thread
>process theory?
***
Yes. Note that I mentioned that your counterexample trumps his theory. His theory is so
full of holes it is hard to imagine why he is clinging to it with such ferocity, given we
keep telling him he is wrong.

And the CORRECT approach, if he believed that your code doesn't represent his problem
domain, would be to read it, modify it to fit his model, and run it. But that would
potentially expose his theory to absolute destruction, or give him useful data by which he
could determine what is going to happen, and neither of those seems to be his priority. His
failure to understand or even look into Memory Mapped Files, but instead come up with some
off-the-wall idea of how they behave which, unfortunately, is not at all like they
ACTUALLY behave, or realize that using them with a shared mapping object would reduce the
memory footprint of multiple processes, is indicative of a completely closed mind. We
are really wasting our time here; he doesn't want to get answers, just to tell us why we
are wrong. And he ignores the fact that we have been doing multithreading decades longer
than he has. I've been doing it since 1975. Or 1968, depending on what criteria you apply.
And when I point out obvious aspects he has ignored, such as cache influence, he tells
me he doesn't need to know this, because he wants to think of the OS as a "black box"
that works according to his imagined ideals, not how actual operating systems work.
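
(A minimal sketch of what a shared mapping object looks like - the
file name and mapping name here are only illustrative; a second
process opening the same name maps the same physical pages rather
than making another copy:)

// Process A: create a named mapping backed by the file.
HANDLE hFile = CreateFile("data.bin", GENERIC_READ, FILE_SHARE_READ,
                          NULL, OPEN_EXISTING, 0, NULL);
HANDLE hMap  = CreateFileMapping(hFile, NULL, PAGE_READONLY,
                                 0, 0, "Local\\OcrData");
const BYTE *view = (const BYTE*)MapViewOfFile(hMap, FILE_MAP_READ,
                                              0, 0, 0);

// Process B: open the same mapping by name; no second copy is made.
HANDLE hMap2 = OpenFileMapping(FILE_MAP_READ, FALSE, "Local\\OcrData");
const BYTE *view2 = (const BYTE*)MapViewOfFile(hMap2, FILE_MAP_READ,
                                               0, 0, 0);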
****
>
>
>>> (2) When a process requires essentially random (mostly
>>> unpredictable) access to far more memory than can possibly
>>> fit into the largest cache, then actual memory access time
>>> becomes a much more significant factor in determining actual
>>> response time.
>> ****
>> What is your cache collision ratio, actually? Do you really understand the L3 cache
>> replacement algorithm? (I can't find out anything about it on the Intel site! So I'm
>> surprised you have this information, which Intel considers Corporate Confidential)
>> ****
>
>
>Well, the thing is, Joe, that this chip cache is not something he
>will be using directly. His application will be using the cache the
>OS maintains.
>
>He is thinking about stuff that he shouldn't be worrying about. He
>thinks his CODE deals directly with the chip caches.
****
A fact I keep beating him over the head with, but he chooses to ignore reality and
experience in favor of erroneous experiments that have provided no useful information. Note
that all his ranting is based on ONE experiment between incomparable systems that
measures only ONE thread, and fails to take into account nonlinearities of caching. And
he refuses to listen to alternative suggestions because he misunderstands the technology
and doesn't appreciate what is really going on inside the OS or the hardware.

But that's pretty obvious, which is why I've given up making suggestions; he simply won't
listen to anyone except this hypothetical group of experts who must be right because they
agree with him.
joe
Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
From: Hector Santos on
Peter Olcott wrote:

> "Hector Santos"


>> It has nothing to do with the L1, L2, L3 CHIP CACHING.


> As far as I can tell the only possible general approach to
> chip caching that could possibly work that does not depend
> upon locality of reference would be for the chip to somehow
> comprehend enough of the underlying algorithm to predict
> memory access patterns.

Here's the thing:

Why are you worrying about this when you don't even know how to
program for it at any level?

You are at the USER LEVEL, not the KERNEL LEVEL or FILE DRIVER LEVEL!!

Do you really think your application needs Advanced Memory Chip
technology in order to work?

Do you really think this all works in slow motion?

Your application is extremely primitive - it really is. You have
too much belief that your application is BEYOND the needs of any
other application with data loading needs.

You have no engineering sense whatsoever. Even if you say that your
response time is 100ms - SO WHAT if it's 1100 ms with multiple threads?

If your response time is 100ms, worrying about CHIP LEVEL stuff is crazy.


--
HLS