From: Hector Santos on
Peter Olcott wrote:

> You keep bringing up memory mapped files. Although this may
> very well be a very good way to use disk as RAM, or to load
> RAM from disk, I do not see any possible reasoning that
> could ever possibly show that a hybrid combination of disk
> and RAM could ever exceed the speed of pure RAM alone.


The reason is that you are not using pure RAM for the entire load of
data. Windows virtualizes everything. In basic terms, there are two
kinds of memory maps: the ones Windows uses internally (that is how
you get system pages) and the ones that applications create
themselves.

What you think you are getting is "uninterrupted work", but it is
interrupted - that is what a preemptive operating system does, so
your application is never continuously in an active running state.
You perceive that it is, but it is not.

Think of it as video picture frames. You perceive uninterrupted live
animation or motion, but the reality is that these are picture
snapshots (frames) displayed very rapidly, with time gaps between the
frames! For a PC, it's called context switching, and these gaps allow
other things to run.

The same goes for MEMORY - it is virtualized, even if you have 8GB!

Unless you tell Windows:

Please do not CACHE this memory

then it is CACHED MEMORY.

So you have to issue an EXPLICIT instruction in your CODE to tell
Windows not to CACHE your memory.

Because you don't know how to do this, your application is using
BUFFERED I/O and CACHED, VIRTUALIZED MEMORY - by default.
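
For example, opting out takes explicit flags and calls. A minimal
sketch (the file name, sizes, and missing error handling are
placeholders, not code from your app):

    #include <windows.h>

    int main(void)
    {
        // FILE_FLAG_NO_BUFFERING bypasses the Windows file cache;
        // reads must then be sector-aligned and sector-sized.
        HANDLE hFile = CreateFileA("data.bin", GENERIC_READ, 0, NULL,
                                   OPEN_EXISTING,
                                   FILE_FLAG_NO_BUFFERING, NULL);

        // VirtualLock pins pages into the working set so they are
        // not paged out (bounded by the working-set quota, which
        // SetProcessWorkingSetSize can raise).
        SIZE_T cb = 4 * 1024 * 1024;
        SetProcessWorkingSetSize(GetCurrentProcess(), 2 * cb, 4 * cb);
        void *p = VirtualAlloc(NULL, cb, MEM_COMMIT, PAGE_READWRITE);
        VirtualLock(p, cb);

        // ... use hFile and p here ...
        return 0;
    }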

> Reasoning is the ONLY source of truth that I trust,
> all other sources of truth are subject to errors.


Reasoning begins with understanding the technology. If you don't
understand it, then you are in no position to judge experts, or to
presume conclusions that ignorantly contradict the realities -
realities understood by experts and by those who work with the
technology at a very practical level.

--
HLS
From: Hector Santos on
Yet again, the PROOF is not enough for you.

What you don't understand is that YOU and YOUR APPLICATION will never
deal with chip caching directly.

Your application deals with working sets and VIRTUAL MEMORY.

You proved that yourself when you pointed out the PAGE FAULTS - those
are VIRTUAL MEMORY OPERATIONS.

It has nothing to do with L1, L2, or L3 CHIP CACHING.
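
You can watch those counters yourself. A minimal sketch using
standard Win32/psapi calls (nothing here is specific to your app):

    #include <windows.h>
    #include <psapi.h>   // link with psapi.lib
    #include <stdio.h>

    void ShowVmCounters(void)
    {
        PROCESS_MEMORY_COUNTERS pmc = { sizeof(pmc) };
        if (GetProcessMemoryInfo(GetCurrentProcess(), &pmc, sizeof(pmc))) {
            // PageFaultCount counts virtual-memory faults (hard and
            // soft); it says nothing about the L1/L2/L3 chip caches.
            printf("Page faults : %lu\n", pmc.PageFaultCount);
            printf("Working set : %Iu bytes\n", pmc.WorkingSetSize);
        }
    }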

--
HLS

Peter Olcott wrote:

> Perhaps you did not understand what I said. The essential
> process inherently requires unpredictable access to memory
> such that cache spatial or temporal locality of reference
> rarely occurs.
>
> "Hector Santos" <sant9442(a)gmail.com> wrote in message
> news:e2aedb82-c9ad-44b3-8513-defe82cd876c(a)c16g2000yqd.googlegroups.com...
> On Mar 22, 11:02 am, "Peter Olcott" <NoS...(a)OCR4Screen.com>
> wrote:
>
>> (2) When a process requires essentially random (mostly
>> unpredictable) access to far more memory than can possibly fit
>> into the largest cache, then actual memory access time becomes a
>> much more significant factor in determining actual response time.
>
> As a follow-up, in the simulator's ProcessData() function:
>
> void ProcessData()
> {
>     KIND num;
>     for (DWORD r = 0; r < nRepeat; r++) {
>         Sleep(1);
>         for (DWORD i = 0; i < size; i++) {
>             //num = data[i];   // array
>             num = fmdata[i];   // file mapping array view
>         }
>     }
> }
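>
> For reference, fmdata above is a file-mapping view; a minimal
> sketch of how such a view can be created (the file name is a
> placeholder):
>
>     HANDLE hFile = CreateFileA("test.map",
>                                GENERIC_READ | GENERIC_WRITE, 0, NULL,
>                                OPEN_ALWAYS, FILE_ATTRIBUTE_NORMAL,
>                                NULL);
>     HANDLE hMap  = CreateFileMappingA(hFile, NULL, PAGE_READWRITE,
>                                       0, size * sizeof(KIND), NULL);
>     KIND *fmdata = (KIND *) MapViewOfFile(hMap, FILE_MAP_ALL_ACCESS,
>                                           0, 0, size * sizeof(KIND));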
>
> That ProcessData() loop is serialized access to the data; it is not
> random. When you have multiple threads, you approach an empirical
> boundary condition where multiple accessors request the same
> memory. So on one hand - the Peter viewpoint - you have contention
> issues, hence slowdowns. On the other hand, you have a CACHING
> effect, where the reading done by one thread benefits all the
> others.
>
> Now, we can alter this ProcessData() by adding random-access logic:
>
> void ProcessData()
> {
>     KIND num;
>     for (DWORD r = 0; r < nRepeat; r++) {
>         Sleep(1);
>         for (DWORD i = 0; i < size; i++) {
>             DWORD j = (rand() % size);
>             //num = data[j];   // array
>             num = fmdata[j];   // file mapping array view
>         }
>     }
> }
>
> One would expect higher pressure to move virtual memory into the
> process working set in random fashion. But in reality, that
> randomness may not apply as much pressure as you expect.
>
> Let's test this randomness.
>
> First, a test with serialized access, with two threads using a
> 1.5GB file map.
>
> V:\wc5beta>testpeter3t /r:2 /s:3000000 /t:2
> - size : 3000000
> - memory : 1536000000 (1500000K)
> - repeat : 2
> - Memory Load : 22%
> - Allocating Data .... 0
> * Starting threads
> - Creating thread 0
> - Creating thread 1
> * Resuming threads
> - Resuming thread# 0 in 743 msecs.
> - Resuming thread# 1 in 868 msecs.
> * Wait For Thread Completion
> - Memory Load: 95%
> * Done
> ---------------------------------------
> 0 | Time: 5734 | Elapsed: 0
> 1 | Time: 4906 | Elapsed: 0
> ---------------------------------------
> Total Time: 10640
>
> Notice the MEMORY LOAD climbed to 95%; that's because the entire
> spectrum of the data was read in.
>
> Now let's try unpredictable random access. I added a /j switch to
> enable the random indexing.
>
> V:\wc5beta>testpeter3t /r:2 /s:3000000 /t:2 /j
> - size : 3000000
> - memory : 1536000000 (1500000K)
> - repeat : 2
> - Memory Load : 22%
> - Allocating Data .... 0
> * Starting threads
> - Creating thread 0
> - Creating thread 1
> * Resuming threads
> - Resuming thread# 0 in 116 msecs.
> - Resuming thread# 1 in 522 msecs.
> * Wait For Thread Completion
> - Memory Load: 23%
> * Done
> ---------------------------------------
> 0 | Time: 4250 | Elapsed: 0
> 1 | Time: 4078 | Elapsed: 0
> ---------------------------------------
> Total Time: 8328
>
> BEHOLD, it is even faster because of the randomness. The memory
> load didn't climb because Windows didn't need to load the entire
> 1.5GB into the process working set.
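>
> For reference, the "Memory Load" figure presumably comes from
> something like the following (a sketch, not the actual harness
> code):
>
>     MEMORYSTATUSEX ms = { sizeof(ms) };
>     GlobalMemoryStatusEx(&ms);
>     printf("- Memory Load : %lu%%\n", ms.dwMemoryLoad);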
>
> So once again, your engineering philosophy (and the lack thereof)
> is completely off base. You are underutilizing the power of your
> machine.
>
> --
> HLS
>
>

From: Peter Olcott on

"Joseph M. Newcomer" <newcomer(a)flounder.com> wrote in
message news:j0dfq5d859v5mbgtjfhcudtbb2d1re8f3d(a)4ax.com...
> See below...
> On Mon, 22 Mar 2010 10:02:33 -0500, "Peter Olcott"
> <NoSpam(a)OCR4Screen.com> wrote:
>
>>
>>"Joseph M. Newcomer" <newcomer(a)flounder.com> wrote in
>>message news:ioueq5hdsf5ut5pha6ttt88e1ghl4q9l1m(a)4ax.com...
>>> See below...
>>> On Sun, 21 Mar 2010 21:06:20 -0500, "Peter Olcott"
>>> <NoSpam(a)OCR4Screen.com> wrote:
>>>
>>>>
>>>>"Joseph M. Newcomer" <newcomer(a)flounder.com> wrote in
>>>>message
>>>>news:vmvcq55tuhj1lunc6qcdi9uejup4jg1i4e(a)4ax.com...
>>>>> Note in the i7 architecture the L3 cache is shared across all
>>>>> CPUs, so you are less likely to be hit by raw memory bandwidth
>>>>> (which, compared to a CPU, is dead-slow), and the answer as to
>>>>> whether multiple threads will work effectively can only be
>>>>> determined by measurement of a multithreaded app.
>>>>>
>>>>> Because your logic seems to indicate that raw memory speed is
>>>>> the limiting factor, and you have not accounted for the effects
>>>>> of a shared L3 cache, any opinion you offer on what is going to
>>>>> happen is meaningless. In fact, any opinion about performance
>>>>> is by definition meaningless; only actual measurements
>>>>> represent facts ("If you can't express it in numbers, it ain't
>>>>> science, it's opinion" -- Robert A. Heinlein).
>>>>
>>>>(1) Machine A performs process B in X minutes.
>>>>(2) Machine C performs process B in X/8 minutes (8-fold faster).
>>>>(3) The only difference between machine A and machine C is that
>>>>machine C has much faster access to RAM (by whatever means).
>>>>(4) Therefore process B is memory-bandwidth bound.
>>> ***
>>> Fred can dig a ditch 10 feet long in 1 hour. Charlie can dig a
>>> ditch 10 feet long in 20 minutes. Therefore, Charlie is faster
>>> than Fred by a factor of 3.
>>>
>>> How long does it take Fred and Charlie, working together, to dig
>>> a ditch 10 feet long? (Hint: any mathematical answer you come up
>>> with is wrong, because Fred and Charlie (a) hate each other, so
>>> Charlie tosses his dirt into the place Fred has to dig, or (b)
>>> are good buddies and stop for a beer halfway through the digging,
>>> or (c) Charlie tells Fred he can do it faster by himself, and
>>> Fred just sits there while Charlie does all the work and finishes
>>> in 20 minutes, after which they go out for a beer. Fred buys.)
>>>
>>> You have made an obvious error here in thinking that if one
>>> thread takes 1/k the time and the only difference is memory
>>> bandwidth, then two threads are necessarily LINEAR. Duh! IT IS
>>> NOT THE SAME WHEN CACHES ARE INVOLVED! YOU HAVE NO DATA! You are
>>> jumping to an
>>
>>(1) People in a more specialized group are coming to the
>>same conclusions that I have derived.
> ****
> How? I have no idea how to predict L3 cache performance on an i7
> system, and I don't believe they do, either. No theoretical model
> exists that is going to predict actual behavior, short of a
> detailed simulation, and I talked to Intel and they are not
> releasing performance statistics, period, so there is no way, short
> of running the experiment, to obtain a meaningful result.

Try to explain exactly how a cache can possibly help when there is
most often essentially no spatial or temporal locality of reference.

> ****
>>
>>(2) When a process requires essentially random (mostly
>>unpredictable) access to far more memory than can possibly fit into
>>the largest cache, then actual memory access time becomes a much
>>more significant factor in determining actual response time.
> ****
> What is your cache collision ratio, actually? Do you
> really understand the L3 cache
> replacement algorithm? (I can't find out anything about
> it on the Intel site! So I'm
> surprised you have this information, which Intel considers
> Corporate Confidential)
> ****
>>
>>> unwarranted conclusion based on what, as best I can tell, is a
>>> coincidence. And even if it were true, caches give nonlinear
>>> effects, so you are not even making sense when you make these
>>> assertions! You have proven a case for one value of N, but you
>>> have immediately assumed that if you prove the case for N, you
>>> have proven it for case N+1, which is NOT how inductive proofs
>>> work! You were so hung up on geometric proofs; can you explain
>>> how, when doing an inductive proof, proving the case for one
>>> element tells you the result for N+1 for an arbitrary value N?
>>> Hell, it doesn't even tell you the result for N=1, but you have
>>> immediately assumed that it is a valid proof for all values of N!
>>>
>>> YOU HAVE NO DATA! You are making a flawed assumption of
>>> linearity that has no basis! Given your fixation on proof: in a
>>> nonlinear system without a closed-form analytic solution,
>>> demonstrate to me that your only possible solution is based on a
>>> linear assumption. You are ignoring all forms of reality here.
>>> You are asserting without basis that the system is linear (it is
>>> known that systems with caches are nonlinear in memory
>>> performance). So you are contradicting known reality without any
>>> evidence to support your "axiom". It ain't an axiom, it's a
>>> wild-assed guess.
>>>
>>> Until you can demonstrate with actual measured performance that
>>> your system exhibits COMPLETELY linear behavior in an L3 cache
>>> system, there is no reason to listen to any of this nonsense you
>>> keep espousing as if it were "fact". You have ONE fact, and that
>>> is not enough to raise your hypothesis to the level of "axiom".
>>>
>>> All you have proven is that a single thread is limited by memory
>>> bandwidth. You have no reason to infer that two threads will not
>>> BOTH run faster because of the L3 cache effects. And you have
>>> ignored L1/L2 cache effects. You have a trivial example from
>>> which NOTHING can be inferred about multithreaded performance.
>>> You have consistently confused multiprocess programming with
>>> multithreading and arrived at erroneous conclusions based on
>>> flawed experiments.
>>>
>>> Note also that if you use a memory-mapped file and two processes
>>> share the same mapping object, there is only one copy of the data
>>> in memory! This has not previously come up in discussions, but it
>>> could be critical to your performance with multiple processes.
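>>>
>>> A minimal sketch of the idea (the section name is made up):
>>>
>>>     // Process A creates a named, pagefile-backed shared section:
>>>     HANDLE hMap = CreateFileMappingA(INVALID_HANDLE_VALUE, NULL,
>>>                                      PAGE_READWRITE, 0, 1 << 20,
>>>                                      "Local\\PeterSharedData");
>>>     char *viewA = (char *) MapViewOfFile(hMap, FILE_MAP_ALL_ACCESS,
>>>                                          0, 0, 0);
>>>
>>>     // Process B opens the same section; both views reference the
>>>     // same physical pages - one copy of the data in memory:
>>>     HANDLE hMap2 = OpenFileMappingA(FILE_MAP_ALL_ACCESS, FALSE,
>>>                                     "Local\\PeterSharedData");
>>>     char *viewB = (char *) MapViewOfFile(hMap2, FILE_MAP_ALL_ACCESS,
>>>                                          0, 0, 0);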
>>> joe
>>> ****
>
>>> Joseph M. Newcomer [MVP]
>>> email: newcomer(a)flounder.com
>>> Web: http://www.flounder.com
>>> MVP Tips: http://www.flounder.com/mvp_tips.htm
>>
> Joseph M. Newcomer [MVP]
> email: newcomer(a)flounder.com
> Web: http://www.flounder.com
> MVP Tips: http://www.flounder.com/mvp_tips.htm


From: Peter Olcott on

"Joseph M. Newcomer" <newcomer(a)flounder.com> wrote in
message news:ecdfq5lb57qrou47d1ppaupsi6t2guu7nv(a)4ax.com...
> See below...
>
> On Mon, 22 Mar 2010 10:31:17 -0500, "Peter Olcott"
> <NoSpam(a)OCR4Screen.com> wrote:
>
>>
>>"Hector Santos" <sant9442(a)nospam.gmail.com> wrote in
>>message
>>news:%23Q4$1KdyKHA.404(a)TK2MSFTNGP02.phx.gbl...
>>> Joseph M. Newcomer wrote:
>>>
>>>
>>>> Note also that if you use a memory-mapped file and two processes
>>>> share the same mapping object, there is only one copy of the
>>>> data in memory! This has not previously come up in discussions,
>>>> but it could be critical to your performance with multiple
>>>> processes.
>>>> joe
>>>
>>>
>>> He has been told that MMF can help him.
>>>
>>> --
>>> HLS
>>
>>Since my process (currently) requires unpredictable access to far
>>more memory than can fit into the largest cache, I see no possible
>>way that adding 1000-fold slower disk access could possibly speed
>>things up. This seems absurd to me.
> ****
> He has NO CLUE as to what a "memory-mapped file" actually
> is. This last comment indicates

This is very likely true. Let's just drop this one until
someone explains all of the little nuances of exactly how
cache can greatly improve performance in the case where
there is essentially no spatial or temporal locality of
reference.

> total and complete cluelessness, plus a startling inability to
> understand that we are making USEFUL suggestions because WE KNOW
> what is going on and he has no idea.
>
> Like you, I'm giving up. There is only so long you can beat someone
> over the head with good ideas which they reject because they have
> no idea what you are talking about, but won't expend any energy to
> learn about, or ask questions about. Since he doesn't understand
> what shared sections are, or what they buy, and that a MMF is the
> way to get shared sections, I'm dropping out of this discussion. He
> has found a set of "experts" who agree with him (your example
> apparently doesn't convey the problem correctly), thinks
> memory-mapped files are limited to disk speed (not even
> understanding they are FASTER than ReadFile!), and has failed
> utterly to understand even the most basic concepts of an operating
> system (thinking it is like an automatic transmission, where you
> can use it without knowing or caring about how it works, when what
> he is really doing is trying to build a competition racing machine
> and saying "all that stuff about the engine is irrelevant", whereas
> anyone who does competition racing (like my next-door neighbor did
> for years) knows why all this stuff is critical). If he were a
> racer, and we told him about power-shifting (shifting a manual
> transmission without involving the clutch), he'd tell us he didn't
> need to understand that.
>
> Sad, really.
> joe
> ***
>>
> Joseph M. Newcomer [MVP]
> email: newcomer(a)flounder.com
> Web: http://www.flounder.com
> MVP Tips: http://www.flounder.com/mvp_tips.htm


From: Hector Santos on
Peter Olcott wrote:

> Try to explain exactly how a cache can possibly help when there is
> most often essentially no spatial or temporal locality of
> reference.


It's called WINDOWS Virtual Memory Caching technology.

This is not DOS. You are not dealing directly with the CHIP here.

You need to stop skimming material, latching onto a new "buzz word",
thinking you've got an "AH HA", and believing it validates your
erroneous understanding of Windows programming.
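
To make the effect concrete, here is a minimal sketch (the function
and its parameters are made up for illustration): even "random" reads
through a mapped view land on whole 4KB pages, and Windows keeps
recently touched pages resident, so most reads never go near the
disk.

    #include <windows.h>
    #include <stdlib.h>

    // Randomly index a mapped view of DWORDs. Each hard page fault
    // maps in a whole 4KB page (1024 DWORDs at once), and the memory
    // manager keeps hot pages in the working set and on the standby
    // list, so repeated hits run at RAM speed despite the random
    // access pattern. (rand() is used for brevity; its range is
    // limited to RAND_MAX.)
    DWORD ReadRandom(const DWORD *view, DWORD size, DWORD nReads)
    {
        DWORD sum = 0;
        for (DWORD n = 0; n < nReads; n++)
            sum += view[rand() % size];
        return sum;
    }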

--
HLS