From: Hector Santos on
Peter Olcott wrote:

>> The code I posted is the proof!
>
> If it requires essentially nothing besides random access to
> entirely different places of 100 MB of memory, then (and
> only then) would it be reasonably representative of my
> process. Nearly all my process does is look up in memory
> the next place to look up in memory.

Good point, and moments ago I proved how random access is even BETTER!
I posted the work via Google Groups and I have not seen it yet here
on Microsoft's server. Let's wait 10 minutes or so before I post it
again here... FT! Here it is:

> (2) When a process requires essentially random (mostly
> unpredictable) access to far more memory than can possibly
> fit into the largest cache, then actual memory access time
> becomes a much more significant factor in determining actual
> response time.

As a follow-up, here is the simulator's ProcessData() function:

void ProcessData()
{
    KIND num;
    for (DWORD r = 0; r < nRepeat; r++) {
        Sleep(1);
        for (DWORD i = 0; i < size; i++) {
            //num = data[i];    // array
            num = fmdata[i];    // file mapping array view
        }
    }
}

This is serialized access to the data; it's not random. When you have
multiple threads, you approach an empirical boundary condition where
multiple accessors are requesting the same memory. So on one hand,
the Peter viewpoint, you have a contention issue, hence slowdowns. On
the other hand, you have a CACHING effect, where the reading done
by one thread benefits all the others.

Now, we can alter ProcessData() by adding random-access logic:

void ProcessData()
{
    KIND num;
    for (DWORD r = 0; r < nRepeat; r++) {
        Sleep(1);
        for (DWORD i = 0; i < size; i++) {
            DWORD j = (rand() % size);
            //num = data[j];    // array
            num = fmdata[j];    // file mapping array view
        }
    }
}

One would suspect higher pressure to move virtual memory into the
process working set in random fashion. But in reality, that
randomness may not apply as much pressure as you expect.

Let's test this randomness. First, a test with serialized access using
two threads and a 1.5 GB file map.


V:\wc5beta>testpeter3t /r:2 /s:3000000 /t:2
- size : 3000000
- memory : 1536000000 (1500000K)
- repeat : 2
- Memory Load : 22%
- Allocating Data .... 0
* Starting threads
- Creating thread 0
- Creating thread 1
* Resuming threads
- Resuming thread# 0 in 743 msecs.
- Resuming thread# 1 in 868 msecs.
* Wait For Thread Completion
- Memory Load: 95%
* Done
---------------------------------------
0 | Time: 5734 | Elapsed: 0
1 | Time: 4906 | Elapsed: 0
---------------------------------------
Total Time: 10640

Notice the MEMORY LOAD climbed to 95%; that's because the entire
spectrum of the data was read in.

Now let's try unpredictable random access. I added a /j switch to
enable the random indexing.

V:\wc5beta>testpeter3t /r:2 /s:3000000 /t:2 /j
- size : 3000000
- memory : 1536000000 (1500000K)
- repeat : 2
- Memory Load : 22%
- Allocating Data .... 0
* Starting threads
- Creating thread 0
- Creating thread 1
* Resuming threads
- Resuming thread# 0 in 116 msecs.
- Resuming thread# 1 in 522 msecs.
* Wait For Thread Completion
- Memory Load: 23%
* Done
---------------------------------------
0 | Time: 4250 | Elapsed: 0
1 | Time: 4078 | Elapsed: 0
---------------------------------------
Total Time: 8328

BEHOLD, it is even faster because of the randomness. The memory
load didn't climb because it didn't need to load the entire 1.5 GB
into the process working set.

So once again, your engineering philosophy (and lack thereof) is
completely off base. You are underutilizing the power of your
machine.

I take pride in my extremely rich real-world modeling experience, a
good bit of it in simulating advanced systems like a Nuclear Power
Plant, Wind Turbines, Steam Generators, Process Control Systems, Fault
Tolerance, Failure Analysis, Prediction Systems, AI and Expert
Systems, etc, etc, etc. I am very very very GOOD at this and I carry
that knowledge into the design and QA engineering of my product lines.

Your application is PRIMITIVE when it comes to the real-world
applications out there. Case in point: just for our FTP server alone,
we have customers that are uploading HUGE files, including gigabytes.
Those have to be processed by many parts of the entire framework in a
scaled manner, even if it's just one request that happens to come in.

--
HLS
From: Mikel on
On 22 mar, 18:21, "Peter Olcott" <NoS...(a)OCR4Screen.com> wrote:
> "Hector Santos" <sant9...(a)nospam.gmail.com> wrote in message
>
> news:%23o9WRudyKHA.3304(a)TK2MSFTNGP06.phx.gbl...
>
>
>
>
>
> > Peter Olcott wrote:
>
> >>> If you can see that in the code, then quite honestly,
> >>> you don't know how to program or understand the concept
> >>> of programming.
>
> >> I am telling you the truth, I am almost compulsive about
> >> telling the truth. When the conclusions are final I will
> >> post a link here.
>
> > What GROUP is this?  No one will trust your SUMMARY unless
> > you cite the group.  Until you do so, you're lying and
> > making things up.
>
> > I repeat: If you can't see the code I posted proves your
> > thinking is incorrect, you don't know what you are talking
> > about and its becoming obvious now you don't have any kind
> > of programming or even engineering skills.
>
> > --
> > HLS
>
> I did not examine the code because I did not want to spend
> time looking at something that is not representative of my
> process.  Looks at the criteria on my other post, and if you
> agree that it meets this criteria, then I will look at your
> code.
>
> You keep bringing up memory mapped files. Although this may
> very well be a very good way to use disk as RAM, or to load
> RAM from disk, I do not see any possible reasoning that
> could every possibly show that a hybrid combination of disk
> and RAM could ever exceed the speed of pure RAM alone.
>
> If you can then please show me the reasoning that supports
> this.  Reasoning is the ONLY source of truth that I trust,
> all other sources of truth are subject to errors. Reasoning
> is also subject to errors, but, these errors can be readily
> discerned as breaking one or more of the rules of correct
> reasoning.

Hi,
First of all, sorry for jumping in. I'm no expert in this area, so I
can't help much with it. However, I would like to make a couple of
comments.

First, I think that if you ask a question in a forum, newsgroup, or
whatever, you should be ready to accept the answer, even if you don't
like it or it goes against what you thought. If you are not going to
accept, or at least take into account, the opinion of people who
probably know better than you, don't ask. If you *know* that your
solution is the best, don't ask whether it is; but if you ask, consider
the possibility that what other people say is true (or truer than what
you think).
Second, regarding "Reasoning is the ONLY source of truth that I
trust": that's what the ancient Greeks did, and they reached truths
like "heavier objects fall faster than lighter ones". The scientific
method has proved to be a better way to reach the truth in science.
Set a hypothesis, test it, and reach a conclusion. If one example
breaks the hypothesis, it was wrong. Stop. Change it. But you have to
test it.

Besides, trying to even imagine what goes on in a multicore system,
with its caches and all, the OS, etc., and trying to control it, seems
like a task for someone who probably should not be asking the
questions but answering them.
Just an opinion...
From: Hector Santos on
Hector Santos wrote:

> Peter Olcott wrote:


> One would suspect higher pressures to move virtual memory into the
> process working set in random fashion. But in reality, that randomness
> may not be as over pressuring as you expect.
>

> ...
>

> BEHOLD, it is even faster because of the randomness. The memory
> loading didn't climb because it didn't need to virtually load the entire
> 1.5GB into the process working set.


Peter, when you talk about an unpredictable scenario: in the
simulation world, there are techniques to give you low and high
boundary conditions and also to predict the most likely scenario.

Take the time to read about concepts such as Pareto's Principle and
how it is used in real-world practice, including high-end complex
simulations.

I say Pareto here because for YOU, your OCR application, it applies
VERY nicely.

Side note: the RISC computer is based on Pareto's principle; it makes
the MOST used instructions, the ones used 80% of the time, the fastest
- that is why it is faster. As the story goes, myth or not, the
inventor of the RISC chip used the disorganized mess on his desk to
get that flash of genius. He noticed that his most important papers
and memos were migrating to the top of the PILE and the least used
ones went to the bottom, hardly ever to be used or seen again. He
calculated the near-MAGIC Pareto ratio of 80:20, or 80%, that is so
apparent in so many things in real life. So he invented the Reduced
Instruction Set Computer - RISC!

For your OCR, I take it there is a character set. One optimization
would be to use Pareto's principle here to make sure that the 80%
most-used characters are "more" accessible or "faster" than the rest.

There are all kinds of ideas here. That is why Geoff, Joe, I, and
others who provided input have said you cannot use your primitive
single-thread process modeling as a BEST CASE scenario - you can't,
because it would be incorrect to believe it is when you didn't even
try to leverage anything the computer and hardware offer.


--
HLS
From: Joseph M. Newcomer on
See below...
On Mon, 22 Mar 2010 10:02:33 -0500, "Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote:

>
>"Joseph M. Newcomer" <newcomer(a)flounder.com> wrote in
>message news:ioueq5hdsf5ut5pha6ttt88e1ghl4q9l1m(a)4ax.com...
>> See below...
>> On Sun, 21 Mar 2010 21:06:20 -0500, "Peter Olcott"
>> <NoSpam(a)OCR4Screen.com> wrote:
>>
>>>
>>>"Joseph M. Newcomer" <newcomer(a)flounder.com> wrote in
>>>message news:vmvcq55tuhj1lunc6qcdi9uejup4jg1i4e(a)4ax.com...
>>>> NOte in the i7 architecture the L3 cache is shared
>>>> across
>>>> all CPUs, so you are less likely
>>>> to be hit by raw memory bandwidth (which compared to a
>>>> CPU
>> is dead-slow), and the answer as to
>> whether multiple threads will work effectively can
>>>> only
>>>> be determined by measurement of
>>>> a multithreaded app.
>>>>
>>>> Because your logic seems to indicate that raw memory
>>>> speed
>>>> is the limiting factor, and you
>>>> have not accounted for the effects of a shared L3 cache,
>>>> any opnion you offer on what is
>>>> going to happen is meaningless. In fact, any opinion
>> about performance is by definition
>>>> meaningless; only actual measurements represent facts
>>>> ("If
>>>> you can't express it in
>>>> numbers, it ain't science, it's opinion" -- Robert A.
>>>> Heinlein)
>>>
>>>(1) Machine A performs process B in X minutes.
>>>(2) Machine C performs process B in X/8 Minutes (800%
>>>faster)
>>>(3) The only difference between machine A and machine C is
>>>that machine C has much faster access to RAM (by whatever
>>>means).
>>>(4) Therefore Process B is memory bandwidth bound.
>> ***
>> Fred can dig a ditch 10 feet long in 1 hour. Charlie can
>> dig a ditch 10 feet long in 20
>> minutes. Therefore, Charlie is faster than Fred by a
>> factor of 3.
>>
>> How long does it take Fred and Charlie working together to
>> dig a ditch 10 feet long?
>> (Hint: any mathematical answer you come up with is wrong,
>> because Fred and Charlie (a)
>> hate each other, and so Charlie tosses his dirt into the
>> place Fred has to dig or (b) are
>> good buddies and stop for a beer halfway through the
>> digging or (c) Chalie tells Fred he
>> can do it faster by himself, and Fred just sits there
>> while Charlie does all the work and
>> finishes in 20 minutes, after which they go out for a
>> beer. Fred buys.
>>
>> You have made an obvious failure here in thinking that if
>> one thread takes 1/k the time
>> and the only difference is memory bandwidth, that two
>> threads are necessarily LINEAR. Duh!
>> IT IS NOT THE SAME WHEN CACHES ARE INVOLVED! YOU HAVE NO
>> DATA! You are jumping to an
>
>(1) People in a more specialized group are coming to the
>same conclusions that I have derived.
****
How? I have no idea how to predict L3 cache performance on an i7 system, and I don't
believe they do, either. No theoretical model exists that is going to predict actual
behavior, short of a detailed simulation, and I talked to Intel and they are not releasing
performance statistics, period, so there is no way short of running the experiment to
obtain a meaningful result.
****
>
>(2) When a process requires essentially random (mostly
>unpredictable) access to far more memory than can possibly
>fit into the largest cache, then actual memory access time
>becomes a much more significant factor in determining actual
>response time.
****
What is your cache collision ratio, actually? Do you really understand the L3 cache
replacement algorithm? (I can't find out anything about it on the Intel site! So I'm
surprised you have this information, which Intel considers Corporate Confidential)
****
>
>> unwarranted conclusion based on what I can at best tell is
>> a coincidence. And even if it
>> was true, caches give nonlinear effects, so you are not
>> even making sense when you make
>> these assertions! You have proven a case for value N, but
>> you have immediately assumed
>> that if you prove the case for N, you have proven it for
>> case N+1, which is NOT how
>> inductive proofs work! So you were so hung up on
>> geometric proofs, can you explain how,
>> when doing an inductive proof, that proving the case for 1
>> element tells you what the
>> result is for N+1 for arbitrary value N? Hell, it
>> doesn't even tell you the results for
>> N=1, but you have immediately assumed that it is a valid
>> proof for all values of N!
>>
>> YOU HAVE NO DATA! You are making a flawed assumption of
>> linearity that has no basis!
>> Going to your fixation on proof, in a nonlinear system
>> without a closed-form analytic
>> solution, demonstrate to me that your only possible
>> solution is based on a linear
>> assumption. You are ignoring all forms of reality here.
>> You are asseting without basis
>> that the system is linear (it is known that systems with
>> caches are nonlinear in memory
>> performance). So you are contradicting known reality
>> without any evidence to support your
>> "axiom". It ain't an axiom, it's a wild-assed-guess.
>>
>> Until you can demonstrate with actual measured performance
>> that your system is COMPLETELY
>> linear behavior in an L3 cache system, there is no reason
>> to listen to any of this
>> nonsense you keep espousing as if it were "fact". You
>> have ONE fact, and that is not
>> enough to raise your hypothesis to the level of "axiom".
>>
>> All you have proven is that a single thread is limited by
>> memory bandwidth. You have no
>> reason to infer that two threads will not BOTH run faster
>> because of the L3 cache effects.
>> And you have ignored L1/L2 cache effects. You have a
>> trivial example from which NOTHING
>> can be inferred about multithreaded performance. You have
>> consistently confused
>> multiprocess programming with multithreading and arrived
>> at erroneous conclusions based on
>> flawed experiments.
>>
>> Note also if you use a memory-mapped file and two
>> processes share the same mapping object
>> there is only one copy of the data in memory! This has
>> not previously come up in
>> discussions, but could be critical to your performance of
>> multiple processes.
>> joe
>> ****

>> Joseph M. Newcomer [MVP]
>> email: newcomer(a)flounder.com
>> Web: http://www.flounder.com
>> MVP Tips: http://www.flounder.com/mvp_tips.htm
>
Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
From: Hector Santos on
Hector Santos wrote:

> I say Pareto here because for YOU, your OCR application, it applies VERY
> nicely.
>

> ..
>

> For your OCR, I take it there is a character set. One optimization
> would be to use pareto's principle here to make sure that 80% of the
> most-used characters are "more" accessible or "faster" than the rest.

To simulate this, you would basically have two random calculations in
ProcessData():

1) A random selection to decide whether to use the 80% group or the
20% group.
2) A random selection to decide which index within the group.

Of course, you have to decide which characters are the most used.

In the simulator (version 3, the file mapping version), I used a
file mapping class that allows me to quickly do something like this:
typedef struct tagTData {
    char reserve[512];
} TData;


TFileMap<TData> fmdata("d:\\largefilemap.dat", 3000000);

which will open/create an ISAM file of TData records and file-map the
file handle; 3000000 records * 512 bytes gives the 1.5 GB.

The class allows me to access the records by index using operators, so
I can do something like this:

TData rec = fmdata[i];

I can change the TData to this:

typedef struct tagTData {
    BOOL mostused;     // set if it's a most-used item
    char reserve[508];
} TData;

or I can use two maps, one most used and the other least used:

TFileMap<TData> mfmdata("d:\\largefilemap-most.dat", 3000000 *.80);
TFileMap<TData> lfmdata("d:\\largefilemap-least.dat", 3000000 *.20);

Either way, you can simulate the idea of most- vs. least-used records
and get a very close emulation of your system and a prediction of what
to expect.

--
HLS