From: Joseph M. Newcomer on
See below...
On Sun, 21 Mar 2010 00:16:08 -0500, "Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote:

>
>"Hector Santos" <sant9442(a)nospam.gmail.com> wrote in message
>news:eGiPNBLyKHA.1548(a)TK2MSFTNGP02.phx.gbl...
>> Peter Olcott wrote:
>>
>>> Would you agree that if the empirical test shows that my
>>> single process is taking up 100% of the memory bandwidth,
>>> that multiple cores or multiple threads could not help
>>> increase speed?
>>
>>
>> You are thinking about this all wrong.
>>
>> You have a quantum based context switching you can't stop,
>> even for a single thread process. In other words, you
>> will never have 100% full exclusive control of MEMORY
>> ACCESS - never. If you did, nothing else will run.
>
>OK fine then nitpick to death the insignificant imprecision
>of my statement.
>
>The next question is an attempt to see if we can agree on
>anything or that your mind is stuck in refute mode. Please
>do not avoid this question, please answer this question.
>
>Given (as in Geometry) as an immutable premise (to be taken
>as true even if it is false) that my process takes
>essentially ALL of the memory bandwidth, then within the
>specific context of this immutable premise (to be taken as
>true even if it is false) could adding another such process
>speed things up or would it slow them down?
****
The point of geometry is that within a given structure, you CAN have axioms (your
"immutable premise"), but you are making an ASSERTION (without proof) that your
application consumes all of memory bandwidth. This is NOT an axiom; you HYPOTHESIZE that
this is true based on some bad experiments that produce numbers which, without any proof,
you assume demonstrate this beyond question. That is NOT the same as a proof in geometry.
(And did you know that most plane geometry has as an underlying axiom a flat plane? Apply
the axioms of classic geometry to spheres, or Minkowski spaces (which can be concave,
convex, or saddle-shaped), and everything falls apart. Parallel lines meet, for example.
So you are asserting a geometry and a set of unfounded axioms which you have asserted
without proof.)

You are also insisting on confusing multiple processes with multiple threads. I have no
idea why you keep making this fundamental error.
****
>
>> What I am saying is this, suppose you have 10 lines of DFA
>> C code, the compiler creates OP CODES for these 10 lines.
>> Each OP CODE has a fixed frequency cycle. When the
>> accumulated frequency reaches a QUANTUM (~15ms), you will
>> get a context switch - in other words, your code is
>> preempted (stopped), swapped out, and Windows will give all
>> other threads a chance to run.
>
>This does not occur on my quad-core machine. I consistently
>get every bit of all of the CPU cycles of a single core.
****
Actually, you have NOTHING to justify this assertion. Do you know how often the CPU stalls
because data is not present? No, I have no idea how to derive this information, but
neither do you. Yet you use this unfounded premise as an axiom in the system. In
geometry, we might prove a lemma first, and then, having proven the lemma, we go on to use
that as part of the proof of a more complex theorem. But in all cases, we make no
presumption about unproven hypotheses; we can use one in a proof of a theorem, but
until we can prove that hypothesis, we cannot consider our proof of the theorem valid.
(Read the history of the recent proof of Fermat's Last Theorem! Turns out Andrew Wiles
used an unproven leap of faith to prove part of the theorem, and had to go back and apply
some subtle numerical theory to prove his unfounded hypothesis, which he was able to do!)


At no point do you have a clue that this "axiom" you have, that you are using all of
memory bandwidth, is valid, nor do you have anything to prove that the shared L3 cache is
not going to amortize the costs across all cores running the code. I see no axiom here.
Just a lot of unfounded hypotheses.
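
One way to replace the hypothesis with a measurement is to time the same read-only scan with one thread and then with several threads sharing one buffer. What follows is only a minimal sketch: the names scan and timed_scan are made up for illustration, and a sequential sum over a std::vector is an assumption standing in for the real DFA access pattern, which I have not seen.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical helper: sum a shared, read-only buffer 'passes' times.
// Returning the checksum keeps the compiler from discarding the reads.
std::uint64_t scan(const std::vector<std::uint32_t>& data, int passes) {
    std::uint64_t sum = 0;
    for (int p = 0; p < passes; ++p)
        sum = std::accumulate(data.begin(), data.end(), sum);
    return sum;
}

// Hypothetical helper: run 'nthreads' scans concurrently over the SAME
// buffer, store each thread's checksum in 'sums', return elapsed seconds.
double timed_scan(const std::vector<std::uint32_t>& data, int nthreads,
                  int passes, std::vector<std::uint64_t>& sums) {
    assert(nthreads > 0);
    sums.assign(nthreads, 0);
    std::vector<std::thread> pool;
    auto start = std::chrono::steady_clock::now();
    for (int t = 0; t < nthreads; ++t)
        pool.emplace_back([&data, &sums, t, passes] {
            sums[t] = scan(data, passes);  // each thread writes only its own slot
        });
    for (auto& th : pool) th.join();
    std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();
}
```

Call timed_scan with a buffer much larger than the L3 cache, once with nthreads = 1 and once with nthreads = 4. If the four-thread run takes about as long as the one-thread run, a single thread was not consuming all of the memory bandwidth; if it takes about four times as long, the premise has some support. Either way it is then a number, not an axiom.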
****
>
>>
>> That gives other threads in your process, if it was
>> multi-thread to do the same type of memory access work.
>> Since it is READ ONLY, there is no contention. If your
>> preempted thread BLOCKED it, then your have contention or
>> even a dead lock - but you are not doing that. You are
>> reading only READ ONLY memory - which will have a maximum
>> access.
>>
>> Now comes a MULTI-CORE, and you have two or more threads,
>> the SPEED is that there is NO CONTEXT SWITCHING - you
>> still may have the same memory access, but that would be
>> no slower if it was single cpu. Your speed comes in less
>> context switching. Understand?
>>
>> In short:
>>
>> single cpu: speed lost due to context switching
>> multi cpu/core: less context switching, more resident
>> time.
>>
>> You cannot think in terms of a single thread process
>> because there is no advantage for it on a multi-core/cpu
>> machine.
****
Absolutely! And here's where Peter keeps falling down: he doesn't recognize that a
multithreaded single-process in a multicore machine is a COMPLETELY DIFFERENT problem from
a set of multiple processes!
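
The difference can be sketched in a few lines. This is an illustration only, with a made-up function name (table_addresses_seen) and a plain std::vector<int> standing in for the real data: every thread in one process is handed the same table at the same address, so there is exactly one copy in memory. A set of separate processes that each load the file would each hold a private copy.

```cpp
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical helper: start 'nthreads' workers on one read-only table and
// record the address of the table storage that each worker sees.
std::vector<const int*> table_addresses_seen(const std::vector<int>& table,
                                             int nthreads) {
    assert(nthreads > 0);
    std::vector<const int*> seen(nthreads, nullptr);
    std::vector<std::thread> pool;
    for (int t = 0; t < nthreads; ++t)
        pool.emplace_back([&table, &seen, t] {
            seen[t] = table.data();  // no copy is made; only the address is read
        });
    for (auto& th : pool) th.join();
    return seen;
}
```

Every entry in the returned vector equals table.data(): N threads, one table. Nothing needs to be locked because the workers only read it.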
****
>>
>> The INTEL Multi-Core chips has advanced technology to help
>> multi-threaded applications. Single thread processes can
>> not benefit on multi-core machine. They must be designed
>> for threads to see any benefits. If you want to read up
>> on it, check out the Intel technical documents, like this
>> one:
>>
>>
>> http://download.intel.com/technology/architecture/sma.pdf
>>
>> Specifically read SMA "Smart Memory Access"
>>
>> The bottom line is really simple:
>>
>> You have a single process with a huge memory load. Each
>> instance redundantly create additional huge memory loads
>> and that alone will cause a SYSTEM WIDE performance
>> serious degradation with huge page faulting and context
>> switching delays.
>>
>> You will never get any improvements until you change your
>> memory usage for intelligent sharable and use threads.
>> When done correctly, you will gain benefits provided by
>> the OS and machine.
>>
>> You really need to look at this as a whole:
>>
>> 1 process - with X number of threads
>>
>> vs
>>
>> X single thread Processes.
>>
>> You need to trust us this is NOT the same when the DATA is
>> HUGE!. In the threaded model, it is shared. In the
>> non-threaded model, is redundant for each instance - that
>> will murder you!
>>
>> If you had NO HUGE data requirement, then they become more
>> equal because now its just CODE.
>>
>> Now, it is conceivable that for your specific application,
>> you might realize that X may be 5-10 threads before you
>> see a performance issue that isn't to your liking.
>>
>> Show me how you are using std::vector with your files, and
>> I will create a simulator for you to PROVE to you how your
>> thinking is all wrong. This simulator will allow you to
>> fine tune it to determine what is your boundary conditions
>> for performance.
>>
>> While you have 20,000 hrs into this WITHOUT even exploring
>> high end thread designs, I have 6 years in Intel RMX
>>
>> http://en.wikipedia.org/wiki/RMX_(operating_system)
>>
>> which was considered one of the early Intel "multi-thread"
>> frameworks we have today and gave me an early nature
>> understanding when NT 3.1 (17 years?) where I have done
>> exclusively high-end multi-threaded commercial server
>> products since then. Count the hours! I can assure you,
>> your single process thinking is wrong.
>>
>> --
>> HLS
>
Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm