From: William Ahern
Peter Olcott <NoSpam(a)ocr4screen.com> wrote:
> "Eric Sosman" <esosman(a)ieee-dot-org.invalid> wrote in
> message news:ho5tof$lon$1(a)news.eternal-september.org...
<snip>
> > But if there's another CPU/core/strand/pipeline, it's possible that one
> > processor's stall time could be put to productive use by another if
> > there were multiple execution threads.
<snip>
> Is there any possible way that this app is not memory bound? Can you
> provide a specific concrete example?

Your question was answered.

You're hung up on your numbers and preconceived ideas. Your application
could be BOTH memory bound AND able to benefit from multiple CPUs. But it's
nearly impossible to guess without knowing at least the algorithm; more
specifically, the code.

From: William Ahern
Peter Olcott <NoSpam(a)ocr4screen.com> wrote:
<snip>
> I don't want to spend hundreds of hours making a complex
> process thread-safe just to prove what I knew all along.

You don't really know anything unless you've proved it.

But there is such a thing as rational ignorance. If the benefit of knowing
is worth less than the cost of figuring it out, move on.

If you're merely trying to satisfy your curiosity, you seem to have hit a
brick wall, because there's no easy answer.

From: Peter Olcott

"William Ahern" <william(a)wilbur.25thandClement.com> wrote in
message news:pe9k77-gjk.ln1(a)wilbur.25thandClement.com...
> Peter Olcott <NoSpam(a)ocr4screen.com> wrote:
>> "Eric Sosman" <esosman(a)ieee-dot-org.invalid> wrote in
>> message news:ho5tof$lon$1(a)news.eternal-september.org...
> <snip>
>> > But if there's another CPU/core/strand/pipeline, it's possible that one
>> > processor's stall time could be put to productive use by another if
>> > there were multiple execution threads.
> <snip>
>> Is there any possible way that this app is not memory bound? Can you
>> provide a specific concrete example?
>
> Your question was answered.
>
> You're hung up on your numbers and preconceived ideas. Your application
> could be BOTH memory bound AND able to benefit from multiple CPUs. But it's
> nearly impossible to guess without knowing at least the algorithm; more
> specifically, the code.
>

The algorithm is essentially a huge deterministic finite
automaton whose transition table is much larger than the
largest cache, and whose memory access pattern is
unpredictable to any cache algorithm.

The core processing of this DFA is to look up in memory the
next location to look up in memory; it does very little
else.


From: Ersek, Laszlo
In article
<89bdf509-0afa-48c3-a107-67cdaaa27eee(a)t9g2000prh.googlegroups.com>,
David Schwartz <davids(a)webmaster.com> writes:

> But also, it may be memory bandwidth bound because it's single-
> threaded. Assume, for example, the memory access pattern looks like
> this:
>
> X, Y, X+1, Y+1, X+2, Y+2, X+3, Y+3
>
> Imagine the prefetcher cannot see the request for 'X+1' until after it
> processes 'Y'. This can be a worst case scenario, as the memory
> controller keeps opening and closing pages and is unable to balance
> the channels.
>
> Now imagine you turn it into two threads, one doing this:
> X, X+1, X+2, X+3
> and one doing this:
> Y, Y+1, Y+2, Y+3
>
> Now, the prefetcher (still seeing only one read ahead) will see the
> read for X+1 when it processes the read for X. The net result will be
> that the two threads will run about twice as fast with the same memory
> hardware, even though they are purely memory limited.

mind = blown. Thank you, I'm learning a lot from you.
lacos
From: Scott Lurndal
"Peter Olcott" <NoSpam(a)OCR4Screen.com> writes:

>I don't want to spend hundreds of hours making a complex
>process thread-safe just to prove what I knew all along.
>
>(1) Machine A performs process B in X minutes.
>(2) Machine C performs process B in X/8 minutes (8x as
>fast)
>(3) The only difference between machine A and machine C
> is that machine C has much faster access to RAM (by
>whatever means).

This is _highly_ speculative. There can be many other
reasons that machine C is faster; clearly the larger L3
footprint will have an effect, but the processor internals
could also have reduced the CPI (cycles per instruction) or
widened the ALU for more superscalar operations; even if the
gross clock speed doesn't appear to differ between A & C.

The current crop of Corei7/Athlon64/Opteron all have memory
controllers on-chip and have eliminated the FSB, which also
improves memory throughput significantly on multisocket
configurations. Upcoming Intel processors will have multiple
DRAM channels per memory controller, providing even more
bandwidth.

scott