From: William Ahern on
Peter Olcott <NoSpam(a)ocr4screen.com> wrote:
> The algorithm is essentially a huge deterministic finite
> automaton where the memory required is much larger than the
> largest cache, and the memory access pattern is essentially
> unpredictable to any cache algorithm.
>
> The essential core processing of this DFA is to lookup in
> memory the next location to look up in memory, it does very
> little else.

I don't have sufficient experience in this particular area to be of much
help. But assuming you're just executing a single DFA, if I were in your
shoes I see two non-exclusive options: (1) research parallel DFA algorithms
via ACM Portal (subscription required) and CiteSeerX; and (2) look for ways
prefetching--as mentioned elsethread--might help, including using a
separate dedicated thread to prefetch, or otherwise making sure the memory
is being fully utilized.

To reiterate, just because it's memory bound doesn't mean it's using memory
optimally. Multiple processors may be helpful. Indeed, a prefetcher is just
another processor, either a general purpose CPU being used to prefetch, or
dedicated hardware executing in parallel to prefetch.
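
As a rough illustration of the "keep the memory fully utilized" idea -- my
sketch, not code from this thread, and it only applies if the workload
contains several independent DFA walks (for example, separate input
documents): interleaving the walks in one loop lets the core keep several
cache misses in flight at once, whereas a single dependent chain is limited
to one outstanding miss at a time.

#include <vector>

// Hypothetical helper: 'next' is a flattened transition/target array and
// s0..s3 are the current positions of four independent walks.
void step_four_chains(const std::vector<unsigned> &next,
                      unsigned &s0, unsigned &s1, unsigned &s2, unsigned &s3,
                      unsigned steps)
{
    for (unsigned i = 0; i < steps; ++i) {
        // These four loads do not depend on one another, so the memory
        // system can service their misses concurrently.
        s0 = next[s0];
        s1 = next[s1];
        s2 = next[s2];
        s3 = next[s3];
    }
}
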
From: Scott Lurndal on
William Ahern <william(a)wilbur.25thandClement.com> writes:
>Peter Olcott <NoSpam(a)ocr4screen.com> wrote:
>> The algorithm is essentially a huge deterministic finite
>> automaton where the memory required is much larger than the
>> largest cache, and the memory access pattern is essentially
>> unpredictable to any cache algorithm.
>>
>> The essential core processing of this DFA is to lookup in
>> memory the next location to look up in memory, it does very
>> little else.
>
>I don't have sufficient experience in this particular area to be of much
>help. But assuming you're just executing a single DFA, if I were in your
>shoes I see two non-exclusive options: (1) research parallel DFA algorithms
>via ACM Portal (subscription required) and CiteSeerX; and (2) look for ways
>prefetching--as mentioned elsethread--might help, including using a
>separate dedicated thread to prefetch, or otherwise making sure the memory
>is being fully utilized.
>
>To reiterate, just because it's memory bound doesn't mean it's using memory
>optimally. Multiple processors may be helpful. Indeed, a prefetcher is just
>another processor, either a general purpose CPU being used to prefetch, or
>dedicated hardware executing in parallel to prefetch.

One of the things that really kills this type of application is TLB
refills. Since each TLB refill for a 4k page results in several
memory accesses (3 or 4 depending on 32-bit or 64-bit page tables),
if you get a TLB miss on each reference (a characteristic of such
applications), you're performing 4 or 5 memory accesses to retrieve
the desired cache line. Granted, the higher-level page table
entries _may_ still be in the processor cache, but the lower-level
entries are very unlikely to be.

To improve this, code your application to use larger page sizes (4MB
on 32-bit systems, 2MB on 64-bit, 1GB on Opteron). If such page
sizes are supported by the OS (Linux supports only 4MB/2MB pages
for applications), you can improve your application's performance
considerably. 1GB pages are best if you can use them, since a TLB
fill then needs only two references, one of which will be cached.
I don't think any of the distros support 1GB pages yet; not sure
about the kernel.

On Linux, look for hugetlbfs, mmap() with MAP_HUGETLB, or the SHM_HUGETLB
flag for shmget().
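
A minimal sketch of the mmap() route -- my illustration, not from this
post -- assuming a kernel new enough to have MAP_HUGETLB (roughly 2.6.32
onward) and huge pages reserved beforehand, e.g. via
"echo 512 > /proc/sys/vm/nr_hugepages":

#include <stdio.h>
#include <sys/mman.h>

int main() {
    // 1 GB of backing for the transition table, a multiple of the 2MB
    // huge page size.  The mapping fails if too few huge pages are
    // reserved.
    size_t len = 512UL * 2 * 1024 * 1024;
    void *table = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (table == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");
        return 1;
    }
    /* ... build and walk the DFA transition table in 'table' ... */
    munmap(table, len);
    return 0;
}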

One caveat: because Intel and AMD are lazy[*], they require large pages
to be naturally aligned - this means that large pages need to be reserved
and allocated at boot time, or early after boot, to ensure that sufficient
aligned contiguous physical memory is available.

scott

[*] Perhaps not fair, but it means they can currently just wire-OR the
address bits within the page with the physical address in the TLB to
generate the resulting address. If they allowed misaligned large pages,
they'd need to do an add instead, which might perform less well but would
make large pages _much_ more usable.
From: Peter Olcott on

"William Ahern" <william(a)wilbur.25thandClement.com> wrote in
message news:3nhn77-cnl.ln1(a)wilbur.25thandClement.com...
> Peter Olcott <NoSpam(a)ocr4screen.com> wrote:
>> The algorithm is essentially a huge deterministic finite
>> automaton where the memory required is much larger than the
>> largest cache, and the memory access pattern is essentially
>> unpredictable to any cache algorithm.
>>
>> The essential core processing of this DFA is to lookup in
>> memory the next location to look up in memory, it does very
>> little else.
>
> I don't have sufficient experience in this particular area to be of
> much help. But assuming you're just executing a single DFA, if I were
> in your shoes I see two non-exclusive options: (1) research parallel
> DFA algorithms via ACM Portal (subscription required) and CiteSeerX;
> and (2) look for ways prefetching--as mentioned elsethread--might
> help, including using a separate dedicated thread to prefetch, or
> otherwise making sure the memory is being fully utilized.
>
> To reiterate, just because it's memory bound doesn't mean it's using
> memory optimally. Multiple processors may be helpful. Indeed, a
> prefetcher is just another processor, either a general purpose CPU
> being used to prefetch, or dedicated hardware executing in parallel
> to prefetch.

I was inspired by a member of another group to write the code below,
which tests the memory access performance of a machine. I ran it as
four separate processes, and because the execution time (on a quad-core
processor) was only slightly worse than that of a single process, I
concluded that I am probably not hitting a memory bandwidth bottleneck.
All the code does is jump to distant places in a very large
std::vector.

Slight modifications may be required to run under Unix.

#include <stdio.h>
#include <stdlib.h>
#include <vector>
#include <time.h>

typedef unsigned int uint32;

const uint32 repeat = 2;
//const uint32 size = 488000000;
const uint32 size = 100000000;
std::vector<uint32> Data;

// Chase indices through Data: every element holds the index of the next
// element to visit, so each load depends on the one before it.
void Process() {
    clock_t finish;
    clock_t start = clock();
    double duration;
    uint32 num = 0;   // start the chain at a defined index
    for (uint32 r = 0; r < repeat; r++)
        for (uint32 i = 0; i < size; i++)
            num = Data[num];
    finish = clock();
    duration = (double)(finish - start) / CLOCKS_PER_SEC;
    printf("%4.2f Seconds\n", duration);
    printf("Final index--->%u\n", num);  // use the result so the loop is not optimized away
}

int main() {
    printf("Size in bytes--->%u\n", size * 4);
    Data.reserve(size);
    for (uint32 N = 0; N < size; N++) {
        // Multiply in unsigned arithmetic to avoid signed overflow and to
        // reach indices larger than RAND_MAX.
        uint32 Random = (uint32)rand() * (uint32)rand();
        Random %= size;
        // printf("Random--->%u\n", Random);
        Data.push_back(Random);
    }

    char N;
    printf("Hit any key to Continue:");
    scanf("%c", &N);

    Process();

    return 0;
}




From: Moi on
On Mon, 22 Mar 2010 12:10:48 -0500, Peter Olcott wrote:

> "William Ahern" <william(a)wilbur.25thandClement.com> wrote in message
> news:pe9k77-gjk.ln1(a)wilbur.25thandClement.com...
>> Peter Olcott <NoSpam(a)ocr4screen.com> wrote:
>>> "Eric Sosman" <esosman(a)ieee-dot-org.invalid> wrote in message
>>> news:ho5tof$lon$1(a)news.eternal-september.org...
>> <snip>
>>> > But if there's another CPU/core/strand/pipeline, it's possible
>>> > that one processor's stall time could be put to productive use by
>>> > another if there were multiple execution threads.
>> <snip>
>>> is there any possible way that this app is not memory bound that
>>> you can provide a specific concrete example of?
>>
>> Your question was answered.
>>
>> You're hung up on your numbers and preconceived ideas. Your
>> application could be BOTH memory bound AND able to benefit from
>> multiple CPUs. But it's nearly impossible to guess without knowing
>> at least the algorithm; more specifically, the code.
>>
>>
> The algorithm is essentially a huge deterministic finite automaton where
> the memory required is much larger than the largest cache, and the
> memory access pattern is essentially unpredictable to any cache
> algorithm.
>
> The essential core processing of this DFA is to lookup in memory the
> next location to look up in memory, it does very little else.

The obvious way to accelerate this program would IMHO be to 'cluster' the
"frequently occurring together" states into adjacent positions in core.
(You only need to renumber the states.)

You could start by putting the "hard hitters" together. Assuming a 90/10
rule, that would buy you a factor of about four in memory bandwidth
pressure. Also, keeping the structures aligned to the (cache-line) slot
size could save you some cache misses.
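
A minimal sketch of that renumbering -- my illustration, not code from
this thread -- assuming the DFA is stored as a two-dimensional transition
table and that per-state visit counts are available from a profiling run:

#include <algorithm>
#include <utility>
#include <vector>

typedef unsigned int uint32;

// Renumber states so the most frequently visited ones get the lowest
// indices, packing the "hot" rows next to each other in memory.
std::vector< std::vector<uint32> >
renumber_by_frequency(const std::vector< std::vector<uint32> > &table,
                      const std::vector<unsigned long long> &counts)
{
    const uint32 n = (uint32)table.size();

    // Sort states by visit count (ascending), then hand out new numbers
    // from the back so the hottest state becomes state 0.
    std::vector< std::pair<unsigned long long, uint32> > by_count(n);
    for (uint32 s = 0; s < n; s++)
        by_count[s] = std::make_pair(counts[s], s);
    std::sort(by_count.begin(), by_count.end());

    std::vector<uint32> old_to_new(n);
    for (uint32 rank = 0; rank < n; rank++)
        old_to_new[by_count[n - 1 - rank].second] = rank;

    // Rebuild the table under the new numbering; both the row positions
    // and the stored target states must be remapped.
    std::vector< std::vector<uint32> > remapped(n);
    for (uint32 s = 0; s < n; s++) {
        remapped[old_to_new[s]].resize(table[s].size());
        for (uint32 c = 0; c < table[s].size(); c++)
            remapped[old_to_new[s]][c] = old_to_new[table[s][c]];
    }
    return remapped;
}

After a pass like this, most lookups should land in the hot prefix of the
table, which is where the suggested reduction in bandwidth pressure would
come from.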

/2cts

AvK