From: Mike on

<nmm1(a)cam.ac.uk> wrote in message
news:hhapct$4vr$1(a)smaug.linux.pwf.cam.ac.uk...
| In article <7qSdna5HxYemf6XWnZ2dnUVZ_tudnZ2d(a)bestweb.net>,
| Mayan Moudgill <mayan(a)bestweb.net> wrote:
| > >
| > > Consider a program that has a large array of one-byte flags, and
needs
| > > to update that array in parallel. No single flag will be
accessed by
| > > two different threads without synchronisation, but the adjacency
| > > properties are such that it cannot practically be separated into
one
| > > array for each thread. You can get the same issue by splitting
up a
| > > very large string (think DNA or whatever).
| >
| >This kind of computation will have performance issues on single
threaded
| >code as well - for large enough array sizes, you'll get horrible
cache
| >hit rates. Potentially *EVERY* update will cause a cache miss. In
that
| >case, going multi-core *COULD* give you super-linear _speed-up_ due
to
| >increased aggreate cache-size.
|
| No, no, not at all. SOME such code has the properties you describe;
| other code has quite good caching properties, but relies on a few
| byte-separation accesses. It's the latter that are the killer. Not
| merely are they currently legal (almost recommended) and efficient,
| but tracking down a small number of 'aliasing' problems if often a
| nightmare. MTBF times of hours, with no two successive failures
| showing up in the same way :-(
|
| If the failure were trapped by hardware, there wouldn't be a
problem,
| but all it does is to give wrong answers, non-repeatably.


Here is an idea.

What if there were some mechanism to ensure that all threads that
update contiguous bytes forming a single cache line must run on a
single CPU? When a large array or structure spans multiple cache
lines, additional threads and CPUs are allocated.
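
Roughly, something like this - just a sketch to show what I mean,
assuming Linux's pthread_setaffinity_np() and a 64-byte line; the
offsets[] values and the update in worker() are made up:

/* Sketch only: force threads whose byte ranges split a cache line
   onto the same CPU.  Assumes Linux/glibc affinity calls. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

#define LINE     64
#define NTHREADS 4

static unsigned char flags[1 << 20];
static int offsets[NTHREADS + 1] = { 0, 100003, 400019, 700001, 1 << 20 };

static void *worker(void *arg)
{
    int t = (int)(long)arg;
    for (int i = offsets[t]; i < offsets[t + 1]; i++)
        flags[i] ^= 1;              /* stand-in for the real update */
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    int cpu_of[NTHREADS], next_cpu = 0;

    /* If a thread's first byte sits in the middle of a line, its bytes
       share that line with the previous thread, so give it the same
       CPU; otherwise give it a fresh one. */
    for (int t = 0; t < NTHREADS; t++)
        cpu_of[t] = (t > 0 && offsets[t] % LINE != 0)
                    ? cpu_of[t - 1] : next_cpu++;

    for (int t = 0; t < NTHREADS; t++) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu_of[t], &set);
        pthread_create(&tid[t], NULL, worker, (void *)(long)t);
        pthread_setaffinity_np(tid[t], sizeof set, &set);
    }
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);
    return 0;
}

(The grouping rule in the first loop is the whole point; the rest is
scaffolding, and whether pinning is actually sufficient is a separate
question.)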

Mike




From: nmm1 on
In article <fLOdnQLzHcrvZKXWnZ2dnUVZ_qOdnZ2d(a)earthlink.com>,
Mike <mike(a)mike.net> wrote:
>
>What if there were some mechanism to ensure that all threads that
>update contiguous bytes forming a single cache line must run on a
>single CPU? When a large array or structure spans multiple cache
>lines, additional threads and CPUs are allocated.

It doesn't help. This is a language problem, and not a hardware one.
A compiler could not make use of any such mechanism unless the
language provides the information for it to do so.


Regards,
Nick Maclaren.
From: Mayan Moudgill on
nmm1(a)cam.ac.uk wrote:
> In article <7qSdna5HxYemf6XWnZ2dnUVZ_tudnZ2d(a)bestweb.net>,
> Mayan Moudgill <mayan(a)bestweb.net> wrote:
>
>>>Consider a program that has a large array of one-byte flags, and needs
>>>to update that array in parallel. No single flag will be accessed by
>>>two different threads without synchronisation, but the adjacency
>>>properties are such that it cannot practically be separated into one
>>>array for each thread. You can get the same issue by splitting up a
>>>very large string (think DNA or whatever).
>>
>>This kind of computation will have performance issues on single threaded
>>code as well - for large enough array sizes, you'll get horrible cache
>>hit rates. Potentially *EVERY* update will cause a cache miss. In that
>>case, going multi-core *COULD* give you super-linear _speed-up_ due to
>>increased aggregate cache-size.
>
>
> No, no, not at all. SOME such code has the properties you describe;
> other code has quite good caching properties, but relies on a few
> byte-separation accesses.

Sorry, don't understand. Can you give a more concrete example? What do
you mean by "byte-separation accesses"?

I was assuming that you were trying to parallelize:

for( i = 0; i < N; i++ ) {
    B[f(i)] = g(i, ...);
}

where B is the byte array, f(i) is some arbitrary scatter vector or
computation, and g() is some function.

If g() is dependent on some values of B[], then I would have thought
that, for irregular f(), parallelization would not pay off; g(M, ...)
would be dependent on all i, 0 <= i < M.

For g() not dependent on B[], or not dependent on the loop-computed
values of B[], there is, of course, no need for synchronization inside
the loop; a barrier after (and possibly before) the loop suffices. The
major performance inhibitor is the ping-ponging of cache lines. But, as
I said before, this will probably be correlated with the cache behavior
of the non-parallelized code; if there aren't too many misses in the
serial code, then block parallelization of the form:

forall p
    for( i = p*N/P; i < (p+1)*N/P; i++ )
        B[f(i)] = g(i, ....);

SHOULD result in equivalent behavior on the parallel code.
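
As a minimal sketch of that decomposition using OpenMP (f() and g()
here are just placeholders, not anything from your example):

#include <omp.h>
#include <stddef.h>

#define N (1 << 24)
static unsigned char B[N];

static size_t f(size_t i) { return i; }          /* placeholder scatter */
static unsigned char g(size_t i)                 /* placeholder update  */
{
    return (unsigned char)(i & 0xff);
}

void update(void)
{
    #pragma omp parallel
    {
        size_t p = (size_t)omp_get_thread_num();
        size_t P = (size_t)omp_get_num_threads();
        for (size_t i = p * N / P; i < (p + 1) * N / P; i++)
            B[f(i)] = g(i);
    }   /* implicit barrier here is the barrier after the loop */
}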

The last case of interest is g() dependent on B[], but the dependencies
& f() are sufficiently regular that we can figure out a parallel
iteration mapping with a small enough number of synchronization points
to make it practical to actually do the loop in parallel. If I read your
statement correctly, your claim is that in that case, we will end up
with line ping-ponging because of byte flag sharing.

I am somewhat skeptical of this; given that there is sufficient
regularity in the various index expressions to admit profitable
parallelization, I suspect that there will be enough regularity to
permit restructuring (some variant of tiling, striding, skewing etc.) of
the iteration order to decrease the ping-pong behavior.

A concrete example would be nice to have here.
From: nmm1 on
In article <ubGdnTyFcsjdnaTWnZ2dnUVZ_sGdnZ2d(a)bestweb.net>,
Mayan Moudgill <mayan(a)bestweb.net> wrote:
>
>Sorry, don't understand. Can you give a more concrete example? What do
>you mean by "byte-separation accesses"?

The example you give is one such, but consider this:

unsigned char flags[1000000000];
int offsets[num_threads + 1];
/* This passes different sections of the flags to different threads,
   for updating. */
fred(&flags[offsets[thread]], offsets[thread+1] - offsets[thread]);

The problem arises in the case where the offsets are determined by the
algorithm and cannot conveniently be made multiples of the cache line
size. Even when they can be, it is at best disgusting to have to build
in such a system-specific value just to ensure that a program is
correct. That is the absolute antithesis of portability!
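
To make that concrete, here is a sketch of the 'fix' I mean; CACHE_LINE
is a guessed constant - exactly the system-specific value I object to -
and offsets[] has num_threads+1 entries as above:

#define CACHE_LINE 64   /* assumed line size, baked into the source */

/* Round each algorithm-chosen boundary up to an assumed line size,
   so that no two threads' sections share a line. */
static void align_offsets(int *offsets, int num_threads)
{
    for (int t = 1; t < num_threads; t++) {
        int rounded = (offsets[t] + CACHE_LINE - 1) / CACHE_LINE * CACHE_LINE;
        if (rounded > offsets[t + 1])   /* section too small to round */
            rounded = offsets[t + 1];
        offsets[t] = rounded;
    }
}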

>I am somewhat skeptical of this; given that there is sufficient
>regularity in the various index expressions to admit profitable
>parallelization, I suspect that there will be enough regularity to
>permit restructuring (some variant of tiling, striding, skewing etc.) of
>the iteration order to decrease the ping-pong behavior.

Another, and VERY important, example is adaptive grids. You start
off a grid as nicely and cleanly separated, but points near the
boundaries drift into other sections over time. As restructuring
is far more expensive than a normal iteration, you do it only when
the number of drifted points reaches a threshold (say 1-5%). The
actual number of cross-section accesses is small enough that it
doesn't impact the cache hit ratio significantly.

In case you aren't into that area, think finite element analysis of
moving structures - say a rotating jet engine, or the way that a tsunami
breaks over a protective structure (when the grid is of the water,
not the ground).
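
In outline, the policy is just this (the names and stubs are invented
for illustration, not from any real code):

#include <stddef.h>

struct grid { size_t npoints, drifted; /* ...points, sections... */ };

static void iterate(struct grid *g)     { (void)g;        /* cheap sweep */ }
static void repartition(struct grid *g) { g->drifted = 0; /* expensive   */ }

#define DRIFT_THRESHOLD 0.02            /* restructure at ~2% drift */

void step(struct grid *g)
{
    iterate(g);                         /* normal iteration every step */
    if ((double)g->drifted / (double)g->npoints > DRIFT_THRESHOLD)
        repartition(g);                 /* restructure only past threshold */
}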


Regards,
Nick Maclaren.
From: Mayan Moudgill on
nmm1(a)cam.ac.uk wrote:
> In article <ubGdnTyFcsjdnaTWnZ2dnUVZ_sGdnZ2d(a)bestweb.net>,
> Mayan Moudgill <mayan(a)bestweb.net> wrote:
>
>>Sorry, don't understand. Can you give a more concrete example? What do
>>you mean by "byte-separation accesses"?
>
>
> The example you give is one such, but consider this:
>
> unsigned char flags[1000000000];
> int offsets[num_threads + 1];
> /* This passes different sections of the flags to different threads,
>    for updating. */
> fred(&flags[offsets[thread]], offsets[thread+1] - offsets[thread]);
>
> The problem arises in the case where the offsets are determined by the
> algorithm and cannot conveniently be made multiples of the cache line
> size. Even when they can be, it is at best disgusting to have to build
> in such a system-specific value just to ensure that a program is
> correct. That is the absolute antithesis of portability!
>

Sorry; don't see how this suffers from any kind of degradation. Could
you elaborate?

So far this looks like the case where thread T handles the subarray
flags[offsets[T]:offsets[T+1]-1], which is pretty straightforward.
Is the degradation because the subarrays are unbalanced across threads?