From: Tamas K Papp on
Hi,

I am running Monte Carlo calculations which are inherently
parallelizable. I would like to run multiple threads to speed things
up, but I have no experience with threads.

A mock-up of the sequential version looks like this:

(defun run-chains (n)
  (let ((result (make-array n)))
    (dotimes (i n result)
      (setf (aref result i) (chain-generator)))))

For the parallel version, I would like to

1) start CHAIN-GENERATOR's in separate threads for each index,
2) have the result of each end up in the appropriate place in the vector, and
3) have RUN-CHAINS wait for each of the threads to terminate before returning.

I would prefer to use bordeaux-threads if possible, in order to make
my code portable. I have figured out how to do 1-2 with MAKE-THREAD
and closures, but I don't know how to deal with 3. I am assuming that
I need to use locks/mutexes, but I am not familiar with these
concepts. Hints or example code would be appreciated.

Best,

Tamas

From: Pascal Costanza on
On 02/08/2010 12:54, Tamas K Papp wrote:
> Hi,
>
> I am running Monte Carlo calculations which are inherently
> parallelizable. I would like to run multiple threads to speed things
> up, but I have no experience with threads.
>
> A mock-up of the sequential version looks like this:
>
> (defun run-chains (n)
>   (let ((result (make-array n)))
>     (dotimes (i n result)
>       (setf (aref result i) (chain-generator)))))
>
> For the parallel version, I would like to
>
> 1) start CHAIN-GENERATOR's in separate threads for each index,
> 2) have the result of each end up in the appropriate place in the vector, and
> 3) have RUN-CHAINS wait for each of the threads to terminate before returning.
>
> I would prefer to use bordeaux-threads if possible, in order to make
> my code portable. I have figured out how to do 1-2 with MAKE-THREAD
> and closures, but I don't know how to deal with 3. I am assuming that
> I need to use locks/mutexes, but I am not familiar with these
> concepts. Hints or example code would be appreciated.

1) Don't start a thread for each index. Rather, start as many threads
as you have processing units available, and have each of them work
sequentially on a subrange of the vector. Otherwise the runtime system
spends too much time coordinating threads that fight each other for
processing time, which you want to avoid.

2) Don't have each thread perform an assignment on the shared vector.
Rather report the results back to a coordinating thread, and let it
perform the assignment. Otherwise the vector has to be copied to the
caches of the processing units, and then back to main memory, which is
unnecessary.

3) You can use condition variables for coordinating the threads.
Create a single condition variable, paired with a lock and a counter
of finished workers, and make the thread that starts the worker
threads wait on the condition variable until the counter reaches n.
Each worker thread notifies the condition variable when it's done.
The counter matters: wakeups can be spurious and a notification can
arrive before the coordinator is waiting, so you loop on the counter
rather than counting waits.
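
Here is a rough, untested sketch with bordeaux-threads, one thread
per chain for brevity and assuming your CHAIN-GENERATOR (the BT:
functions are from the bordeaux-threads API):

```lisp
;; Sketch only: one worker per chain, a lock-protected counter of
;; unfinished workers, and a condition variable to wake the caller.
;; CHAIN-GENERATOR is assumed to be defined elsewhere.
(defun run-chains (n)
  (let ((result (make-array n))
        (lock (bt:make-lock))
        (done (bt:make-condition-variable))
        (remaining n))
    (dotimes (i n)
      (let ((i i))                          ; capture the current index
        (bt:make-thread
         (lambda ()
           (let ((chain (chain-generator))) ; compute outside the lock
             (bt:with-lock-held (lock)
               (setf (aref result i) chain)
               (decf remaining)
               (bt:condition-notify done)))))))
    ;; Loop on the predicate, not on a fixed number of waits:
    ;; wakeups can be spurious, and a worker may notify before the
    ;; coordinator has started waiting.
    (bt:with-lock-held (lock)
      (loop until (zerop remaining)
            do (bt:condition-wait done lock)))
    result))
```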

I find the abstractions bordeaux-threads provides a bit too low-level
and too thin. What LispWorks 6.0 provides is much more convenient. There
I would use mailboxes for the coordination, or maybe barriers. One
option is to implement a mailbox abstraction on top of condition
variables yourself.

Information about the synchronization constructs in LispWorks can be
found here:
http://www.lispworks.com/documentation/lw60/LW/html/lw-275.htm#pgfId-893780

Final note: Parallelization only pays off if what you parallelize is
computationally intensive. If chain-generator doesn't perform a lot of
computation, the overhead of parallelizing and coordinating is probably
higher than just performing the sequential version.



Pascal

--
My website: http://p-cos.net
Common Lisp Document Repository: http://cdr.eurolisp.org
Closer to MOP & ContextL: http://common-lisp.net/project/closer/
From: D Herring on
On 08/02/2010 07:30 AM, Pascal Costanza wrote:
> On 02/08/2010 12:54, Tamas K Papp wrote:
>> I am running Monte Carlo calculations which are inherently
>> parallelizable. I would like to run multiple threads to speed things
>> up, but I have no experience with threads.
...
> 2) Don't have each thread perform an assignment on the shared vector.
> Rather report the results back to a coordinating thread, and let it
> perform the assignment. Otherwise the vector has to be copied to the
> caches of the processing units, and then back to main memory, which is
> unnecessary.

As long as he isn't interleaving the ranges, this shouldn't be
necessary; a direct write is just as efficient and avoids an extra
copy (I don't see why the whole vector would be copied into each
thread's cache). The main thread allocates the problem and answer
space, and each worker thread is responsible for filling its subrange.

You do want to avoid multiple CPUs writing to the same cache lines.
Ensuring consistency requires extra memory bus synchronization, and
can easily slow things down 10x.

i.e., with T threads and K indices per thread, thread N (counting
from 0) should process the contiguous block (NK NK+1 ... NK+K-1)
and not the strided set (N N+T N+2T ...).
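
A sketch of the blocked version (untested; BT:JOIN-THREAD is the
simplest portable way to wait here, and CHAIN-GENERATOR is again
assumed):

```lisp
;; Sketch: THREADS workers, each filling its own contiguous block of
;; RESULT, so no two CPUs write to the same cache lines.
(defun run-chains (n threads)
  (let* ((result (make-array n))
         (block-size (ceiling n threads))
         (workers
           (loop for start from 0 below n by block-size
                 collect (let ((start start)
                               (end (min n (+ start block-size))))
                           (bt:make-thread
                            (lambda ()
                              (loop for i from start below end
                                    do (setf (aref result i)
                                             (chain-generator)))))))))
    (mapc #'bt:join-thread workers)     ; block until all workers exit
    result))
```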


> I find the abstractions bordeaux-threads provides a bit too low-level
> and too thin. What LispWorks 6.0 provides is much more convenient.
> There I would use mailboxes for the coordination, or maybe barriers.
> One option is to implement a mailbox abstraction on top of condition
> variables yourself.

Funny -- there are many missing features; but I miss the low-level
ones! (e.g. memory barriers, atomic ops, ...) For this problem, a
single condition variable/counting semaphore is quite sufficient.

The above algorithm may be described as "hand me your paper when
you've finished." This inter-person synchronization may be
unnecessary. An alternative is one of the simplest lock-free
algorithms: "last one out, turn off the lights". It slightly inverts
the control structure to avoid locking synchronization on thread exit.
It is employed by the popular C++ shared_ptr.

The main thread initializes the result space with a count of the
threads and a continuation. As each thread exits, it performs an
atomic decrement on the thread count. The thread that decrements the
count to 0 calls the continuation. [The shared_ptr actually
initializes the count to 1; new references increment the count; scope
exits decrement the count.]
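
In Lisp, a sketch might look like this. It is SBCL-specific, since
bordeaux-threads exposes no portable atomic operations;
SB-EXT:ATOMIC-DECF works on a struct slot with a declared raw type
and returns the value the slot held before the decrement.
CHAIN-GENERATOR is again assumed:

```lisp
;; "Last one out turns off the lights", sketched with SBCL atomics.
(defstruct job
  (count 0 :type sb-ext:word)  ; ATOMIC-DECF needs a declared raw slot
  (continuation nil :type (or null function)))

(defun worker (job i result)
  (setf (aref result i) (chain-generator))
  ;; ATOMIC-DECF returns the value *before* the decrement, so seeing 1
  ;; means this thread took the count to 0 and is the last one out.
  (when (= 1 (sb-ext:atomic-decf (job-count job)))
    (funcall (job-continuation job) result)))

(defun run-chains (n continuation)
  (let* ((result (make-array n))
         (job (make-job :count n :continuation continuation)))
    (dotimes (i n)
      (let ((i i))
        (bt:make-thread (lambda () (worker job i result)))))
    ;; Returns immediately; CONTINUATION runs in whichever worker
    ;; thread finishes last.
    result))
```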

- Daniel
From: Pascal Costanza on
On 02/08/2010 16:13, D Herring wrote:
> On 08/02/2010 07:30 AM, Pascal Costanza wrote:
>> On 02/08/2010 12:54, Tamas K Papp wrote:
>>> I am running Monte Carlo calculations which are inherently
>>> parallelizable. I would like to run multiple threads to speed things
>>> up, but I have no experience with threads.
> ...
>> 2) Don't have each thread perform an assignment on the shared vector.
>> Rather report the results back to a coordinating thread, and let it
>> perform the assignment. Otherwise the vector has to be copied to the
>> caches of the processing units, and then back to main memory, which is
>> unnecessary.
>
> As long as he isn't interleaving the ranges, this shouldn't be
> necessary; a direct write is just as efficient and avoids an extra copy
> (I don't see why the whole vector would be copied into each thread's
> cache). The main thread allocates the problem and answer space, and each
> worker thread is responsible for filling its subrange.
>
> You do want to avoid multiple CPUs writing to the same cache lines.
> Ensuring consistency requires extra memory bus synchronization, and can
> easily slow things down 10x.
>
> i.e., with T threads and K indices per thread, thread N (counting
> from 0) should process the contiguous block (NK NK+1 ... NK+K-1)
> and not the strided set (N N+T N+2T ...).

OK

>> I find the abstractions bordeaux-threads provides a bit too low-level
>> and too thin. What LispWorks 6.0 provides is much more convenient.
>> There I would use mailboxes for the coordination, or maybe barriers.
>> One option is to implement a mailbox abstraction on top of condition
>> variables yourself.
>
> Funny -- there are many missing features; but I miss the low-level ones!
> (e.g. memory barriers, atomic ops, ...) For this problem, a single
> condition variable/counting semaphore is quite sufficient.

OK, OK. ;) Both more high-level and low-level constructs are missing. (I
find condition variables to be inconvenient to use, because they are too
low-level, compared to mailboxes.)

> The above algorithm may be described as "hand me your paper when you've
> finished." This inter-person synchronization may be unnecessary. An
> alternative is one of the simplest lock-free algorithms: "last one out,
> turn off the lights". It slightly inverts the control structure to avoid
> locking synchronization on thread exit. It is employed by the popular
> C++ shared_ptr.
>
> The main thread initializes the result space with a count of the threads
> and a continuation. As each thread exits, it performs an atomic
> decrement on the thread count. The thread that decrements the count to 0
> calls the continuation. [The shared_ptr actually initializes the count
> to 1; new references increment the count; scope exits decrement the count.]

Interesting...

Pascal

--
My website: http://p-cos.net
Common Lisp Document Repository: http://cdr.eurolisp.org
Closer to MOP & ContextL: http://common-lisp.net/project/closer/
From: Tamas K Papp on
On Mon, 02 Aug 2010 10:13:19 -0400, D Herring wrote:

> On 08/02/2010 07:30 AM, Pascal Costanza wrote:
>> On 02/08/2010 12:54, Tamas K Papp wrote:
>>> I am running Monte Carlo calculations which are inherently
>>> parallelizable. I would like to run multiple threads to speed things
>>> up, but I have no experience with threads.
> ...
>> 2) Don't have each thread perform an assignment on the shared vector.
>> Rather report the results back to a coordinating thread, and let it
>> perform the assignment. Otherwise the vector has to be copied to the
>> caches of the processing units, and then back to main memory, which is
>> unnecessary.
>
> As long as he isn't interleaving the ranges, this shouldn't be
> necessary; a direct write is just as efficient and avoids an extra copy
> (I don't see why the whole vector would be copied into each thread's
> cache). The main thread allocates the problem and answer space, and
> each worker thread is responsible for filling its subrange.
>
> You do want to avoid multiple CPUs writing to the same cache lines.
> Ensuring consistency requires extra memory bus synchronization, and can
> easily slow things down 10x.
>
> i.e., with T threads and K indices per thread, thread N (counting
> from 0) should process the contiguous block (NK NK+1 ... NK+K-1)
> and not the strided set (N N+T N+2T ...).
>
>
>> I find the abstractions bordeaux-threads provides a bit too low-level
>> and too thin. What LispWorks 6.0 provides is much more convenient.
>> There I would use mailboxes for the coordination, or maybe barriers.
>> One option is to implement a mailbox abstraction on top of condition
>> variables yourself.
>
> Funny -- there are many missing features; but I miss the low-level ones!
> (e.g. memory barriers, atomic ops, ...) For this problem, a single
> condition variable/counting semaphore is quite sufficient.
>
> The above algorithm may be described as "hand me your paper when you've
> finished." This inter-person synchronization may be unnecessary. An
> alternative is one of the simplest lock-free algorithms: "last one out,
> turn off the lights". It slightly inverts the control structure to
> avoid locking synchronization on thread exit.
> It is employed by the popular C++ shared_ptr.
>
> The main thread initializes the result space with a count of the threads
> and a continuation. As each thread exits, it performs an atomic
> decrement on the thread count. The thread that decrements the count to
> 0 calls the continuation. [The shared_ptr actually initializes the
> count to 1; new references increment the count; scope exits decrement
> the count.]

Thanks for both answers. I have to admit that this is still over my
head: I am beginning to grasp the idea behind the two algorithms, but
I don't know how to implement them. E.g., in the "last one out"
algorithm, I think I need some kind of counter, but if multiple
threads access it, how do I make sure they don't mess it up? A simple
example would be extremely helpful (preferably using a portable thread
library so I can replicate it on SBCL).

By the way, the total number of independent chains is small (3-7),
each returning a large (10000x100) array of double-floats (which,
AFAIK, is passed by reference in CL, so the assignment itself should
be cheap), so I am not worried about the cost of assignment.

Thanks,

Tamas