From: Pascal Costanza on
On 03/08/2010 12:50, Tamas K Papp wrote:
> On Mon, 02 Aug 2010 10:13:19 -0400, D Herring wrote:
>
>> On 08/02/2010 07:30 AM, Pascal Costanza wrote:
>>> On 02/08/2010 12:54, Tamas K Papp wrote:
>>>> I am running Monte Carlo calculations which are inherently
>>>> parallelizable. I would like to run multiple threads to speed things
>>>> up, but I have no experience with threads.
>> ...
>>> 2) Don't have each thread perform an assignment on the shared vector.
>>> Rather report the results back to a coordinating thread, and let it
>>> perform the assignment. Otherwise the vector has to be copied to the
>>> caches of the processing units, and then back to main memory, which is
>>> unnecessary.
>>
>> As long as he isn't interleaving the ranges, this shouldn't be
>> necessary; a direct write is just as efficient and avoids an extra copy
>> (I don't see why the whole vector would be copied into each thread's
>> cache). The main thread allocates the problem and answer space, and
>> each worker thread is responsible for filling its subrange.
>>
>> You do want to avoid multiple CPUs writing to the same cache lines.
>> Ensuring consistency requires extra memory bus synchronization, and can
>> easily slow things down 10x.
>>
>> i.e. Thread N should process result indices (N+K N+K+1 ... N+K+N-1) and
>> not (N 2N ... KN).
>>
>>
>>> I find the abstractions bordeaux-threads provides a bit too low-level
>>> and too thin. What LispWorks 6.0 provides is much more convenient.
>>> There I would use mailboxes for the coordination, or maybe barriers.
>>> One option is to implement a mailbox abstraction on top of condition
>>> variables yourself.
>>
>> Funny -- there are many missing features; but I miss the low-level ones!
>> (e.g. memory barriers, atomic ops, ...) For this problem, a single
>> condition variable/counting semaphore is quite sufficient.
>>
>> The above algorithm may be described as "hand me your paper when you've
>> finished." This inter-person synchronization may be unnecessary. An
>> alternative is one of the simplest lock-free algorithms: "last one out,
>> turn off the lights". It slightly inverts the control structure to
>> avoid locking synchronization on thread exit.
>> It is employed by the popular C++ shared_ptr.
>>
>> The main thread initializes the result space with a count of the threads
>> and a continuation. As each thread exits, it performs an atomic
>> decrement on the thread count. The thread that decrements the count to
>> 0 calls the continuation. [The shared_ptr actually initializes the
>> count to 1; new references increment the count; scope exits decrement
>> the count.]
>
> Thanks for both answers. I have to admit that this is still over my
> head: I am beginning to grasp the idea behind the two algorithms, but
> don't know how to implement them. Eg in the "last one out" algorithm,
> I think I have to use some kind of counters, but if multiple threads
> access these, how do I make sure that they don't mess up? A simple
> example would be extremely helpful (preferably using a portable thread
> library so I can replicate it on SBCL).
>
> By the way, the total number of independent chains is small (3-7),
> each returning a large (10000x100) array of double-floats (which,
> AFAIK, is passed as a pointer in CL so it shouldn't be a big deal), so
> I am not worried about the cost of assignment.

Here is a sketch of the "hand me your paper" solution in LispWorks:

(defmacro parallel-dotimes ((var iterations &optional result) &body body)
`(parallel-dotimes-f
,iterations
(lambda (,var)
(declare (ignorable ,var))
,@body)
(lambda (,var)
(declare (ignorable ,var))
,result)))

(defconstant +nof-processors+ 4)

(defun parallel-dotimes-f (iterations loopf resultf)
(if (< iterations +nof-processors+)
(dotimes (i iterations (funcall resultf i))
(funcall loopf i))
(loop with range-size = (ceiling iterations +nof-processors+)
with mailbox = (mp:make-mailbox)
for start by range-size
for end = (+ start range-size)
for count
while (< start iterations) do
(mp:process-run-function
"parallel-dotimes" '()
(lambda (start end)
(loop for i from start below end do (funcall loopf i))
(mp:mailbox-send mailbox t))
start (min end iterations))
finally
(loop repeat count do (mp:mailbox-read mailbox))
(return (funcall resultf iterations)))))

(defun run-chains (n)
(let ((result (make-array n)))
(parallel-dotimes (i n result)
(setf (aref result i) (random 100000)))))

It's a good idea to abstract away the parallel version of dotimes, so
you can later change it to a different parallelization strategy. If
there are more places in your code where you want to use parallel
execution, it's probably very worthwhile to factor out the creation of
the processes and have them available as worker processes to which you
can "send" work as needed.

Something like http://common-lisp.net/project/eager-future/ or
http://userweb.cs.utexas.edu/users/ragerdl/parallelism/ already does
this for you, for example.


Pascal

--
My website: http://p-cos.net
Common Lisp Document Repository: http://cdr.eurolisp.org
Closer to MOP & ContextL: http://common-lisp.net/project/closer/
From: D Herring on
On 08/03/2010 06:50 AM, Tamas K Papp wrote:

> I think I have to use some kind of counters, but if multiple threads
> access these, how do I make sure that they don't mess up? A simple
> example would be extremely helpful (preferably using a portable thread
> library so I can replicate it on SBCL).

Unfortunately, the required operations are not in any portable library
I'm aware of. Many CL implementations lack them entirely. That said,
here's how its done elsewhere.


Most CPUs provide a "compare and swap" operation. It is beautiful
once you understand how to use it.

Here's the GCC builtin command.
http://gcc.gnu.org/onlinedocs/gcc/Atomic-Builtins.html
bool __sync_bool_compare_and_swap (type *ptr, type oldval type newval)
type __sync_val_compare_and_swap (type *ptr, type oldval type newval)

The function takes three arguments: a place, the value you think it
contains, and a new value to put there. If the oldval is in fact the
current value when the function is called, then it will be replaced by
newval.

The bool form returns a flag indicating whether the assignment was
made; the other form returns the value observed by the function.


The take home message is: You can atomically make a change,
conditional on your thread having an accurate view of the world.

So an atomic addition may be implemented as something like the
following unoptimized code.

int atomic_add(int *place, int increment)
{
int x=*place;
while(true)
{
int sum=x+increment;
int y=__sync_val_compare_and_swap(place, x, sum);
if(y==x) return y;
x=y;
}
}


Many CPUs also provide an atomic increment and/or decrement operator.
Here you don't care what value currently exists; you just want it to
be changed correctly. Again, GCC has builtins for this.
type __sync_fetch_and_add (type *ptr, type value)
type __sync_fetch_and_sub (type *ptr, type value)

These are roughly equivalent to the above atomic_add; return the
current value of the place, and modify it correctly.


These two operations are the key building blocks in "lock-free"
algorithms. Since they have direct hardware support, they are
generally 10x faster than constructs like mutexes and semaphores
(which generally involve kernel context switching on top of internal
atomic ops). However, a CAS is generally 10x slower than simply
setting the memory location.

Later,
Daniel
From: Pascal Costanza on
On 05/08/2010 06:21, D Herring wrote:
> On 08/03/2010 06:50 AM, Tamas K Papp wrote:
>
>> I think I have to use some kind of counters, but if multiple threads
>> access these, how do I make sure that they don't mess up? A simple
>> example would be extremely helpful (preferably using a portable thread
>> library so I can replicate it on SBCL).
>
> Unfortunately, the required operations are not in any portable library
> I'm aware of. Many CL implementations lack them entirely.

Here is an attempt at doing this in LispWorks 6.0:

(defmacro parallel-dotimes
((var iterations &optional result) &body body)
`(parallel-dotimes-f
,iterations
(lambda (,var)
(declare (ignorable ,var))
,@body)
(lambda (,var)
(declare (ignorable ,var))
,result)))

(defconstant +nof-processors+ 4)

(defun parallel-dotimes-f (iterations loopf resultf)
(if (< iterations +nof-processors+)
(dotimes (i iterations (funcall resultf i))
(funcall loopf i))
(loop with range-size = (ceiling iterations +nof-processors+)
with mailbox = (mp:make-mailbox)
with count = (vector +nof-processors+)
for start by range-size
for end = (+ start range-size)
while (< start iterations) do
(mp:process-run-function
"parallel-dotimes" '()
(lambda (start end)
(loop for i from start below end do (funcall loopf i))
(when (zerop (sys:atomic-decf (svref count 0)))
(mp:mailbox-send mailbox t)))
start (min end iterations))
finally
(mp:mailbox-read mailbox)
(return (funcall resultf iterations)))))

(defun run-chains (n)
(let ((result (make-array n)))
(parallel-dotimes (i n result)
(setf (aref result i) (random 100000)))))

This uses atomic-decf, but compare-and-swap is also available in
LispWorks. At the moment, these operations only work on global
variables, and components of vectors, structs, etc., but not on lexical
variables. That's why the count variable is boxed in a vector.


Pascal

--
My website: http://p-cos.net
Common Lisp Document Repository: http://cdr.eurolisp.org
Closer to MOP & ContextL: http://common-lisp.net/project/closer/
From: D Herring on
On 08/05/2010 07:40 AM, Pascal Costanza wrote:
> On 05/08/2010 06:21, D Herring wrote:
>> On 08/03/2010 06:50 AM, Tamas K Papp wrote:
>>
>>> I think I have to use some kind of counters, but if multiple threads
>>> access these, how do I make sure that they don't mess up? A simple
>>> example would be extremely helpful (preferably using a portable thread
>>> library so I can replicate it on SBCL).
>>
>> Unfortunately, the required operations are not in any portable library
>> I'm aware of. Many CL implementations lack them entirely.
>
> Here is an attempt at doing this in LispWorks 6.0:

Looks pretty good. It could be a drop-in upgrade for existing code
since it maintains a "single-threaded flavor". One thread starts
parallel-dotimes and waits for the result. Is that thread special, or
can it store a continuation to be executed by whatever thread happens
to finish last?


> This uses atomic-decf, but compare-and-swap is also available in
> LispWorks.

That's good. Atomic-decf should be more efficient than the
compare-and-swap loop; I just wrote that as an illustration.


> At the moment, these operations only work on global
> variables, and components of vectors, structs, etc., but not on
> lexical variables. That's why the count variable is boxed in a vector.

I think that will be true for any CL implementation; memory used by
multiple threads should be on the heap, thus it can't be stored in a
single thread's stack.


BTW, there are other tricks to play. Thread creation can be
expensive; it can be cheaper to populate a "thread pool" and feed it
tasks. Also, if one PARALLEL-DOTIMES calls code containing another,
then too many threads may be spawned. Finding efficient ways to slice
a problem can be difficult.

One trick I heard (IIRC, it was explained in the Sun Fortress
presentation at the Boston Lisp Meeting) was called something like
"task stealing". The idea was to have a thread pool milling about a
central task list. When a thread starts to do something which may
take a long time, it keeps a reasonable chunk for itself (e.g. by
bisection) and pushes the rest onto the task list. An idle thread may
then take that remaining task, possibly subdividing it further.


Later,
Daniel
From: Pascal Costanza on
On 06/08/2010 04:01, D Herring wrote:
> On 08/05/2010 07:40 AM, Pascal Costanza wrote:
>> On 05/08/2010 06:21, D Herring wrote:
>>> On 08/03/2010 06:50 AM, Tamas K Papp wrote:
>>>
>>>> I think I have to use some kind of counters, but if multiple threads
>>>> access these, how do I make sure that they don't mess up? A simple
>>>> example would be extremely helpful (preferably using a portable thread
>>>> library so I can replicate it on SBCL).
>>>
>>> Unfortunately, the required operations are not in any portable library
>>> I'm aware of. Many CL implementations lack them entirely.
>>
>> Here is an attempt at doing this in LispWorks 6.0:
>
> Looks pretty good. It could be a drop-in upgrade for existing code since
> it maintains a "single-threaded flavor". One thread starts
> parallel-dotimes and waits for the result. Is that thread special, or
> can it store a continuation to be executed by whatever thread happens to
> finish last?

I guess the continuation could be executed by whatever thread finishes
last, but I'm not sure if that buys us anything. We have to wait for
that final result anyway.

We also always have to watch out for the dynamic environment in Common
Lisp, so I prefer to have what you call a "single-threaded flavor."

>> At the moment, these operations only work on global
>> variables, and components of vectors, structs, etc., but not on
>> lexical variables. That's why the count variable is boxed in a vector.
>
> I think that will be true for any CL implementation; memory used by
> multiple threads should be on the heap, thus it can't be stored in a
> single thread's stack.

If a closure is passed around, the lexical variables are also stored on
the heap. Two or more closures closing over the same environments could
be executed in different threads, and then compare-and-swap and friends
become relevant for lexical variables.

> BTW, there are other tricks to play. Thread creation can be expensive;
> it can be cheaper to populate a "thread pool" and feed it tasks. Also,
> if one PARALLEL-DOTIMES calls code containing another, then too many
> threads may be spawned. Finding efficient ways to slice a problem can be
> difficult.

Indeed.

> One trick I heard (IIRC, it was explained in the Sun Fortress
> presentation at the Boston Lisp Meeting) was called something like "task
> stealing". The idea was to have a thread pool milling about a central
> task list. When a thread starts to do something which may take a long
> time, it keeps a reasonable chunk for itself (e.g. by bisection) and
> pushes the rest onto the task list. An idle thread may then take that
> remaining task, possibly subdividing it further.

The actual term is "work stealing". It was originally used in MultiLisp
and QLisp, and then picked up by Cilk and Cilk++. The Cilk and Cilk++
versions have what seems to be the most mature versions of work
stealing, and they have backed their approach with a series of excellent
papers that show why work stealing is indeed an extremely good approach
for parallelization.

I'm currently experimenting with implementations of that idea in
LispWorks, and hope to be able to publish a library for doing this...


Pascal

--
My website: http://p-cos.net
Common Lisp Document Repository: http://cdr.eurolisp.org
Closer to MOP & ContextL: http://common-lisp.net/project/closer/