From: Peter Valdemar Mørch on
Commenting on Ben's post out of order:
> > $pl->async {
> > bla_bla_bla();
> > }
> > This syntax could easily co-exist with the $pl->foreach and $pl->while
> > syntax.
>
> Not like that it can't, since methods don't have prototypes.
....
> If you want a method call it would have to look like
>
> $pl->async(sub { ... });

Yes you're right, of course.

> > I'm worried though that people will forget to call $pl->joinAll()!
>
> Stick it in DESTROY.

I don't see how that would help. I'm thinking of a user writing
something like:

my %results;
$pl->share(\%results);
foreach (0..4) {
    $pl->async(sub { $results{$_} = foobar($_) });
}
$pl->joinAll();
useResults(\%results);

In this case, at the time of the call to useResults, %results will
contain the finished results from all forked processes, because
$pl->joinAll() waits for them all to finish. If $pl->joinAll() doesn't
get called, the user will most likely see an empty %results. I don't
see how DESTROY comes into play here or could help.

> They're not global. %output can be scoped as tightly as you like around
> the async call: async takes a closure, so it will make available (either
> shared or as copies) any lexicals in scope at the time. (This is why $_
> won't work: it isn't a lexical.)

I think I haven't made my concern clear. Is it possible to do:

my %resultsForCalc1 : Shared($pl1);

and have the sharing associated with a particular Parallel::Loops
instance (so my attribute handler gets a reference to $pl1, not the
string '$pl1')?

If so, cool. Don't read any further, I'm satisfied (BTW, how?). If
not, let's say one does this:

my %resultsForCalc1 : Shared;
my $pl1 = Parallel::Loops->new(4);
$pl1->foreach([0..9], sub {
    $resultsForCalc1{$_} = doSomething($_);
});
useResults(\%resultsForCalc1);

# Block above duplicated, just s/1/2/g
my %resultsForCalc2 : Shared;
my $pl2 = Parallel::Loops->new(4);
$pl2->foreach([0..9], sub {
    $resultsForCalc2{$_} = doSomething($_);
});
useResults(\%resultsForCalc2);

Wouldn't the list ( \%resultsForCalc1, \%resultsForCalc2 ) have to be
global? How would I/perl keep track of the fact that the user only
wants %resultsForCalc1 shared in the first calculation and only
%resultsForCalc2 in the second?

By the way, how would one prevent %foo from being treated as shared
in the following case, given that it has gone out of scope?

{
    my %foo : Shared;
}
my %resultsForCalc1 : Shared;
my $pl1 = Parallel::Loops->new(4);
$pl1->foreach([0..9], sub {
    $resultsForCalc1{$_} = doSomething($_);
});
useResults(\%resultsForCalc1);

I don't (yet?) see how I can detect which of the hashes with the
"Shared" attribute are in scope at the time of the $pl1->foreach()
call.

But even if I could detect which of the shared hashes were in scope
"now", that may not be what the user wants. There could be other
reasons the user wants %resultsForCalc1 (from way above) in an outer
scope without having it shared in some of the calculations where it
happens to be in scope.

Perhaps we're getting a little off-topic here, but now I'm curious
about the attributes business! ;-)

Peter

From: Ben Morrow on

Quoth Peter Valdemar Mørch <4ux6as402(a)sneakemail.com>:
> Commenting on Ben's post out of order:
>
> > > I'm worried though that people will forget to call $pl->joinAll()!
> >
> > Stick it in DESTROY.
>
> I don't see how that would help. I'm thinking of a user writing
> something like:
>
> my %results;
> $pl->share(\%results);
> foreach (0..4) {
>     $pl->async(sub { $results{$_} = foobar($_) });
> }
> $pl->joinAll();
> useResults(\%results);
>
> In this case, at the time of the call to useResults, %results will
> contain the finished results from all forked processes, because
> $pl->joinAll() waits for them all to finish. If $pl->joinAll() doesn't
> get called, the user will most likely see an empty %results. I don't
> see how DESTROY comes into play here or could help.

Well, if the user wrote

my %results;
{
    my $pl = Parallel::Loops->new;
    $pl->share(\%results);
    $pl->async(sub { $results{$_} = foobar($_) })
        for 0..4;
}
useResults \%results;

then a call to ->joinAll in DESTROY would ensure it was called. Since
variables (particularly those containing potentially-expensive objects,
like $pl) should be minimally scoped, this would be the correct way to
write that code.
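As a minimal sketch of what Ben is suggesting (the `pending` field name
is made up for illustration, not Parallel::Loops' actual internals),
the destructor could simply delegate to joinAll:

```perl
package Parallel::Loops;

# Hypothetical sketch: auto-join in the destructor, so that the $pl
# object going out of scope guarantees joinAll has run.
sub DESTROY {
    my ($self) = @_;
    # Harmless if joinAll was already called explicitly, since the
    # pending list would then be empty.
    $self->joinAll if @{ $self->{pending} || [] };
}
```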

> > They're not global. %output can be scoped as tightly as you like around
> > the async call: async takes a closure, so it will make available (either
> > shared or as copies) any lexicals in scope at the time. (This is why $_
> > won't work: it isn't a lexical.)
>
> I think I haven't made my concern clear. Is it possible to do:
>
> my %resultsForCalc1 : Shared($pl1);
>
> and have the sharing associated with a particular Parallel::Loops
> instance (so my attribute handler gets a reference to $pl1, not the
> string '$pl1')?

Not easily. Apart from anything else, attribute declarations are
processed at compile-time, before your objects have been constructed.
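A quick sketch of why, assuming an Attribute::Handlers-based
implementation (which the module may or may not end up using): the
attribute's argument reaches the handler as data derived from the
literal source text, so it can carry the string '$pl1' but never the
runtime object $pl1 itself.

```perl
use strict;
use warnings;
use Attribute::Handlers;

# For a `my` variable the $symbol argument is just the string
# 'LEXICAL'; $data comes from the attribute's literal source text,
# not from any object constructed later at runtime.
sub Shared : ATTR(HASH) {
    my ($pkg, $symbol, $referent, $attr, $data) = @_;
    # $referent is a reference to the attributed hash.
}

my %resultsForCalc1 : Shared;
```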

I was still looking at the question 'why aren't you simply using
forks?'. forks handles all this for you.

> If so, cool. Don't read any further, I'm satisfied (BTW, how?). If
> not, let's say one does this:
>
> my %resultsForCalc1 : Shared;
> my $pl1 = Parallel::Loops->new(4);
> $pl1->foreach([0..9], sub {
>     $resultsForCalc1{$_} = doSomething($_);
> });
> useResults(\%resultsForCalc1);
>
> # Block above duplicated, just s/1/2/g
> my %resultsForCalc2 : Shared;
> my $pl2 = Parallel::Loops->new(4);
> $pl2->foreach([0..9], sub {
>     $resultsForCalc2{$_} = doSomething($_);
> });
> useResults(\%resultsForCalc2);
>
> Wouldn't the list ( \%resultsForCalc1, \%resultsForCalc2 ) have to be
> global?

When you say 'global' you mean 'shared in all P::L instances', right?
Is this a problem? Since (presumably) you would be tying the variable in
the attr handler, just make sure DESTROY and UNTIE for the tied object
take it off the current list. That way, when the shared variable goes
out of scope it will no longer be considered a candidate for sharing.

(You don't even need to do that if you just weaken the refs in your
master list. Perl will replace any that go out of scope with undef.)

I don't know how P::L deals with copying the results back. Presumably
you have no idea whether a variable has been modified in the sub-process
or not? What do you do if two sub-processes change the same shared var
in different ways?

> How would I/perl keep track of the fact that the user only wants
> %resultsForCalc1 shared in the first calculation and only
> %resultsForCalc2 in the second?
>
> By the way, how would one prevent %foo from being treated as shared
> in the following case, given that it has gone out of scope?
>
> {
>     my %foo : Shared;
> }
> my %resultsForCalc1 : Shared;
> my $pl1 = Parallel::Loops->new(4);
> $pl1->foreach([0..9], sub {
>     $resultsForCalc1{$_} = doSomething($_);
> });
> useResults(\%resultsForCalc1);
>
> I don't (yet?) see how I can detect which of the hashes with the
> "Shared" attribute are in scope at the time of the $pl1->foreach()
> call.
>
> But even if I could detect which of the shared hashes were in scope
> "now", that may not be what the user wants. There could be other
> reasons the user wants %resultsForCalc1 (from way above) in an outer
> scope without having it shared in some of the calculations where it
> happens to be in scope.
>
> Perhaps we're getting a little off-topic here, but now I'm curious
> about the attributes business! ;-)

Not OT at all.

FWIW, I would cast this API rather differently. You don't seem to be
trying to emulate the forks API of 'you can do anything you like', but
instead restricting yourself to iterating over a list. In that case, why
not have the API like

my $PL = Parallel::Loops->new(sub { dosomething($_) });
my %results = $PL->foreach(0..9);

No need for any tying, and there's no chance of forgetting the
'->joinAll' since you don't get the results until it's been done. (The
subproc that runs the closure will, of course, get a COW copy of
anything currently in scope, so there's no need to worry about sharing
'read-only' data.)

Ben

From: Peter Valdemar Mørch on
On Jun 26, 10:52 pm, Ben Morrow <b...(a)morrow.me.uk> wrote:
> I was still looking at the question 'why aren't you simply using
> forks?'. forks handles all this for you.

Well, because I don't want the forks API. I want the foreach
syntax. :-) The main reason is that it is so much easier to write,
and to read later on.

I could've implemented it using forks, but I didn't. forks _is_
mentioned in the "SEE ALSO" section, so users have a chance to explore
alternatives.

> When you say 'global' you mean 'shared in all P::L instances', right?

Yes.

> Is this a problem?

A little bit. To me, that speaks in favor of

my %output;
$pl->share(\%output);

over

my %output : Shared;

(apart from the fact that $pl->share() seems much simpler to
understand and implement)

> (You don't even need to do that if you just weaken the refs in your
> master list. Perl will replace any that go out of scope with undef.)

Ah, good point.

> I don't know how P::L deals with copying the results back. Presumably
> you have no idea whether a variable has been modified in the sub-process
> or not? What do you do if two sub-processes change the same shared var
> in different ways?

I've mentioned in the pod that only setting hash keys and pushing to
arrays is supported in the child. I'll add that when the same key is
set from different iterations, a random one of them is preserved.

> FWIW, I would cast this API rather differently.

Yeah, I'm beginning to gather that! :-) Fine, you won't be one of
P::L's users I take it...

> You don't seem to be
> trying to emulate the forks API of 'you can do anything you like', but
> instead restricting yourself to iterating over a list.

Exactly.

> In that case, why not have the API like
>
> my $PL = Parallel::Loops->new(sub { dosomething($_) });
> my %results = $PL->foreach(0..9);

I guess if I change that to:

my $PL = Parallel::Loops->new( 4 );
my %results = $PL->foreach( [0..9], sub {
    ( $_ => dosomething($_) )
});

We could be in business. I'm presuming I can use wantarray() in the
foreach method to test whether the caller is going to use the return
value, and only transfer the return value from the child if it is. It
kind of breaks the analogy with foreach, but doesn't hurt otherwise,
so why not.
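wantarray() does support exactly that test; a tiny standalone
illustration:

```perl
use strict;
use warnings;

# wantarray() reports the caller's context, so a foreach() method
# could skip shipping results back from the children in void context.
sub context {
    return 'void'   unless defined wantarray;
    return 'list'   if wantarray;
    return 'scalar';
}

my @r = context();   # 'list'
my $s = context();   # 'scalar'
context();           # here wantarray is undef: no results needed
```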

> Well, if the user wrote
>
> my %results;
> {
>     my $pl = Parallel::Loops->new;
>     $pl->share(\%results);
>     $pl->async(sub { $results{$_} = foobar($_) })
>         for 0..4;
> }
> useResults \%results;
>
> > then a call to ->joinAll in DESTROY would ensure it was called. Since
> > variables (particularly those containing potentially-expensive objects,
> > like $pl) should be minimally scoped, this would be the correct way to
> > write that code.

I don't understand how that can be guaranteed. perldoc perltoot says:

> Perl's notion of the right time to call a destructor is not well-defined
> currently, which is why your destructors should not rely on when they
> are called.

Given that, how can I be sure that DESTROY has been called at the
time of the useResults call?

Peter
From: Ben Morrow on

Quoth Peter Valdemar Mørch <4ux6as402(a)sneakemail.com>:
> On Jun 26, 10:52 pm, Ben Morrow <b...(a)morrow.me.uk> wrote:
> > I was still looking at the question 'why aren't you simply using
> > forks?'. forks handles all this for you.
>
> Well, because I don't want the forks API. I want the foreach
> syntax. :-) The main reason is that it is so much easier to write and
> read later on.

OK.

> > You don't seem to be
> > trying to emulate the forks API of 'you can do anything you like', but
> > instead restricting yourself to iterating over a list.
>
> Exactly.
>
> > In that case, why not have the API like
> >
> > my $PL = Parallel::Loops->new(sub { dosomething($_) });
> > my %results = $PL->foreach(0..9);
>
> I guess if I change that to:
>
> my $PL = Parallel::Loops->new( 4 );
> my %results = $PL->foreach( [0..9], sub {
>     ( $_ => dosomething($_) )
> });
>
> We could be in business. I'm presuming I can use wantarray() in the
> foreach method to test whether the caller is going to use the return
> value, and only transfer the return value from the child if it is. It
> kind of breaks the analogy with foreach, but doesn't hurt otherwise,
> so why not.

It's now more analogous to map than foreach, but I don't see that as a
problem.

>
> > Well, if the user wrote
> >
> > my %results;
> > {
> >     my $pl = Parallel::Loops->new;
> >     $pl->share(\%results);
> >     $pl->async(sub { $results{$_} = foobar($_) })
> >         for 0..4;
> > }
> > useResults \%results;
> >
> > then a call to ->joinAll in DESTROY would ensure it was called. Since
> > variables (particularly those containing potentially-expensive objects,
> > like $pl) should be minimally scoped, this would be the correct way to
> > write that code.
>
> I don't understand how that can be guaranteed. perldoc perltoot says:
>
> > Perl's notion of the right time to call a destructor is not well-defined
> > currently, which is why your destructors should not rely on when they
> > are called.
>
> Given that, how can I be sure that DESTROY has been called at the
> time of the useResults call?

Hmm, I'd forgotten that was there. It's complete nonsense: in Perl 5,
destructors are always called promptly, and there are *lots* of modules
relying on that fact so it isn't going to go away. (Perl 6 is a
different matter, of course.)

Ben

From: Willem on
Peter Valdemar Mørch wrote:
)> > my %output;
)> > $pl->tieOutput( \%output );
)>
)> Why are you using tie here?
)
) Hmm... I thought the idea would be more obvious than it apparently
) is...
)
) Outside the $pl->foreach() loop, we're running in the parent process.
) Inside the $pl->foreach() loop, we're running in a child process.
) $pl->tieOutput is actually the raison d'être of Parallel::Loops. When
) the child process has a result, it stores it in %output (which is tied
) with Tie::Hash behind the scenes in the child process).
)
) Behind the scenes, when the child process exits, it sends the results
) (the keys written to %output) back to the parent process's version/
) copy of %output, so that the user of Parallel::Loops doesn't have to
) do any inter-process communication.

Isn't there some easier method, where you don't have to screw around with
output maps at all?

If the following API would work, that would be the easiest, IMO:

my @result = async_map { do_something($_) } @array;

Where async_map takes care of all the details of creating the threads,
gathering all the output, et cetera. Or does that already exist?

(The simple implementation is only a few lines of code, but it could
then be easily extended to use a limited number of threads, or keep
a thread pool handy, or something like that.)


SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !