From: nolo contendere on
Scenario:
I am expecting 3 files in a drop directory. They won't
necessarily all arrive at the same time. I want to begin processing
each file as soon as it arrives (or as close to arrival time as is
reasonable). Would the best way to go about this be to simply have a
script that takes a filename as a parameter and marks the file as
'currently processing' when it begins to process the file (or could
move the file to a different directory)?

I could kick off 3 daemon processes looking in the drop directory, and
sleep every 5 secs, for instance.

That seems to me to be a straightforward, if clumsy, approach. I was
wondering if there was a module that could accomplish this task more
elegantly--Parallel::ForkManager, at least in my experience, doesn't
seem entirely suited to this particular task.

Or I could code my own fork, exec, wait/waitpid.

I know TMTOWTDI, but I was seeking to benefit from others' experience,
and for a 'best practice'.

Sorry there's no tangible code; this is more of a conceptual question
I guess.
From: Peter Makholm on
nolo contendere <simon.chao(a)fmr.com> writes:

> I know TMTOWTDI, but I was seeking to benefit from others' experience,
> and for a 'best practice'.

If portability isn't an issue, your platform might support some kind of
monitoring of parts of the filesystem. Then you can get events when
files are created in your spool directory or moved there.

Linux::Inotify2 is a Linux-only solution I'm using for a couple of
scripts. Another usable module could be SGI::FAM, which should be
supported on a broader range of unices.

I have been looking for something like Net::Server for spool dirs a
couple of times without finding anything really useful.

//Makholm
From: xhoster on
nolo contendere <simon.chao(a)fmr.com> wrote:
> Scenario:
> I am expecting 3 files in a drop directory. They won't
> necessarily all arrive at the same time. I want to begin processing
> each file as soon as it arrives (or as close to arrival time as is
> reasonable).

What is the relationship between the 3 files? Presumably, this whole
thing will happen more than once, right, otherwise you wouldn't need
to automate it? So what is the difference between "3 files show up,
and that happens 30 times" and just "90 files show up"?

> Would the best way to go about this be to simply have a
> script that takes a filename as a parameter and marks the file as
> 'currently processing' when it begins to process the file (or could
> move the file to a different directory)?
>
> I could kick off 3 daemon processes looking in the drop directory, and
> sleep every 5 secs, for instance.

Do the file's contents show up atomically with the file's name? If not,
the process could see the file is there and start processing it, even
though it is not completely written yet.
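
If you control the sending side, one common way to make a file's name and contents appear together is to write under a temporary name and rename it into place once it is complete. A minimal sketch, assuming the temporary name and the final name live on the same filesystem (so that rename() is atomic); deliver_atomically and the ".tmp.$$" naming are hypothetical, not something from the original post:

```perl
use strict;
use warnings;
use File::Basename qw(dirname basename);

# Hypothetical helper: write $data so that $final only appears once it is
# complete. Assumes the temporary file and $final are on the same
# filesystem, so the rename() is atomic.
sub deliver_atomically {
    my ($final, $data) = @_;

    # Hidden temporary name in the same directory as the final file.
    my $tmp = dirname($final) . '/.' . basename($final) . ".tmp.$$";

    open my $out, '>', $tmp or die "Cannot open $tmp: $!";
    print {$out} $data      or die "Cannot write $tmp: $!";
    close $out              or die "Cannot close $tmp: $!";

    # The final name shows up in one step, with all of its contents.
    rename $tmp, $final     or die "Cannot rename $tmp to $final: $!";
    return $final;
}
```

With this scheme the watching process has to skip the temporary names (here, anything starting with a dot) when it scans the drop directory.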

>
> That seems to me to be a straightforward, if clumsy, approach. I was
> wondering if there was a module that could accomplish this task more
> elegantly--Parallel::ForkManager, at least in my experience, doesn't
> seem entirely suited to this particular task.

Why don't you think it is suited? It seems well suited, unless there
are details that I am missing (or maybe you are on Windows or something where
forking isn't as robust).

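use strict;
use warnings;
use Parallel::ForkManager;

my $pm = Parallel::ForkManager->new(3);   # at most 3 children at once
foreach my $file (@ARGV) {
    $pm->start() and next;   # parent: go on to the next file
    process($file);          # child: wait for $file, then process it
    $pm->finish();           # child exits here
}
$pm->wait_all_children();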

where the process() subroutine first waits for the named $file to exist,
then processes it.
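
The waiting half of process() could look roughly like this. It is only a sketch: wait_for_file, the one-hour timeout and the 5-second default interval (borrowed from the original post) are assumptions, and the real processing is left as a placeholder.

```perl
use strict;
use warnings;
use Time::HiRes qw(sleep time);

# Hypothetical helper: poll until $file exists or $timeout seconds have
# passed. Returns 1 if the file showed up, 0 on timeout.
sub wait_for_file {
    my ($file, $timeout, $interval) = @_;
    $interval = 5 unless defined $interval;   # assumed default poll interval
    my $deadline = time() + $timeout;

    until (-e $file) {
        return 0 if time() >= $deadline;
        sleep $interval;
    }
    return 1;
}

# Hypothetical process(): wait for the file, then do the real work.
sub process {
    my ($file) = @_;
    wait_for_file($file, 3600) or die "Gave up waiting for $file\n";
    # ... real processing of $file goes here ...
    return;
}
```

Note that -e only says the name exists. If the contents do not arrive atomically with the name, the question above about partially written files still applies.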

Xho

From: Ted Zlatanov on
On Wed, 23 Apr 2008 11:29:42 -0700 (PDT) nolo contendere <simon.chao(a)fmr.com> wrote:

nc> I am expecting 3 files in a drop directory. They won't
nc> necessarily all arrive at the same time. I want to begin processing
nc> each file as soon as it arrives (or as close to arrival time as is
nc> reasonable). Would the best way to go about this be to simply have a
nc> script that takes a filename as a parameter and marks the file as
nc> 'currently processing' when it begins to process the file (or could
nc> move the file to a different directory)?

nc> I could kick off 3 daemon processes looking in the drop directory, and
nc> sleep every 5 secs, for instance.

nc> That seems to me to be a straightforward, if clumsy, approach. I was
nc> wondering if there was a module that could accomplish this task more
nc> elegantly--Parallel::ForkManager, at least in my experience, doesn't
nc> seem entirely suited to this particular task.

nc> Or I could code my own fork, exec, wait/waitpid.

Get Tie::ShareLite from CPAN.

In each process, lock a shared hash and insert an entry for the new file
when it's noticed in the idle loop. If the file already exists in the
hash, do nothing. The first process to notice the file wins.

Now, unlock the hash and work with the file. When done, move the file
out, lock the hash again, and remove the entry you inserted.

The advantage is that you can store much more in the hash than just the
filename, so this is handy for complex processing. Also, no file
renaming is needed.

A simpler version is just to rename the file to "$file.$$" where $$ is
your PID. If, after the rename, the renamed file is there, your process
won against the others and you can work with the file. Note there could
be name collisions with an existing file, but since PIDs are unique on
the machine, you can just remove that bogus file. Just be aware this is
the quick and dirty solution.
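
A minimal sketch of the rename approach, assuming a local filesystem where rename() either succeeds for exactly one process or fails because the source name is already gone. It uses the return value of rename() rather than a separate existence check; claim_file is a hypothetical name.

```perl
use strict;
use warnings;

# Hypothetical helper: try to claim $file by renaming it to "$file.$$".
# Returns the new name if this process won the race, or undef if the
# rename failed (for example because another process renamed it first).
sub claim_file {
    my ($file) = @_;
    my $claimed = "$file.$$";
    return rename($file, $claimed) ? $claimed : undef;
}
```

A worker would call claim_file() on each candidate it sees in the drop directory and only process the files for which it gets a name back. The scan should skip names that already carry a PID suffix.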

Another approach is to use a Maildir structure, which can handle
multiple readers and writers atomically, even over NFS. You just need
to map your incoming queue into a Maildir structure; there's no need to
actually have mail in the files. This is good if you expect high volume,
network access, or similar complications beyond your original model.
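
Roughly, a Maildir-style queue has three subdirectories: writers build a file in tmp/ and rename it into new/, and a reader claims a file by renaming it from new/ into cur/. The sketch below is a simplification under stated assumptions: init_queue, enqueue and claim are hypothetical names, the unique-name scheme is only an approximation of what Maildir writers use, and the flag suffix that real Maildir files carry in cur/ is left out.

```perl
use strict;
use warnings;
use File::Path qw(make_path);
use Sys::Hostname qw(hostname);

# Create the three Maildir subdirectories under $root.
sub init_queue {
    my ($root) = @_;
    make_path(map { "$root/$_" } qw(tmp new cur));
    return $root;
}

my $seq = 0;   # per-process counter, keeps names unique within one second

# Writer: build the file in tmp/, then rename it into new/ so that readers
# never see a partially written file.
sub enqueue {
    my ($root, $data) = @_;
    my $name = join '.', time(), "P$$" . 'Q' . $seq++, hostname();

    open my $out, '>', "$root/tmp/$name" or die "Cannot open tmp/$name: $!";
    print {$out} $data                   or die "Cannot write tmp/$name: $!";
    close $out                           or die "Cannot close tmp/$name: $!";

    rename "$root/tmp/$name", "$root/new/$name"
        or die "Cannot move $name into new/: $!";
    return $name;
}

# Reader: claim a file by renaming it from new/ to cur/. Returns the path
# in cur/ on success, or undef if another reader got there first.
sub claim {
    my ($root, $name) = @_;
    my $dest = "$root/cur/$name";
    return rename("$root/new/$name", $dest) ? $dest : undef;
}
```

Readers list new/, call claim() on each name, and work only on the files they successfully moved into cur/.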

Ted
From: Ben Morrow on

Quoth Peter Makholm <peter(a)makholm.net>:
> nolo contendere <simon.chao(a)fmr.com> writes:
>
> > I know TMTOWTDI, but I was seeking to benefit from others' experience,
> > and for a 'best practice'.
>
> If portability isn't an issue, your platform might support some kind of
> monitoring of parts of the filesystem. Then you can get events when
> files are created in your spool directory or moved there.
>
> Linux::Inotify2 is a Linux-only solution I'm using for a couple of
> scripts. Another usable module could be SGI::FAM, which should be
> supported on a broader range of unices.

SGI::FAM only works under Irix. I've been meaning to port it to other
systems that support fam (and gamin, the GNOME rewrite) but haven't got
round to it yet. There is Sys::Gamin, but it doesn't have any tests and
doesn't appear to be maintained.

Other OS-specific alternatives include IO::KQueue for BSDish systems,
and Win32::ChangeNotify for Win32. This seems like a perfect opportunity
for someone to write an OS-independent wrapper module, but AFAIK no-one
has yet.

Ben