From: John Nagle on
Paul Rubin wrote:
> John Nagle <nagle(a)animats.com> writes:
>> Analysis of each domain is
>> performed in a separate process, but each process uses multiple
>> threads to read and process several web pages simultaneously.
>>
>> Some of the threads go compute-bound for a second or two at a time as
>> they parse web pages.
>
> You're probably better off using separate processes for the different
> pages. If I remember, you were using BeautifulSoup, which while very
> cool, is pretty doggone slow for use on large volumes of pages. I don't
> know if there's much that can be done about that without going off on a
> fairly messy C or C++ coding adventure. Maybe someday someone will do
> that.

I already use separate processes for different domains. I could
live with Python's GIL as long as moving to a multicore server
doesn't make performance worse. That's why I asked about CPU dedication
for each process, to avoid thrashing at the GIL.

There's enough intercommunication between the threads working on
a single site that it's a pain to do them as subprocesses. And I
definitely don't want to launch subprocesses for each page; the
Python load time would be worse than the actual work. The
subprocess module assumes you're willing to launch a subprocess
for each transaction.

The current program organization is that there's a scheduler
process which gets requests, prioritizes them, and runs the requested
domains through the site evaluation mill. The scheduler maintains a
pool of worker processes which get work requests via their input pipes, in Pickle
format, and return results, again in Pickle format. When not in
use, the worker processes sit there dormant, so there's no Python
launch cost for each transaction. If a worker process crashes, the
scheduler replaces it with a fresh one, and every few hundred uses,
each worker process is replaced with a fresh copy, in case Python
has a memory leak. It's a lot like the way
FCGI works.
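
In outline, one worker slot looks roughly like this (a sketch only; the
names are made up and the real code does a lot more bookkeeping):

from multiprocessing import Process, Pipe

def evaluate_domain(request):
    # Stand-in for the real site-evaluation work.
    return {"domain": request, "score": 0}

def worker_loop(conn):
    # Long-lived worker: sits dormant until a pickled request arrives on
    # its pipe, so there is no Python launch cost per transaction.
    while True:
        request = conn.recv()                 # Connection.recv() unpickles
        if request is None:                   # sentinel: shut down cleanly
            break
        conn.send(evaluate_domain(request))   # reply goes back pickled

class WorkerHandle(object):
    MAX_USES = 300          # recycle each worker every few hundred uses

    def __init__(self):
        self.uses = 0
        self._spawn()

    def _spawn(self):
        self.conn, child_conn = Pipe()
        self.proc = Process(target=worker_loop, args=(child_conn,))
        self.proc.start()

    def submit(self, request):
        # Replace a crashed or elderly worker with a fresh copy.
        if not self.proc.is_alive() or self.uses >= self.MAX_USES:
            if self.proc.is_alive():
                self.proc.terminate()
            self.uses = 0
            self._spawn()
        self.conn.send(request)
        self.uses += 1
        return self.conn.recv()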

Scheduling is managed using an in-memory
table in MySQL, so the load can be spread over a cluster if desired,
with a scheduler process on each machine.
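
The claim step against that table is just an atomic UPDATE. Roughly, with
MySQLdb (the table and column names here are invented for illustration):

import uuid
import MySQLdb

conn = MySQLdb.connect(host="localhost", user="scheduler",
                       passwd="notreal", db="crawler")
conn.autocommit(True)

def claim_next_request():
    # work_queue is assumed to be an ENGINE=MEMORY table with columns
    # (id, domain, priority, status, claim_token).
    token = uuid.uuid4().hex
    cur = conn.cursor()
    cur.execute("UPDATE work_queue SET status='running', claim_token=%s"
                " WHERE status='pending' ORDER BY priority DESC LIMIT 1",
                (token,))
    if cur.rowcount == 0:
        return None            # nothing pending right now
    cur.execute("SELECT id, domain FROM work_queue WHERE claim_token=%s",
                (token,))
    return cur.fetchone()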

So I already have a scalable architecture. The only problem
is excess overhead on multicore CPUs.

John Nagle
From: Steve Holden on
John Nagle wrote:
> Paul Rubin wrote:
>> John Nagle <nagle(a)animats.com> writes:
>>> Analysis of each domain is
>>> performed in a separate process, but each process uses multiple
>>> threads to read and process several web pages simultaneously.
>>>
>>> Some of the threads go compute-bound for a second or two at a time as
>>> they parse web pages.
>>
>> You're probably better off using separate processes for the different
>> pages. If I remember, you were using BeautifulSoup, which while very
>> cool, is pretty doggone slow for use on large volumes of pages. I don't
>> know if there's much that can be done about that without going off on a
>> fairly messy C or C++ coding adventure. Maybe someday someone will do
>> that.
>
> I already use separate processes for different domains. I could
> live with Python's GIL as long as moving to a multicore server
> doesn't make performance worse. That's why I asked about CPU dedication
> for each process, to avoid thrashing at the GIL.
>
I believe it's already been said that the GIL thrashing is mostly MacOS
specific. You might also find something in the affinity module

http://pypi.python.org/pypi/affinity/0.1.0

to ensure that each process in your pool runs on only one processor.
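
If I recall correctly, usage is something like this (untested; the mask is
a bit mask of allowed CPUs, so 1 means CPU 0 only):

import os
import affinity

# Bit N of the mask corresponds to CPU N, so 1 pins to CPU 0,
# 2 pins to CPU 1, 3 allows CPUs 0 and 1, and so on.
affinity.set_process_affinity_mask(os.getpid(), 1)
print(affinity.get_process_affinity_mask(os.getpid()))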

regards
Steve
--
Steve Holden +1 571 484 6266 +1 800 494 3119
PyCon is coming! Atlanta, Feb 2010 http://us.pycon.org/
Holden Web LLC http://www.holdenweb.com/
UPCOMING EVENTS: http://holdenweb.eventbrite.com/

From: John Nagle on
Steve Holden wrote:
> John Nagle wrote:
>> Paul Rubin wrote:
>>> John Nagle <nagle(a)animats.com> writes:
>>>> Analysis of each domain is
>>>> performed in a separate process, but each process uses multiple
>>>> threads to read and process several web pages simultaneously.
>>>>
>>>> Some of the threads go compute-bound for a second or two at a time as
>>>> they parse web pages.
>>> You're probably better off using separate processes for the different
>>> pages. If I remember, you were using BeautifulSoup, which while very
>>> cool, is pretty doggone slow for use on large volumes of pages. I don't
>>> know if there's much that can be done about that without going off on a
>>> fairly messy C or C++ coding adventure. Maybe someday someone will do
>>> that.
>> I already use separate processes for different domains. I could
>> live with Python's GIL as long as moving to a multicore server
>> doesn't make performance worse. That's why I asked about CPU dedication
>> for each process, to avoid thrashing at the GIL.
>>
> I believe it's already been said that the GIL thrashing is mostly MacOS
> specific. You might also find something in the affinity module

No, the original analysis was MacOS oriented, but the same mechanism
applies for fighting over the GIL on all platforms. There was some
pontification that it might be a MacOS-only issue, but no facts
were presented. The contention might be cheaper on C implementations
whose mutexes avoid system calls in the non-blocking (uncontended) case.

John Nagle
From: Paul Rubin on
John Nagle <nagle(a)animats.com> writes:
> There's enough intercommunication between the threads working on
> a single site that it's a pain to do them as subprocesses. And I
> definitely don't want to launch subprocesses for each page; the
> Python load time would be worse than the actual work. The
> subprocess module assumes you're willing to launch a subprocess
> for each transaction.

Why not just use socketserver and have something like a fastcgi?
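
Rough sketch of what I mean (SocketServer on 2.x; untested):

import pickle
try:
    import socketserver                    # Python 3
except ImportError:
    import SocketServer as socketserver    # Python 2

class WorkerHandler(socketserver.StreamRequestHandler):
    def handle(self):
        # One pickled request per connection; the process stays resident
        # between requests, so there is no per-page launch cost.
        request = pickle.load(self.rfile)
        result = {"echo": request}         # real page analysis goes here
        pickle.dump(result, self.wfile)

if __name__ == "__main__":
    server = socketserver.ThreadingTCPServer(("127.0.0.1", 9999),
                                             WorkerHandler)
    server.serve_forever()
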
From: Anh Hai Trinh on
On Feb 4, 10:46 am, John Nagle <na...(a)animats.com> wrote:
>
>     There's enough intercommunication between the threads working on
> a single site that it's a pain to do them as subprocesses. And I
> definitely don't want to launch subprocesses for each page; the
> Python load time would be worse than the actual work.  The
> subprocess module assumes you're willing to launch a subprocess
> for each transaction.

You could perhaps use a process pool inside each domain worker to work
on the pages? There is multiprocessing.Pool, among other
implementations.
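
A bare-bones sketch of that idea (parse_page here just stands in for the
real BeautifulSoup work):

from multiprocessing import Pool

def parse_page(html):
    # CPU-bound parsing runs in a worker process, so it never fights
    # the fetching threads for the GIL.
    return len(html)          # stand-in for real BeautifulSoup analysis

if __name__ == "__main__":
    pool = Pool(processes=4)                  # one pool per domain worker
    pages = ["<html>one</html>", "<html>two</html>"]
    print(pool.map(parse_page, pages))        # blocks until all are parsed
    pool.close()
    pool.join()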

For example, in this library, you can s/ThreadPool/ProcessPool/g and
this example would work:
<http://www.onideas.ws/stream.py/#retrieving-web-pages-concurrently>.

If you want to DIY, with multiprocessing.Lock/Pipe/Queue, I don't
understand why it would be more of a pain to write your threads as
processes.


// aht
http://blog.onideas.ws