From: Stefan Behnel on
Paul Rubin, 04.02.2010 02:51:
> John Nagle writes:
>> Analysis of each domain is
>> performed in a separate process, but each process uses multiple
>> threads to read and process several web pages simultaneously.
>>
>> Some of the threads go compute-bound for a second or two at a time as
>> they parse web pages.
>
> You're probably better off using separate processes for the different
> pages. If I remember, you were using BeautifulSoup, which while very
> cool, is pretty doggone slow for use on large volumes of pages. I don't
> know if there's much that can be done about that without going off on a
> fairly messy C or C++ coding adventure. Maybe someday someone will do
> that.

Well, if multi-core performance is so important here, then there's a pretty
simple thing the OP can do: switch to lxml.

http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/
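
For what it's worth, a minimal sketch of what that switch might look like
(assuming lxml is installed; the helper name and URL handling are just for
illustration):

# Minimal sketch of parsing a page with lxml.html; lxml.html is built for
# real-world HTML, not just well-formed XML.
import lxml.html

def extract_links(html_text, base_url):
    # Parse the (possibly messy) HTML into an element tree.
    doc = lxml.html.fromstring(html_text)
    # Turn relative hrefs into absolute URLs against the page's address.
    doc.make_links_absolute(base_url)
    # iterlinks() yields (element, attribute, link, pos) tuples.
    return [link for element, attribute, link, pos in doc.iterlinks()]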

Stefan
From: Paul Rubin on
Stefan Behnel <stefan_ml(a)behnel.de> writes:
> Well, if multi-core performance is so important here, then there's a pretty
> simple thing the OP can do: switch to lxml.
>
> http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/

Well, lxml uses libxml2, a fast XML parser written in C, but AFAIK it
only works on well-formed XML. The point of Beautiful Soup is that it
works on all kinds of garbage hand-written legacy HTML with mismatched
tags and other sorts of errors. Beautiful Soup is slower because it's
full of special cases and hacks for that reason, and because it's written
in Python. Writing something that complex in C to handle so much
potentially malicious input would be quite a lot of work, and it would be
very difficult to ensure it was really safe. Look at the many browser
vulnerabilities we've seen over the years due to that sort of problem,
for example. But, for web crawling, you really do need to handle the
messy and wrong HTML properly.
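
To illustrate the kind of tolerance at issue, a minimal Beautiful Soup 3
sketch (the sample markup is made up): it builds a usable tree from badly
nested, unclosed tags without raising an error.

from BeautifulSoup import BeautifulSoup   # BeautifulSoup 3.x, Python 2

broken = "<html><body><p>First<p>Second <b>bold <i>nested</b> wrong</i>"
soup = BeautifulSoup(broken)   # parses without raising, despite the mess
print(soup.prettify())         # a repaired, properly nested tree
print(soup.findAll('p'))       # both paragraphs are recovered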

From: Antoine Pitrou on
On Tue, 02 Feb 2010 15:02:49 -0800, John Nagle wrote:
> I know there's a performance penalty for running Python on a multicore
> CPU, but how bad is it? I've read the key paper
> ("www.dabeaz.com/python/GIL.pdf"), of course. It would be adequate if
> the GIL just limited Python to running on one CPU at a time, but it's
> worse than that; there's excessive overhead due to a lame locking
> implementation. Running CPU-bound multithreaded code on a dual-core CPU
> runs HALF AS FAST as on a single-core CPU, according to Beasley.

That happens on certain types of workloads, and perhaps only on certain
OSes, so you should try benchmarking your own workload to see whether it
applies.
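
A quick-and-dirty way to do that benchmark (work() is just a stand-in for
your real page-parsing code):

import threading
import time

def work():
    # Stand-in for a CPU-bound task such as parsing a page.
    total = 0
    for i in range(2 * 10 ** 6):
        total += i * i
    return total

def timed(fn):
    start = time.time()
    fn()
    return time.time() - start

def sequential():
    work()
    work()

def threaded():
    threads = [threading.Thread(target=work) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

print("sequential: %.2fs" % timed(sequential))
print("2 threads:  %.2fs" % timed(threaded))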

Two closing remarks:
- this should (hopefully) be fixed in 3.2, as exarkun noticed
- instead of spawning one thread per Web page, you could use Twisted or
another event loop mechanism to process pages serially, in their order of
arrival (see the sketch below)
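
A minimal sketch of that second suggestion, using Twisted's old-style
twisted.web.client.getPage (current at the time of writing); the URLs are
placeholders:

from twisted.internet import reactor
from twisted.internet.defer import DeferredList
from twisted.web.client import getPage

URLS = ["http://example.com/", "http://example.org/"]   # placeholders

def handle_page(body, url):
    # CPU-bound parsing would happen here (or be handed off to a
    # worker process); the reactor itself never blocks on I/O.
    print("%s: %d bytes" % (url, len(body)))

def handle_error(failure, url):
    print("%s failed: %s" % (url, failure.getErrorMessage()))

deferreds = []
for url in URLS:
    d = getPage(url)
    d.addCallback(handle_page, url)
    d.addErrback(handle_error, url)
    deferreds.append(d)

# Stop the reactor once every fetch has finished, one way or the other.
DeferredList(deferreds).addCallback(lambda _: reactor.stop())
reactor.run()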

Regards

Antoine.


From: J Kenneth King on
Paul Rubin <no.email(a)nospam.invalid> writes:

> Stefan Behnel <stefan_ml(a)behnel.de> writes:
>> Well, if multi-core performance is so important here, then there's a pretty
>> simple thing the OP can do: switch to lxml.
>>
>> http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/
>
> Well, lxml uses libxml2, a fast XML parser written in C, but AFAIK it
> only works on well-formed XML. The point of Beautiful Soup is that it
> works on all kinds of garbage hand-written legacy HTML with mismatched
> tags and other sorts of errors. Beautiful Soup is slower because it's
> full of special cases and hacks for that reason, and because it's written
> in Python. Writing something that complex in C to handle so much
> potentially malicious input would be quite a lot of work, and it would be
> very difficult to ensure it was really safe. Look at the many browser
> vulnerabilities we've seen over the years due to that sort of problem,
> for example. But, for web crawling, you really do need to handle the
> messy and wrong HTML properly.

If the difference is great enough, you might get a benefit from
analyzing all pages with lxml and throwing invalid pages into a bucket
for later processing with BeautifulSoup.
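
Something along these lines, perhaps (handle_tree and handle_soup are
made-up placeholders; note that lxml's HTML parser is itself forgiving
enough that the bucket tends to stay small):

import lxml.html
from lxml.etree import ParserError, XMLSyntaxError
from BeautifulSoup import BeautifulSoup   # BeautifulSoup 3.x

def handle_tree(url, tree):
    print("lxml parsed %s (%d links)" % (url, len(tree.findall('.//a'))))

def handle_soup(url, soup):
    print("BeautifulSoup parsed %s" % url)

def parse_pages(pages):
    """pages: iterable of (url, html_text) pairs."""
    bucket = []                       # pages lxml couldn't make sense of
    for url, text in pages:
        try:
            tree = lxml.html.fromstring(text)
        except (ParserError, XMLSyntaxError):
            bucket.append((url, text))
            continue
        handle_tree(url, tree)        # fast path

    for url, text in bucket:          # slow second pass, leftovers only
        handle_soup(url, BeautifulSoup(text))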
From: John Krukoff on
On Mon, 2010-02-08 at 01:10 -0800, Paul Rubin wrote:
> Stefan Behnel <stefan_ml(a)behnel.de> writes:
> > Well, if multi-core performance is so important here, then there's a pretty
> > simple thing the OP can do: switch to lxml.
> >
> > http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/
>
> Well, lxml uses libxml2, a fast XML parser written in C, but AFAIK it
> only works on well-formed XML. The point of Beautiful Soup is that it
> works on all kinds of garbage hand-written legacy HTML with mismatched
> tags and other sorts of errors. Beautiful Soup is slower because it's
> full of special cases and hacks for that reason, and because it's written
> in Python. Writing something that complex in C to handle so much
> potentially malicious input would be quite a lot of work, and it would be
> very difficult to ensure it was really safe. Look at the many browser
> vulnerabilities we've seen over the years due to that sort of problem,
> for example. But, for web crawling, you really do need to handle the
> messy and wrong HTML properly.
>

Actually, lxml has an HTML parser which does pretty well with the
standard level of brokenness one finds most often on the web. And, when
it falls down, it's easy to integrate BeautifulSoup as a slow backup for
when things go really wrong (as J Kenneth King mentioned earlier):

http://codespeak.net/lxml/lxmlhtml.html#parsing-html
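
In code, the fallback is only a few lines (lxml.html.soupparser ships with
lxml and uses BeautifulSoup underneath, so BeautifulSoup must be installed):

import lxml.html
import lxml.html.soupparser
from lxml.etree import ParserError, XMLSyntaxError

def parse_html(text):
    try:
        return lxml.html.fromstring(text)             # fast C parser
    except (ParserError, XMLSyntaxError):
        return lxml.html.soupparser.fromstring(text)  # slow but tolerant

Both paths return an lxml element tree, so the code downstream doesn't need
to care which parser was used.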

In my experience, though, I haven't yet had to parse anything that lxml
couldn't handle.
--
John Krukoff <jkrukoff(a)ltgc.com>
Land Title Guarantee Company