From: Martin v. Loewis
John Nagle wrote:
> I know there's a performance penalty for running Python on a
> multicore CPU, but how bad is it? I've read the key paper
> ("www.dabeaz.com/python/GIL.pdf"), of course. It would be adequate
> if the GIL just limited Python to running on one CPU at a time,
> but it's worse than that; there's excessive overhead due to
> a lame locking implementation. Running CPU-bound multithreaded
> code on a dual-core CPU runs HALF AS FAST as on a single-core
> CPU, according to Beazley.

I couldn't reproduce these results on Linux. Not sure what "HALF AS
FAST" is; I suppose it means "it runs TWICE AS LONG" - this is what I
couldn't reproduce.

If I run Beazley's program on Linux 2.6.26, on a 4 processor Xeon (3GHz)
machine, I get 30s for the sequential execution, 40s for the
multi-threaded case, and 32s for the multi-threaded case when pinning
the Python process to a single CPU (using taskset(1)).

So it's 6% overhead for threading, and 25% penalty for multicore CPUs -
far from the 100% you seem to expect.
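[For readers who want to retry this themselves: Beazley's benchmark is essentially a pure-Python CPU-bound countdown, run once sequentially and once split across two threads. A minimal sketch of that shape - the function name and the count below are illustrative, not copied from his slides:]

```python
import time
from threading import Thread

def countdown(n):
    # Pure-Python CPU-bound loop; holds the GIL for nearly its entire run.
    while n > 0:
        n -= 1

N = 5_000_000  # Beazley's slides use a much larger count; kept small here

# Sequential baseline: two countdowns back to back.
start = time.time()
countdown(N)
countdown(N)
seq = time.time() - start

# Threaded: the same total work split across two threads.
t1 = Thread(target=countdown, args=(N,))
t2 = Thread(target=countdown, args=(N,))
start = time.time()
t1.start(); t2.start()
t1.join(); t2.join()
thr = time.time() - start

print(f"sequential: {seq:.2f}s  threaded: {thr:.2f}s")
```

[On a multicore machine the threaded time is what pinning to one CPU (e.g. with taskset) changes.]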

Regards,
Martin
From: Ryan Kelly
On Sun, 2010-02-21 at 22:22 +0100, Martin v. Loewis wrote:
> John Nagle wrote:
> > I know there's a performance penalty for running Python on a
> > multicore CPU, but how bad is it? I've read the key paper
> > ("www.dabeaz.com/python/GIL.pdf"), of course. It would be adequate
> > if the GIL just limited Python to running on one CPU at a time,
> > but it's worse than that; there's excessive overhead due to
> > a lame locking implementation. Running CPU-bound multithreaded
> > code on a dual-core CPU runs HALF AS FAST as on a single-core
> > CPU, according to Beazley.
>
> I couldn't reproduce these results on Linux. Not sure what "HALF AS
> FAST" is; I suppose it means "it runs TWICE AS LONG" - this is what I
> couldn't reproduce.
>
> If I run Beazley's program on Linux 2.6.26, on a 4 processor Xeon (3GHz)
> machine, I get 30s for the sequential execution, 40s for the
> multi-threaded case, and 32s for the multi-threaded case when pinning
> the Python process to a single CPU (using taskset(1)).
>
> So it's 6% overhead for threading, and 25% penalty for multicore CPUs -
> far from the 100% you seem to expect.

It's far from scientific, but I've seen behaviour that's close to a 100%
performance penalty on a dual-core linux system:

http://www.rfk.id.au/blog/entry/a-gil-adventure-threading2

Short story: a particular test suite of mine used to run in around 25
seconds, but a bit of ctypes magic to set thread affinity dropped the
running time to under 13 seconds.
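[On Python 3.3+ the ctypes magic is no longer needed on Linux: the stdlib exposes the affinity syscall directly. A sketch of the same pinning trick - this is not Ryan's actual code, and the API is Linux-only:]

```python
import os

def pin_to_one_cpu():
    """Restrict this process (and all its threads) to a single CPU,
    so GIL hand-offs stop bouncing between cores.

    Returns the CPU set actually in effect, or None where the
    affinity API is unavailable (e.g. macOS, Windows)."""
    if not hasattr(os, "sched_setaffinity"):
        return None
    cpus = sorted(os.sched_getaffinity(0))
    os.sched_setaffinity(0, {cpus[0]})  # pin to the lowest available CPU
    return os.sched_getaffinity(0)

affinity = pin_to_one_cpu()
print("pinned to:", affinity)
```

[Calling this once at the start of a test run has the same effect as launching the process under taskset(1).]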


Cheers,

Ryan

--
Ryan Kelly
http://www.rfk.id.au | This message is digitally signed. Please visit
ryan@rfk.id.au | http://www.rfk.id.au/ramblings/gpg/ for details

From: Martin v. Loewis
> It's far from scientific, but I've seen behaviour that's close to a 100%
> performance penalty on a dual-core linux system:
>
> http://www.rfk.id.au/blog/entry/a-gil-adventure-threading2
>
> Short story: a particular test suite of mine used to run in around 25
> seconds, but a bit of ctypes magic to set thread affinity dropped the
> running time to under 13 seconds.

Indeed, it's not scientific - but with a few more details, you could
improve it quite a lot: what specific Linux distribution (the posting
doesn't even say it's Linux), what specific Python version had you been
using? (less important) what CPUs? If you can: what specific test suite?

A lot of science is about repeatability. Making a systematic study is
(IMO) over-valued - anecdotal reports are useful, too, as long as they
allow for repeatable experiments.

Regards,
Martin
From: Ryan Kelly
On Sun, 2010-02-21 at 23:05 +0100, Martin v. Loewis wrote:
> > It's far from scientific, but I've seen behaviour that's close to a 100%
> > performance penalty on a dual-core linux system:
> >
> > http://www.rfk.id.au/blog/entry/a-gil-adventure-threading2
> >
> > Short story: a particular test suite of mine used to run in around 25
> > seconds, but a bit of ctypes magic to set thread affinity dropped the
> > running time to under 13 seconds.
>
> Indeed, it's not scientific - but with a few more details, you could
> improve it quite a lot: what specific Linux distribution (the posting
> doesn't even say it's Linux), what specific Python version had you been
> using? (less important) what CPUs? If you can: what specific test suite?

I'm on Ubuntu Karmic, Python 2.6.4, an AMD Athlon 7750 dual core.

Unfortunately the test suite is for a proprietary application. I've
been able to reproduce similar behaviour with an open-source test suite,
using the current trunk of the "pyfilesystem" project:

http://code.google.com/p/pyfilesystem/


In this project "OSFS" is an object-oriented interface to the local
filesystem. The test case "TestOSFS.test_cases_in_separate_dirs" runs
three threads, each doing a bunch of IO in a different directory.
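[The shape of that test case - as a hypothetical stand-in, not pyfilesystem's actual code - is roughly: three threads, each writing and reading back a batch of files in its own directory:]

```python
import os
import tempfile
import threading

def do_io(dirpath, nfiles=50):
    # Each thread writes then re-reads a batch of small files in its own dir.
    for i in range(nfiles):
        path = os.path.join(dirpath, f"f{i}.txt")
        with open(path, "w") as f:
            f.write("x" * 1024)
        with open(path) as f:
            assert f.read()

with tempfile.TemporaryDirectory() as root:
    dirs = []
    for name in ("a", "b", "c"):
        d = os.path.join(root, name)
        os.mkdir(d)
        dirs.append(d)
    threads = [threading.Thread(target=do_io, args=(d,)) for d in dirs]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    counts = [len(os.listdir(d)) for d in dirs]
print("files per directory:", counts)
```

[Because each thread spends much of its time in IO system calls that release the GIL, the threads contend for the lock on every return to Python code - exactly the pattern where core-bouncing hurts.]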

Running the tests normally:

rfk@durian:/storage/software/fs$ nosetests fs/tests/test_fs.py:TestOSFS.test_cases_in_separate_dirs
.
----------------------------------------------------------------------
Ran 1 test in 9.787s


That's the best result from five runs - I saw it go as high as 12
seconds. Watching it in top, I see CPU usage at around 150%.

Now using threading2 to set the process cpu affinity at the start of the
test run:

rfk@durian:/storage/software/fs$ nosetests fs/tests/test_fs.py:TestOSFS.test_cases_in_separate_dirs
.
----------------------------------------------------------------------
Ran 1 test in 3.792s


Again, best of five. The variability in times here is much lower - I
never saw it go above 4 seconds. CPU usage is consistently 100%.



Cheers,

Ryan
