Fastest way to calculate leading whitespace [Python]


From: Mark Dickinson on 9 May 2010 09:34

On May 9, 6:13 am, Steven D'Aprano <st...(a)REMOVE-THIS-cybersource.com.au> wrote:
> On Sat, 08 May 2010 13:46:59 -0700, Mark Dickinson wrote:
> >> However, s[:-len(t)] should be both faster and correct.
>
> > Unless len(t) == 0, surely?
>
> Doh! The hazards of insufficient testing. Thanks for catching that.

I have a love-hate relationship with the negative index semantics for
exactly this reason: code like 'x[-n]' always seems smelly to me.
It's often not what the code author actually wanted, except when n is
guaranteed strictly positive for some reason. 'x[-1]' is fine, of
course.
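
A minimal sketch of the pitfall discussed above, assuming plain Python strings. The helper names are illustrative and do not come from the thread. When the stripped remainder t is empty, -len(t) is 0, so s[:-len(t)] is s[:0] and yields the empty string rather than all of s. Computing a non-negative end index avoids the problem.

```python
def leading_ws_negative_index(s):
    """Buggy variant: wrong when s is empty or all whitespace."""
    t = s.lstrip()
    return s[:-len(t)]  # -0 == 0, so this is s[:0] == '' when t is empty


def leading_ws_safe(s):
    """Safe variant: the end index is never negative."""
    t = s.lstrip()
    return s[:len(s) - len(t)]


print(repr(leading_ws_negative_index("  abc")))  # '  '
print(repr(leading_ws_negative_index("   ")))    # '' (wrong, should be '   ')
print(repr(leading_ws_safe("   ")))              # '   '
```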

--
Mark

From: dasacc22 on 9 May 2010 14:59

On May 9, 8:28 am, John Machin <sjmac...(a)lexicon.net> wrote:
> dasacc22 <dasacc22 <at> gmail.com> writes:
>
> > You presume entirely too much. I have a preprocessor that normalizes
> > documents while performing other more complex operations. There's
> > nothing buggy about what I'm doing.
>
> Are you sure?
>
> Your "solution" calculates (the number of leading whitespace characters) + (the
> number of TRAILING whitespace characters).
>
> Problem 1: including TRAILING whitespace.
> Example: "content" + 3 * " " + "\n" has 4 leading spaces according to your
> reckoning; should be 0.
> Fix: use lstrip() instead of strip()
>
> Problem 2: assuming all whitespace characters have *effective* width the same as
> " ".
> Examples: TAB has width 4 or 8 or whatever you want it to be. There are quite a
> number of whitespace characters, even when you stick to ASCII. When you look at
> Unicode, there are heaps more. Here's a list of BMP characters such that
> character.isspace() is True, showing the Unicode codepoint, the Python repr(),
> and the name of the character (other than for control characters):
>
> U+0009 u'\t' ?
> U+000A u'\n' ?
> U+000B u'\x0b' ?
> U+000C u'\x0c' ?
> U+000D u'\r' ?
> U+001C u'\x1c' ?
> U+001D u'\x1d' ?
> U+001E u'\x1e' ?
> U+001F u'\x1f' ?
> U+0020 u' ' SPACE
> U+0085 u'\x85' ?
> U+00A0 u'\xa0' NO-BREAK SPACE
> U+1680 u'\u1680' OGHAM SPACE MARK
> U+2000 u'\u2000' EN QUAD
> U+2001 u'\u2001' EM QUAD
> U+2002 u'\u2002' EN SPACE
> U+2003 u'\u2003' EM SPACE
> U+2004 u'\u2004' THREE-PER-EM SPACE
> U+2005 u'\u2005' FOUR-PER-EM SPACE
> U+2006 u'\u2006' SIX-PER-EM SPACE
> U+2007 u'\u2007' FIGURE SPACE
> U+2008 u'\u2008' PUNCTUATION SPACE
> U+2009 u'\u2009' THIN SPACE
> U+200A u'\u200a' HAIR SPACE
> U+200B u'\u200b' ZERO WIDTH SPACE
> U+2028 u'\u2028' LINE SEPARATOR
> U+2029 u'\u2029' PARAGRAPH SEPARATOR
> U+202F u'\u202f' NARROW NO-BREAK SPACE
> U+205F u'\u205f' MEDIUM MATHEMATICAL SPACE
> U+3000 u'\u3000' IDEOGRAPHIC SPACE
>
> Hmmm, looks like all kinds of widths, from zero upwards.

Unfortunately, I mixed up the solution with an example string that would
never actually reach that code in the state I typed it, i.e. with the
trailing whitespace.

That was my mistake.
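
The two problems raised in the quoted message can be checked with a short sketch. The function names and the default tab size of 8 are assumptions made for illustration. Only tabs are expanded here, so every other whitespace character is counted as one column, which is still a simplification for the Unicode spaces listed above.

```python
def leading_count_strip(line):
    # What the criticised approach measures: leading + trailing whitespace.
    return len(line) - len(line.strip())


def leading_count_lstrip(line):
    # Leading whitespace only.
    return len(line) - len(line.lstrip())


def leading_width(line, tabsize=8):
    # Effective column width of the leading whitespace, expanding tabs.
    # tabsize=8 is an assumption; other character widths are not modelled.
    prefix = line[:len(line) - len(line.lstrip())]
    return len(prefix.expandtabs(tabsize))


line = "content" + 3 * " " + "\n"
print(leading_count_strip(line))   # 4
print(leading_count_lstrip(line))  # 0
print(leading_width("\t  x"))      # 10
print(leading_width("\t  x", 4))   # 6
```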

From: Stefan Behnel on 10 May 2010 02:54

dasacc22, 08.05.2010 19:19:
> This is a simple question. I'm looking for the fastest way to
> calculate the leading whitespace (as a string, ie ' ').

Here is an (untested) Cython 0.13 solution:

from cpython.unicode cimport Py_UNICODE_ISSPACE

def leading_whitespace(unicode ustring):
    cdef Py_ssize_t i
    cdef Py_UNICODE uchar

    for i, uchar in enumerate(ustring):
        if not Py_UNICODE_ISSPACE(uchar):
            return ustring[:i]
    return ustring

Cython compiles this to the obvious C code, so this should be impossible to
beat in plain Python code.

However, since Cython 0.13 hasn't been officially released yet (may take
another couple of weeks or so), you'll need to use the current developer
version from here:

http://hg.cython.org/cython-devel

Stefan

From: Stefan Behnel on 10 May 2010 03:25

Stefan Behnel, 10.05.2010 08:54:
> dasacc22, 08.05.2010 19:19:
>> This is a simple question. I'm looking for the fastest way to
>> calculate the leading whitespace (as a string, ie ' ').
>
> Here is an (untested) Cython 0.13 solution:
>
> from cpython.unicode cimport Py_UNICODE_ISSPACE
>
> def leading_whitespace(unicode ustring):
>     cdef Py_ssize_t i
>     cdef Py_UNICODE uchar
>
>     for i, uchar in enumerate(ustring):
>         if not Py_UNICODE_ISSPACE(uchar):
>             return ustring[:i]
>     return ustring
>
> Cython compiles this to the obvious C code, so this should be impossible
> to beat in plain Python code.

... and it is. For a simple string like

u = u" abcdefg" + u"fsdf"*20

timeit gives me this for "s=u.lstrip(); u[:-len(s)]":

1000000 loops, best of 3: 0.404 usec per loop

and this for "leading_whitespace(u)":

10000000 loops, best of 3: 0.0901 usec per loop

It's closer for the extreme case of an all whitespace string like " "*60,
where I get this for the lstrip variant:

1000000 loops, best of 3: 0.277 usec per loop

and this for the Cython code:

10000000 loops, best of 3: 0.177 usec per loop

But I doubt that this is the main use case of the OP.

Stefan
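
For readers without a Cython build, a rough pure-Python harness along the same lines might look like the sketch below. The function names, the sample strings, and the loop counts are assumptions, and str.isspace() stands in for Py_UNICODE_ISSPACE. The lstrip variant uses a non-negative end index instead of the u[:-len(s)] form timed above, because that form returns an empty string for the all-whitespace case. No timings are quoted here, since they depend on the machine and the Python version.

```python
import timeit


def leading_whitespace_lstrip(u):
    # lstrip-based variant with a non-negative end index.
    s = u.lstrip()
    return u[:len(u) - len(s)]


def leading_whitespace_loop(u):
    # Pure-Python counterpart of the character loop.
    for i, ch in enumerate(u):
        if not ch.isspace():
            return u[:i]
    return u


samples = {
    "typical": " abcdefg" + "fsdf" * 20,
    "all whitespace": " " * 60,
}

NUMBER = 20000  # calls per timing run (arbitrary choice)

for label, u in samples.items():
    for func in (leading_whitespace_lstrip, leading_whitespace_loop):
        # Best of 3 runs, reported as microseconds per call.
        best = min(timeit.repeat(lambda: func(u), number=NUMBER, repeat=3))
        print("%-15s %-26s %.3f usec per call"
              % (label, func.__name__, best / NUMBER * 1e6))
```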

From: dasacc22 on 12 May 2010 00:10

On May 10, 2:25 am, Stefan Behnel <stefan...(a)behnel.de> wrote:
> Stefan Behnel, 10.05.2010 08:54:
>
> > dasacc22, 08.05.2010 19:19:
> >> This is a simple question. I'm looking for the fastest way to
> >> calculate the leading whitespace (as a string, ie ' ').
>
> > Here is an (untested) Cython 0.13 solution:
>
> > from cpython.unicode cimport Py_UNICODE_ISSPACE
>
> > def leading_whitespace(unicode ustring):
> >     cdef Py_ssize_t i
> >     cdef Py_UNICODE uchar
>
> >     for i, uchar in enumerate(ustring):
> >         if not Py_UNICODE_ISSPACE(uchar):
> >             return ustring[:i]
> >     return ustring
>
> > Cython compiles this to the obvious C code, so this should be impossible
> > to beat in plain Python code.
>
> ... and it is. For a simple string like
>
> u = u" abcdefg" + u"fsdf"*20
>
> timeit gives me this for "s=u.lstrip(); u[:-len(s)]":
>
> 1000000 loops, best of 3: 0.404 usec per loop
>
> and this for "leading_whitespace(u)":
>
> 10000000 loops, best of 3: 0.0901 usec per loop
>
> It's closer for the extreme case of an all whitespace string like " "*60,
> where I get this for the lstrip variant:
>
> 1000000 loops, best of 3: 0.277 usec per loop
>
> and this for the Cython code:
>
> 10000000 loops, best of 3: 0.177 usec per loop
>
> But I doubt that this is the main use case of the OP.
>
> Stefan

Indeed. I've actually been going back and forth on the idea of using
Cython for some of the more intensive portions. That bit of code looks
really simple, so I think I'll give Cython a shot. The only catch is that
I need to be able to use whatever the latest Cython available via
easy_install is, but this should prove an interesting experience.
