Fastest way to calculate leading whitespace [Python]


From: dasacc22 on 8 May 2010 13:19

Hi

This is a simple question. I'm looking for the fastest way to
calculate the leading whitespace (as a string, i.e. ' ').

Here are some different methods I have tried so far:
--- solution 1

a = ' some content\n'
b = a.strip()
c = ' '*(len(a)-len(b))

--- solution 2

a = ' some content\n'
b = a.strip()
c = a.partition(b[0])[0]

--- solution 3

def get_leading_whitespace(s):
    def _get():
        for x in s:
            if x != ' ':
                break
            yield x
    return ''.join(_get())

---

Solution 1 seems to be about as fast as solution 2 except in certain
circumstances where the value of b has already been determined for
other purposes. Solution 3 is slower due to the function overhead.

Curious to see what other types of solutions people might have.

Thanks,
Daniel

From: Patrick Maupin on 8 May 2010 13:59

On May 8, 12:19 pm, dasacc22 <dasac...(a)gmail.com> wrote:
> [...]

Well, you could try a solution using re, but that is only likely to be
faster if you can run it over multiple concatenated lines at once.
I usually use something like your solution #1. One thing to be aware
of, though, is that strip() with no parameters will strip *any*
whitespace, not just spaces, so the implicit assumption in your code
that what you have stripped is spaces may not be justified (depending
on the source data). OTOH, depending on how you use that whitespace
information, it may not really matter. But if it does matter, you can
use strip(' ').
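
To make the difference concrete, here is a small sketch comparing what
strip() and strip(' ') remove. The sample string is made up for
illustration and is not from the original post.

```python
# Hypothetical sample line: two spaces, one tab, text, trailing newline.
a = "  \tsome content\n"

no_arg = a.strip()          # removes spaces, tabs, newlines, etc. on both ends
spaces_only = a.strip(' ')  # removes only space characters on both ends

print(repr(no_arg))
print(repr(spaces_only))
```

With strip(' ') the tab and the trailing newline stay in the result, so
a length difference against the original would count spaces only.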

If speed is really an issue for you, you could also investigate
mxtexttools, but, like re, it might perform better if the source
consists of several batched lines.
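
As a rough sketch of the batched idea with re (the pattern, the helper
name and the sample text are assumptions for illustration, and no speed
claim is implied), a single findall() call can collect the leading
whitespace of every line in a block of text:

```python
import re

# With re.MULTILINE, ^ matches at the start of each line, so one pass over
# the whole text yields the leading run of spaces/tabs for every line.
LEADING_WS = re.compile(r'^[ \t]*', re.MULTILINE)

def leading_whitespace_per_line(text):
    """Return a list holding the leading spaces/tabs of each line in text."""
    return LEADING_WS.findall(text)

# Hypothetical batch of three lines, without a trailing newline.
text = "  first\n\tsecond\nthird"
print(leading_whitespace_per_line(text))
```

The character class is limited to spaces and tabs on purpose: \s would
also match newlines and could run across blank lines.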

Regards,
Pat

From: dasacc22 on 8 May 2010 14:16

On May 8, 12:59 pm, Patrick Maupin <pmau...(a)gmail.com> wrote:
> [...]

Hi,

Thanks for the info. Using .strip() to remove all whitespace in
solution 1 is a must. If you only stripped ' ' spaces, then line
endings would get counted in the len() call and, when multiplied
against ' ', would produce an inaccurate result. Regex is
significantly slower for my purposes, but I've never heard of
mxtexttools. Even if it proves slow, it's spurred my curiosity as to
what functionality it provides (on an unrelated note).

From: Steven D'Aprano on 8 May 2010 14:46

On Sat, 08 May 2010 10:19:16 -0700, dasacc22 wrote:

> Hi
>
> This is a simple question. I'm looking for the fastest way to calculate
> the leading whitespace (as a string, ie ' ').

Is calculating the amount of leading whitespace really the bottleneck in
your application? If not, then trying to shave off microseconds from
something which is a trivial part of your app is almost certainly a waste
of your time.

[...]
> a = ' some content\n'
> b = a.strip()
> c = ' '*(len(a)-len(b))

I take it that you haven't actually tested this code for correctness,
because it's buggy. Let's test it:

>>> leading_whitespace = " "*2 + "\t"*2
>>> a = leading_whitespace + "some non-whitespace text\n"
>>> b = a.strip()
>>> c = " "*(len(a)-len(b))
>>> assert c == leading_whitespace
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AssertionError

Not only doesn't it get the whitespace right, but it doesn't even get the
*amount* of whitespace right:

>>> assert len(c) == len(leading_whitespace)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AssertionError

It doesn't even work correctly if you limit "whitespace" to mean spaces
and nothing else! It's simply wrong in every possible way.
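
The reason the count is off can be seen by comparing strip() with
lstrip() on the same test string. This is only a sketch of the
arithmetic; the variable names stripped_both and stripped_left are
introduced here for illustration.

```python
leading_whitespace = " " * 2 + "\t" * 2
a = leading_whitespace + "some non-whitespace text\n"

# strip() also removes the trailing "\n", so the difference in length
# counts whitespace from both ends of the string.
stripped_both = len(a) - len(a.strip())

# lstrip() touches only the left end, so this counts leading whitespace.
stripped_left = len(a) - len(a.lstrip())

print(stripped_both, stripped_left, len(leading_whitespace))  # prints: 5 4 4
```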

This is why people say that premature optimization is the root of all
(programming) evil. Instead of wasting time and energy trying to optimise
code, you should make it correct first.

Your solutions 2 and 3 are also buggy. And solution 3 can be easily
rewritten to be more straightforward. Instead of the complicated:

> def get_leading_whitespace(s):
>     def _get():
>         for x in s:
>             if x != ' ':
>                 break
>             yield x
>     return ''.join(_get())

try this version:

def get_leading_whitespace(s):
    accumulator = []
    for c in s:
        if c in ' \t\v\f\r\n':
            accumulator.append(c)
        else:
            break
    return ''.join(accumulator)
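
A handful of edge cases is a cheap way to build confidence in a version
like this. The sketch below repeats the loop so it runs on its own; the
test inputs and the solution_2 wrapper are assumptions added for
illustration. It also shows one easily reproduced failure of solution 2,
which may or may not be the bug meant above.

```python
def get_leading_whitespace(s):
    # Loop version from this post, repeated so the checks run on their own.
    accumulator = []
    for c in s:
        if c in ' \t\v\f\r\n':
            accumulator.append(c)
        else:
            break
    return ''.join(accumulator)

def solution_2(a):
    # Solution 2 from the original post, wrapped in a function for testing.
    b = a.strip()
    return a.partition(b[0])[0]

# Hypothetical inputs mapped to the leading whitespace we expect back.
cases = {
    "  \t\tsome text\n": "  \t\t",
    "no indent\n": "",
    "   \n": "   \n",   # whitespace-only line: all of it is "leading"
    "": "",
}

for s, expected in cases.items():
    assert get_leading_whitespace(s) == expected, repr(s)

# Solution 2 indexes b[0], which fails when strip() leaves nothing behind.
try:
    solution_2("   \n")
except IndexError:
    print("solution 2 raises IndexError on a whitespace-only line")
```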

Once you're sure this is correct, then you can optimise it:

def get_leading_whitespace(s):
    t = s.lstrip()
    return s[:len(s)-len(t)]

>>> c = get_leading_whitespace(a)
>>> assert c == leading_whitespace
>>>

Unless your strings are very large, this is likely to be faster than any
other pure-Python solution you can come up with.
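
Anyone who wants to check this on their own data could use a timeit
sketch along these lines. The function names loop_version and
lstrip_version, the sample line and the call count are assumptions;
timings depend on the machine and Python version, so no numbers are
given.

```python
import timeit

def loop_version(s):
    # Character-by-character accumulation, as in the loop version.
    accumulator = []
    for c in s:
        if c in ' \t\v\f\r\n':
            accumulator.append(c)
        else:
            break
    return ''.join(accumulator)

def lstrip_version(s):
    # Let lstrip() find the end of the whitespace, then slice it off.
    t = s.lstrip()
    return s[:len(s) - len(t)]

# Hypothetical sample line with eight leading spaces.
sample = " " * 8 + "some content\n"

# Only compare speed once both versions agree on the answer.
assert loop_version(sample) == lstrip_version(sample)

for func in (loop_version, lstrip_version):
    seconds = timeit.timeit(lambda: func(sample), number=100_000)
    print(f"{func.__name__}: {seconds:.3f} s per 100,000 calls")
```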

--
Steven

From: Wolfram Hinderer on 8 May 2010 15:15

On 8 May, 20:46, Steven D'Aprano <st...(a)REMOVE-THIS-cybersource.com.au> wrote:

> def get_leading_whitespace(s):
>     t = s.lstrip()
>     return s[:len(s)-len(t)]
>
> >>> c = get_leading_whitespace(a)
> >>> assert c == leading_whitespace
>
> Unless your strings are very large, this is likely to be faster than any
> other pure-Python solution you can come up with.

Returning s[:-1 - len(t)] is faster.
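
Before swapping in the negative-index slice, it is worth checking that
it returns the same string. In the sketch below (function names and
inputs are assumptions for illustration) the two slices disagree
whenever leading whitespace is present, because -1 - len(t) cuts one
character earlier than len(s) - len(t).

```python
def by_length(s):
    # Slice up to the number of characters that lstrip() removed.
    t = s.lstrip()
    return s[:len(s) - len(t)]

def by_negative_index(s):
    # The variant suggested above: counts back from the end of s.
    t = s.lstrip()
    return s[:-1 - len(t)]

# Hypothetical inputs, including an unindented and a whitespace-only string.
for s in ("  text\n", "\t\ttext", "text", "   "):
    print(repr(s), repr(by_length(s)), repr(by_negative_index(s)))
```

Dropping the -1 does not fully fix it either: when t is empty,
s[:-len(t)] becomes s[:0], which is the empty string.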
