Fastest way to calculate leading whitespace [Python]


From: Steven D'Aprano on 8 May 2010 15:46

On Sat, 08 May 2010 12:15:22 -0700, Wolfram Hinderer wrote:

> On 8 Mai, 20:46, Steven D'Aprano <st...(a)REMOVE-THIS- cybersource.com.au>
> wrote:
>
>> def get_leading_whitespace(s):
>>     t = s.lstrip()
>>     return s[:len(s)-len(t)]
>>
>> >>> c = get_leading_whitespace(a)
>> >>> assert c == leading_whitespace
>>
>> Unless your strings are very large, this is likely to be faster than
>> any other pure-Python solution you can come up with.
>
> Returning s[:-1 - len(t)] is faster.

I'm sure it is. Unfortunately, it's also incorrect.

>>> z = "*****abcde"
>>> z[:-1-5]
'****'
>>> z[:len(z)-5]
'*****'

However, s[:-len(t)] should be both faster and correct.

--
Steven

From: Mark Dickinson on 8 May 2010 16:46

On May 8, 8:46 pm, Steven D'Aprano <st...(a)REMOVE-THIS-
cybersource.com.au> wrote:
> On Sat, 08 May 2010 12:15:22 -0700, Wolfram Hinderer wrote:
> > On 8 Mai, 20:46, Steven D'Aprano <st...(a)REMOVE-THIS- cybersource.com.au>
> > wrote:
>
> >> def get_leading_whitespace(s):
> >>     t = s.lstrip()
> >>     return s[:len(s)-len(t)]
>
> >> >>> c = get_leading_whitespace(a)
> >> >>> assert c == leading_whitespace
>
> >> Unless your strings are very large, this is likely to be faster than
> >> any other pure-Python solution you can come up with.
>
> > Returning s[:-1 - len(t)] is faster.
>
> I'm sure it is. Unfortunately, it's also incorrect.
>
> >>> z = "*****abcde"
> >>> z[:-1-5]
> '****'
> >>> z[:len(z)-5]
> '*****'
>
> However, s[:-len(t)] should be both faster and correct.

Unless len(t) == 0, surely?

--
Mark
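
Mark's objection can be checked directly. The sketch below is an editorial illustration, not code from the thread; the function names are made up for the comparison. When the string is all whitespace, t is empty, -len(t) is 0, and s[:-0] is the same as s[:0], which is the empty string.

```python
def leading_ws_negative_slice(s):
    # Shortcut proposed in the thread. It fails when nothing is left
    # after stripping, because s[:-0] is s[:0], i.e. ''.
    t = s.lstrip()
    return s[:-len(t)]


def leading_ws(s):
    # Original form: slice up to the index where the content starts.
    t = s.lstrip()
    return s[:len(s) - len(t)]


assert leading_ws_negative_slice("  abc") == "  "   # fine when content follows
assert leading_ws_negative_slice("   ") == ""       # wrong: should be "   "
assert leading_ws("   ") == "   "                   # handles the empty-t case
```

The two forms agree whenever at least one non-whitespace character follows the indentation; only the whitespace-only (or empty-after-strip) case differs.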

From: dasacc22 on 8 May 2010 17:27

You presume entirely too much. I have a preprocessor that normalizes
documents while performing other, more complex operations. There's
nothing buggy about what I'm doing.

On May 8, 1:46 pm, Steven D'Aprano <st...(a)REMOVE-THIS-
cybersource.com.au> wrote:
> On Sat, 08 May 2010 10:19:16 -0700, dasacc22 wrote:
> > Hi
>
> > This is a simple question. I'm looking for the fastest way to calculate
> > the leading whitespace (as a string, ie ' ').
>
> Is calculating the amount of leading whitespace really the bottleneck in
> your application? If not, then trying to shave off microseconds from
> something which is a trivial part of your app is almost certainly a waste
> of your time.
>
> [...]
>
> > a = ' some content\n'
> > b = a.strip()
> > c = ' '*(len(a)-len(b))
>
> I take it that you haven't actually tested this code for correctness,
> because it's buggy. Let's test it:
>
> >>> leading_whitespace = " "*2 + "\t"*2
> >>> a = leading_whitespace + "some non-whitespace text\n"
> >>> b = a.strip()
> >>> c = " "*(len(a)-len(b))
> >>> assert c == leading_whitespace
>
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> AssertionError
>
> Not only doesn't it get the whitespace right, but it doesn't even get the
> *amount* of whitespace right:
>
> >>> assert len(c) == len(leading_whitespace)
>
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> AssertionError
>
> It doesn't even work correctly if you limit "whitespace" to mean spaces
> and nothing else! It's simply wrong in every possible way.
>
> This is why people say that premature optimization is the root of all
> (programming) evil. Instead of wasting time and energy trying to optimise
> code, you should make it correct first.
>
> Your solutions 2 and 3 are also buggy. And solution 3 can be easily
> rewritten to be more straightforward. Instead of the complicated:
>
> > def get_leading_whitespace(s):
> >     def _get():
> >         for x in s:
> >             if x != ' ':
> >                 break
> >             yield x
> >     return ''.join(_get())
>
> try this version:
>
> def get_leading_whitespace(s):
>     accumulator = []
>     for c in s:
>         if c in ' \t\v\f\r\n':
>             accumulator.append(c)
>         else:
>             break
>     return ''.join(accumulator)
>
> Once you're sure this is correct, then you can optimise it:
>
> def get_leading_whitespace(s):
>     t = s.lstrip()
>     return s[:len(s)-len(t)]
>
> >>> c = get_leading_whitespace(a)
> >>> assert c == leading_whitespace
>
> Unless your strings are very large, this is likely to be faster than any
> other pure-Python solution you can come up with.
>
> --
> Steven
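
The "make it correct first, then optimise" step quoted above can be sketched as a small equivalence check. This is an illustration only: the two function names and the sample strings are assumptions, not from the thread. It also assumes the input uses ASCII whitespace only, because str.lstrip() with no argument strips other whitespace characters as well, which the explicit character list in the loop version does not cover.

```python
def leading_ws_loop(s):
    """Reference version: collect leading ASCII whitespace character by character."""
    accumulator = []
    for c in s:
        if c in ' \t\v\f\r\n':
            accumulator.append(c)
        else:
            break
    return ''.join(accumulator)


def leading_ws_lstrip(s):
    """Optimised version: let str.lstrip() find where the content starts."""
    return s[:len(s) - len(s.lstrip())]


# Hypothetical sample inputs, including the empty and whitespace-only cases.
samples = ["", "   ", "  \t\tsome text\n", "no indent", "\n\n x ", "\t"]
for s in samples:
    assert leading_ws_loop(s) == leading_ws_lstrip(s), repr(s)
```

Once the two versions agree on representative inputs, the faster one can replace the reference version with some confidence.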

From: Patrick Maupin on 8 May 2010 18:18

On May 8, 1:16 pm, dasacc22 <dasac...(a)gmail.com> wrote:
> On May 8, 12:59 pm, Patrick Maupin <pmau...(a)gmail.com> wrote:
>
>
>
> > On May 8, 12:19 pm, dasacc22 <dasac...(a)gmail.com> wrote:
>
> > > Hi
>
> > > This is a simple question. I'm looking for the fastest way to
> > > calculate the leading whitespace (as a string, ie ' ').
>
> > > Here are some different methods I have tried so far
> > > --- solution 1
>
> > > a = ' some content\n'
> > > b = a.strip()
> > > c = ' '*(len(a)-len(b))
>
> > > --- solution 2
>
> > > a = ' some content\n'
> > > b = a.strip()
> > > c = a.partition(b[0])[0]
>
> > > --- solution 3
>
> > > > def get_leading_whitespace(s):
> > > >     def _get():
> > > >         for x in s:
> > > >             if x != ' ':
> > > >                 break
> > > >             yield x
> > > >     return ''.join(_get())
>
> > > ---
>
> > > Solution 1 seems to be about as fast as solution 2 except in certain
> > > circumstances where the value of b has already been determined for
> > > other purposes. Solution 3 is slower due to the function overhead.
>
> > > Curious to see what other types of solutions people might have.
>
> > > Thanks,
> > > Daniel
>
> > Well, you could try a solution using re, but that's probably only
> > likely to be faster if you can use it on multiple concatenated lines.
> > I usually use something like your solution #1. One thing to be aware
> > of, though, is that strip() with no parameters will strip *any*
> > whitespace, not just spaces, so the implicit assumption in your code
> > that what you have stripped is spaces may not be justified (depending
> > on the source data). OTOH, depending on how you use that whitespace
> > information, it may not really matter. But if it does matter, you can
> > use strip(' ')
>
> > If speed is really an issue for you, you could also investigate
> > mxtexttools, but, like re, it might perform better if the source
> > consists of several batched lines.
>
> > Regards,
> > Pat
>
> Hi,
>
> thanks for the info. Using .strip() to remove all whitespace in
> solution 1 is a must. If you only stripped ' ' spaces then line
> endings would get counted in the len() call and, when multiplied
> against ' ', would produce an inaccurate result. Regex is
> significantly slower for my purposes, but I've never heard of
> mxtexttools. Even if it proves slow, it's spurred my curiosity as to
> what functionality it provides (on an unrelated note).

Could you reorganize your code to do multiple lines at a time? That
might make regex competitive.

Regards,
Pat
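
A minimal sketch of the "multiple lines at a time" idea, assuming the lines are batched into one string and that indentation means spaces and tabs only. The pattern, the helper name, and the sample text are illustrative assumptions, not something posted in the thread.

```python
import re

# Leading spaces/tabs at the start of every line that has some content.
# [ \t]* (rather than \s*) keeps a match from running across newlines,
# and the lookahead skips blank and whitespace-only lines.
_INDENT = re.compile(r'^[ \t]*(?=\S)', re.MULTILINE)


def leading_whitespace_per_line(text):
    """Return the indentation of each non-blank line in text, in order."""
    return _INDENT.findall(text)


batch = "  a\n\tb\nc\n"
print(leading_whitespace_per_line(batch))
```

One pass of findall() replaces a Python-level loop over lines. Because blank lines are skipped, positions in the result do not correspond to line numbers, so this form only fits when line numbers are not needed.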

From: dasacc22 on 8 May 2010 23:48

On May 8, 5:18 pm, Patrick Maupin <pmau...(a)gmail.com> wrote:
> On May 8, 1:16 pm, dasacc22 <dasac...(a)gmail.com> wrote:
>
> > On May 8, 12:59 pm, Patrick Maupin <pmau...(a)gmail.com> wrote:
>
> > > On May 8, 12:19 pm, dasacc22 <dasac...(a)gmail.com> wrote:
>
> > > > Hi
>
> > > > This is a simple question. I'm looking for the fastest way to
> > > > calculate the leading whitespace (as a string, ie ' ').
>
> > > > Here are some different methods I have tried so far
> > > > --- solution 1
>
> > > > a = ' some content\n'
> > > > b = a.strip()
> > > > c = ' '*(len(a)-len(b))
>
> > > > --- solution 2
>
> > > > a = ' some content\n'
> > > > b = a.strip()
> > > > c = a.partition(b[0])[0]
>
> > > > --- solution 3
>
> > > > > def get_leading_whitespace(s):
> > > > >     def _get():
> > > > >         for x in s:
> > > > >             if x != ' ':
> > > > >                 break
> > > > >             yield x
> > > > >     return ''.join(_get())
>
> > > > ---
>
> > > > Solution 1 seems to be about as fast as solution 2 except in certain
> > > > circumstances where the value of b has already been determined for
> > > > other purposes. Solution 3 is slower due to the function overhead.
>
> > > > Curious to see what other types of solutions people might have.
>
> > > > Thanks,
> > > > Daniel
>
> > > Well, you could try a solution using re, but that's probably only
> > > likely to be faster if you can use it on multiple concatenated lines.
> > > I usually use something like your solution #1. One thing to be aware
> > > of, though, is that strip() with no parameters will strip *any*
> > > whitespace, not just spaces, so the implicit assumption in your code
> > > that what you have stripped is spaces may not be justified (depending
> > > on the source data). OTOH, depending on how you use that whitespace
> > > information, it may not really matter. But if it does matter, you can
> > > use strip(' ')
>
> > > If speed is really an issue for you, you could also investigate
> > > mxtexttools, but, like re, it might perform better if the source
> > > consists of several batched lines.
>
> > > Regards,
> > > Pat
>
> > Hi,
>
> > thanks for the info. Using .strip() to remove all whitespace in
> > solution 1 is a must. If you only stripped ' ' spaces then line
> > endings would get counted in the len() call and, when multiplied
> > against ' ', would produce an inaccurate result. Regex is
> > significantly slower for my purposes, but I've never heard of
> > mxtexttools. Even if it proves slow, it's spurred my curiosity as to
> > what functionality it provides (on an unrelated note).
>
> Could you reorganize your code to do multiple lines at a time? That
> might make regex competitive.
>
> Regards,
> Pat

I have tried this already; the problem is that it's not a trivial
matter. Iterating over each line is unavoidable, and I found that
using various Python builtins to perform string operations (like, say,
the wonderful str.partition method) during each iteration is about
three times faster than regexing the entire document for my various
needs. Another issue is having to keep a line count: iterating over
regex matches while counting lines doesn't scale nearly as well as a
straight Python solution using builtins to process the information.

At the heart of it, determining the leading whitespace is a
trivial matter. I have much more complex problems to deal with. I was
much more interested in seeing what kind of solutions people would come
up with for such a problem, and perhaps uncovering something new in
Python that I can apply to a more complex problem. What spurred the
thought was this piece written by Guido concerning "what's the best way
to convert a list of integers into a string". It's a simple question
where concepts are introduced that can lead to solving more complex
problems.

http://www.python.org/doc/essays/list2str.html
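
The per-line approach with a running line count described above might look roughly like the following. This is a sketch under assumptions: the function name and sample document are hypothetical, and the poster's actual preprocessor is not shown anywhere in the thread.

```python
def indent_by_line(text):
    """Yield (line_number, leading_whitespace) for each line, counting from 1."""
    # splitlines() drops the line endings, so a trailing newline is never
    # mistaken for part of the indentation.
    for lineno, line in enumerate(text.splitlines(), start=1):
        stripped = line.lstrip()
        yield lineno, line[:len(line) - len(stripped)]


doc = "def f():\n    return 1\n\n\tx = 2\n"
for lineno, indent in indent_by_line(doc):
    print(lineno, repr(indent))
```

enumerate() supplies the line count for free, which is the part that is awkward to recover when a single regex is run over the whole document.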
