From: dasacc22 on
On May 8, 2:46 pm, Steven D'Aprano <st...(a)REMOVE-THIS-
cybersource.com.au> wrote:
> On Sat, 08 May 2010 12:15:22 -0700, Wolfram Hinderer wrote:
> > On 8 May, 20:46, Steven D'Aprano <st...(a)REMOVE-THIS- cybersource.com.au>
> > wrote:
>
> >> def get_leading_whitespace(s):
> >>     t = s.lstrip()
> >>     return s[:len(s)-len(t)]
>
> >> >>> c = get_leading_whitespace(a)
> >> >>> assert c == leading_whitespace
>
> >> Unless your strings are very large, this is likely to be faster than
> >> any other pure-Python solution you can come up with.
>
> > Returning s[:-1 - len(t)] is faster.
>
> I'm sure it is. Unfortunately, it's also incorrect.
>
> >>> z = "*****abcde"
> >>> z[:-1-5]
> '****'
> >>> z[:len(z)-5]
> '*****'
>
> However, s[:-len(t)] should be both faster and correct.
>
> --
> Steven

This is without a doubt faster and simpler than any solution thus far.
Thank you for this.
From: Steven D'Aprano on
On Sat, 08 May 2010 13:46:59 -0700, Mark Dickinson wrote:

>> However, s[:-len(t)] should be both faster and correct.
>
> Unless len(t) == 0, surely?

Doh! The hazards of insufficient testing. Thanks for catching that.
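
Putting the two fixes together, a minimal sketch that keeps the fast
slice but guards the all-whitespace case (the guard is an editorial
addition, not from the posts above):

def get_leading_whitespace(s):
    t = s.lstrip()
    if not t:           # all-whitespace input: s[:-0] would wrongly give ''
        return s
    return s[:-len(t)]  # the fast slice discussed above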



--
Steven
From: Steven D'Aprano on
On Sat, 08 May 2010 14:27:32 -0700, dasacc22 wrote:

> You presume entirely too much. I have a preprocessor that normalizes
> documents while performing other more complex operations. There's
> nothing buggy about what I'm doing.

I didn't *presume* anything, I took your example code and ran it and
discovered that it didn't do what you said it was doing.


--
Steven
From: Wolfram Hinderer on
On 8 May, 21:46, Steven D'Aprano <st...(a)REMOVE-THIS-
cybersource.com.au> wrote:
> On Sat, 08 May 2010 12:15:22 -0700, Wolfram Hinderer wrote:
> > Returning s[:-1 - len(t)] is faster.
>
> I'm sure it is. Unfortunately, it's also incorrect.

> However, s[:-len(t)] should be both faster and correct.

Ouch. Thanks for correcting me.

No, I'll never tell how that -1 crept in...
From: John Machin on
dasacc22 <dasacc22 <at> gmail.com> writes:

>
> You presume entirely too much. I have a preprocessor that normalizes
> documents while performing other more complex operations. There's
> nothing buggy about what I'm doing.

Are you sure?

Your "solution" calculates (the number of leading whitespace characters) + (the
number of TRAILING whitespace characters).

Problem 1: including TRAILING whitespace.
Example: "content" + 3 * " " + "\n" has 4 leading spaces according to your
reckoning; should be 0.
Fix: use lstrip() instead of strip()
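
A minimal sketch of the difference (the sample string is illustrative):

s = "content   \n"
len(s) - len(s.strip())     # 4: counts the trailing run as well
len(s) - len(s.lstrip())    # 0: leading whitespace only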

Problem 2: assuming all whitespace characters have *effective* width the same as
" ".
Examples: TAB has width 4 or 8 or whatever you want it to be. There are quite a
number of whitespace characters, even when you stick to ASCII. When you look at
Unicode, there are heaps more. Here's a list of BMP characters such that
character.isspace() is True, showing the Unicode codepoint, the Python repr(),
and the name of the character (other than for control characters):

U+0009 u'\t' ?
U+000A u'\n' ?
U+000B u'\x0b' ?
U+000C u'\x0c' ?
U+000D u'\r' ?
U+001C u'\x1c' ?
U+001D u'\x1d' ?
U+001E u'\x1e' ?
U+001F u'\x1f' ?
U+0020 u' ' SPACE
U+0085 u'\x85' ?
U+00A0 u'\xa0' NO-BREAK SPACE
U+1680 u'\u1680' OGHAM SPACE MARK
U+2000 u'\u2000' EN QUAD
U+2001 u'\u2001' EM QUAD
U+2002 u'\u2002' EN SPACE
U+2003 u'\u2003' EM SPACE
U+2004 u'\u2004' THREE-PER-EM SPACE
U+2005 u'\u2005' FOUR-PER-EM SPACE
U+2006 u'\u2006' SIX-PER-EM SPACE
U+2007 u'\u2007' FIGURE SPACE
U+2008 u'\u2008' PUNCTUATION SPACE
U+2009 u'\u2009' THIN SPACE
U+200A u'\u200a' HAIR SPACE
U+200B u'\u200b' ZERO WIDTH SPACE
U+2028 u'\u2028' LINE SEPARATOR
U+2029 u'\u2029' PARAGRAPH SEPARATOR
U+202F u'\u202f' NARROW NO-BREAK SPACE
U+205F u'\u205f' MEDIUM MATHEMATICAL SPACE
U+3000 u'\u3000' IDEOGRAPHIC SPACE

Hmmm, looks like all kinds of widths, from zero upwards.
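
For reference, a sketch that regenerates the list above under Python 2
(unichr and the print statement are 2.x-only, matching the u'...' reprs):

import unicodedata

for cp in range(0x10000):   # Basic Multilingual Plane
    ch = unichr(cp)
    if ch.isspace():
        # control characters have no Unicode name, hence the '?' entries
        print 'U+%04X %r %s' % (cp, ch, unicodedata.name(ch, '?'))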