Regex driving me crazy... [Python]


From: Steven D'Aprano on 7 Apr 2010 22:51

On Wed, 07 Apr 2010 18:03:47 -0700, Patrick Maupin wrote:

> BTW, although I find it annoying when people say "don't do that" when
> "that" is a perfectly good thing to do, and although I also find it
> annoying when people tell you what not to do without telling you what
> *to* do,

Grant did give a perfectly good solution.

> and although I find the regex solution to this problem to be
> quite clean, the equivalent non-regex solution is not terrible, so I
> will present it as well, for your viewing pleasure:
>
> >>> [x for x in '# 1 Short offline Completed without error 00%'.split(' ') if x.strip()]
> ['# 1', 'Short offline', ' Completed without error', ' 00%']

This is one of the reasons we're so often suspicious of re solutions:

>>> import re
>>> from timeit import Timer
>>> s = '# 1 Short offline Completed without error 00%'
>>> tre = Timer("re.split(' {2,}', s)",
... "import re; from __main__ import s")
>>> tsplit = Timer("[x for x in s.split(' ') if x.strip()]",
... "from __main__ import s")
>>>
>>> re.split(' {2,}', s) == [x for x in s.split(' ') if x.strip()]
True
>>>
>>>
>>> min(tre.repeat(repeat=5))
6.1224789619445801
>>> min(tsplit.repeat(repeat=5))
1.8338048458099365

Even when they are correct and not unreadable line-noise, regexes tend to
be slow. And they get worse as the size of the input increases:

>>> s *= 1000
>>> min(tre.repeat(repeat=5, number=1000))
2.3496899604797363
>>> min(tsplit.repeat(repeat=5, number=1000))
0.41538596153259277
>>>
>>> s *= 10
>>> min(tre.repeat(repeat=5, number=1000))
23.739185094833374
>>> min(tsplit.repeat(repeat=5, number=1000))
4.6444299221038818

And this isn't even one of the pathological O(N**2) or O(2**N) regexes.
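
As an illustration of what such a pathological case can look like, here is a minimal sketch. The pattern `(a+)+$` and the helper name `time_failed_match` are assumptions chosen for the example, not something taken from the thread. With a backtracking engine such as Python's `re`, the time for a failed match tends to grow roughly exponentially with the length of the run of 'a' characters; the exact timings depend on the machine and Python version.

```python
import re
import time

# Nested quantifiers make a backtracking engine retry many ways of splitting
# the run of 'a's before it can conclude that the match fails.
pathological = re.compile(r"(a+)+$")

def time_failed_match(n):
    """Return seconds spent failing to match n 'a's followed by one 'b'."""
    text = "a" * n + "b"
    start = time.perf_counter()
    result = pathological.match(text)
    elapsed = time.perf_counter() - start
    assert result is None  # the trailing 'b' guarantees failure
    return elapsed

# Sizes are kept small so the sketch finishes quickly.
for n in (12, 14, 16, 18):
    print(n, round(time_failed_match(n), 4))
```

The split pattern `' {2,}'` used above has no nested quantifiers, so it does not show this kind of blow-up.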

Don't get me wrong -- regexes are a useful tool. But if your first
instinct is to write a regex, you're doing it wrong.

[quote]
A related problem is Perl's over-reliance on regular expressions
that is exaggerated by advocating regex-based solution in almost
all O'Reilly books. The latter until recently were the most
authoritative source of published information about Perl.

While simple regular expression is a beautiful thing and can
simplify operations with string considerably, overcomplexity in
regular expressions is extremly dangerous: it cannot serve a basis
for serious, professional programming, it is fraught with pitfalls,
a big semantic mess as a result of outgrowing its primary purpose.
Diagnostic for errors in regular expressions is even weaker then
for the language itself and here many things are just go unnoticed.
[end quote]

http://www.softpanorama.org/Scripting/Perlbook/Ch01/place_of_perl_among_other_lang.shtml

Even Larry Wall has criticised Perl's regex culture:

http://dev.perl.org/perl6/doc/design/apo/A05.html

--
Steven

From: J on 7 Apr 2010 23:01

On Wed, Apr 7, 2010 at 22:45, Patrick Maupin <pmaupin(a)gmail.com> wrote:

> When I saw "And I am interested in the string that appears in the
> third column, which changes as the test runs and then completes" I
> assumed that, not only could that string change, but so could the one
> before it.
>
> I guess my base assumption was that anything with words in it could
> change. I was looking at the OP's attempt at a solution, and he
> obviously felt he needed to see two or more spaces as an item
> delimiter.

I apologize for the confusion, Pat...

I could have worded that better, but at that point I was A:
Frustrated, B: starving, and C: had my wife nagging me to stop working
to come get something to eat ;-)

What I meant was that, in that output string, the phrase in the middle
could change in length...
After looking at the source code for smartctl (part of the
smartmontools package, for you Linux people), I found the switch that
creates those status messages. They vary in character length, some
with non-text characters like ( and ) and /, and have either 3 or 4
words...

The spaces between each column, instead of being a fixed number of
spaces each, were seemingly arbitrarily created... there may be 4
spaces between two columns or there may be 9, or 7 or who knows what,
and since they were all being treated as individual spaces instead of
tabs or something, I was having trouble splitting the output into
something that was easy to parse (at least in my mind it seemed that
way).
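
A minimal sketch of the splitting problem described above, assuming a sample line whose columns are separated by runs of two or more spaces while single spaces occur inside fields. The sample spacing and the helper name `split_columns` are assumptions for illustration, not actual smartctl output.

```python
import re

# Hypothetical sample line. The gaps are built explicitly so their widths are
# unambiguous; real smartctl spacing may differ.
line = ("# 1" + " " * 2 + "Short offline" + " " * 7 +
        "Completed without error" + " " * 9 + "00%")

def split_columns(text):
    """Split on runs of two or more spaces, keeping single spaces inside fields."""
    return re.split(r" {2,}", text.strip())

columns = split_columns(line)
status = columns[2]  # third column: the status message
print(columns)
print(status)
```

Because the delimiter is "two or more spaces", the number of spaces between columns does not matter, as long as no field contains a double space itself.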

Anyway, that's that... and I do apologize if my original post was
confusing at all...

Cheers
Jeff

From: Patrick Maupin on 7 Apr 2010 23:04

On Apr 7, 9:51 pm, Steven D'Aprano
<ste...(a)REMOVE.THIS.cybersource.com.au> wrote:
> On Wed, 07 Apr 2010 18:03:47 -0700, Patrick Maupin wrote:
> > BTW, although I find it annoying when people say "don't do that" when
> > "that" is a perfectly good thing to do, and although I also find it
> > annoying when people tell you what not to do without telling you what
> > *to* do,
>
> Grant did give a perfectly good solution.

Yeah, I noticed later and apologized for that. What he gave will work
perfectly if the only data that changes the number of words is the
data the OP is looking for. This may or may not be true. I don't
know anything about the program generating the data, but I did notice
that the OP's attempt at an answer indicated that the OP felt (rightly
or wrongly) he needed to split on two or more spaces.

>
> > and although I find the regex solution to this problem to be
> > quite clean, the equivalent non-regex solution is not terrible, so I
> > will present it as well, for your viewing pleasure:
>
> > >>> [x for x in '# 1 Short offline Completed without error 00%'.split(' ') if x.strip()]
> > ['# 1', 'Short offline', ' Completed without error', ' 00%']
>
> This is one of the reasons we're so often suspicious of re solutions:
>
> >>> s = '# 1 Short offline Completed without error 00%'
> >>> tre = Timer("re.split(' {2,}', s)",
> ... "import re; from __main__ import s")
> >>> tsplit = Timer("[x for x in s.split(' ') if x.strip()]",
> ... "from __main__ import s")
>
> >>> re.split(' {2,}', s) == [x for x in s.split(' ') if x.strip()]
> True
>
> >>> min(tre.repeat(repeat=5))
> 6.1224789619445801
> >>> min(tsplit.repeat(repeat=5))
> 1.8338048458099365
>
> Even when they are correct and not unreadable line-noise, regexes tend to
> be slow. And they get worse as the size of the input increases:
>
> >>> s *= 1000
> >>> min(tre.repeat(repeat=5, number=1000))
> 2.3496899604797363
> >>> min(tsplit.repeat(repeat=5, number=1000))
> 0.41538596153259277
>
> >>> s *= 10
> >>> min(tre.repeat(repeat=5, number=1000))
> 23.739185094833374
> >>> min(tsplit.repeat(repeat=5, number=1000))
> 4.6444299221038818
>
> And this isn't even one of the pathological O(N**2) or O(2**N) regexes.
>
> Don't get me wrong -- regexes are a useful tool. But if your first
> instinct is to write a regex, you're doing it wrong.
>
> [quote]
> A related problem is Perl's over-reliance on regular expressions
> that is exaggerated by advocating regex-based solution in almost
> all O'Reilly books. The latter until recently were the most
> authoritative source of published information about Perl.
>
> While simple regular expression is a beautiful thing and can
> simplify operations with string considerably, overcomplexity in
> regular expressions is extremly dangerous: it cannot serve a basis
> for serious, professional programming, it is fraught with pitfalls,
> a big semantic mess as a result of outgrowing its primary purpose.
> Diagnostic for errors in regular expressions is even weaker then
> for the language itself and here many things are just go unnoticed.
> [end quote]
>
> http://www.softpanorama.org/Scripting/Perlbook/Ch01/place_of_perl_among_other_lang.shtml
>
> Even Larry Wall has criticised Perl's regex culture:
>
> http://dev.perl.org/perl6/doc/design/apo/A05.html

Bravo!!! Good data, quotes, references, all good stuff!

I absolutely agree that regex shouldn't always be the first thing you
reach for, but I have been reading way too much unsubstantiated "this is
bad. Don't do it." on the subject recently, in particular when
people say "Don't use regex. Use PyParsing!" It may be good advice
in the right context, but it's a bit disingenuous not to mention that
PyParsing will use regex under the covers...

Regards,
Pat

From: Grant Edwards on 7 Apr 2010 23:10

On 2010-04-08, Patrick Maupin <pmaupin(a)gmail.com> wrote:

> Sorry, my eyes completely missed your one-liner, so my criticism about
> not posting a solution was unwarranted. I don't think you and I read
> the problem the same way (which is probably why I didn't notice your
> solution -- because it wasn't solving the problem I thought I saw).

No worries.

> When I saw "And I am interested in the string that appears in the
> third column, which changes as the test runs and then completes" I
> assumed that, not only could that string change, but so could the one
> before it.

If that's the case, my solution won't work right.

> I guess my base assumption was that anything with words in it could
> change. I was looking at the OP's attempt at a solution, and he
> obviously felt he needed to see two or more spaces as an item
> delimiter.

If the requirement is indeed two or more spaces as a delimiter with
spaces allowed in any field, then a regular expression split is
probably the best solution.
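
To illustrate the difference, here is a rough sketch. The one-liner discussed above is not reproduced in this part of the thread, so `third_column_naive` is only an assumed stand-in for a whitespace split with fixed word positions; the sample lines and both function names are hypothetical.

```python
import re

# Hypothetical lines: the status field has 3 words in one and 4 in the other.
lines = [
    "# 1" + " " * 2 + "Short offline" + " " * 7 +
    "Completed without error" + " " * 7 + "00%",
    "# 2" + " " * 2 + "Extended offline" + " " * 4 +
    "Self-test routine in progress" + " " * 2 + "90%",
]

def third_column_naive(line):
    # Assumes the status is always words 4..6 of a plain whitespace split.
    return " ".join(line.split()[4:7])

def third_column_regex(line):
    # Treats any run of two or more spaces as the column delimiter.
    return re.split(r" {2,}", line.strip())[2]

for line in lines:
    print(repr(third_column_naive(line)), repr(third_column_regex(line)))
```

The fixed-position version works only while every field keeps the same word count; the regex version depends only on the two-or-more-spaces delimiter.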

--
Grant

From: Patrick Maupin on 7 Apr 2010 23:26

On Apr 7, 9:51 pm, Steven D'Aprano
<ste...(a)REMOVE.THIS.cybersource.com.au> wrote:

> This is one of the reasons we're so often suspicious of re solutions:
>
> >>> s = '# 1 Short offline Completed without error 00%'
> >>> tre = Timer("re.split(' {2,}', s)",
> ... "import re; from __main__ import s")
> >>> tsplit = Timer("[x for x in s.split(' ') if x.strip()]",
> ... "from __main__ import s")
>
> >>> re.split(' {2,}', s) == [x for x in s.split(' ') if x.strip()]
> True
>
> >>> min(tre.repeat(repeat=5))
> 6.1224789619445801
> >>> min(tsplit.repeat(repeat=5))
> 1.8338048458099365

I will confess that, in my zeal to defend re, I gave a simple one-
liner, rather than the more optimized version:

>>> from timeit import Timer
>>> s = '# 1 Short offline Completed without error 00%'
>>> tre = Timer("splitter(s)",
... "import re; from __main__ import s; "
... "splitter = re.compile(' {2,}').split")
>>> tsplit = Timer("[x for x in s.split(' ') if x.strip()]",
... "from __main__ import s")
>>> min(tre.repeat(repeat=5))
1.893190860748291
>>> min(tsplit.repeat(repeat=5))
2.0661051273345947

You're right that if you have an 800K byte string, re doesn't perform
as well as split, but the delta is only a few percent.

>>> s *= 10000
>>> min(tre.repeat(repeat=5, number=1000))
15.331652164459229
>>> min(tsplit.repeat(repeat=5, number=1000))
14.596404075622559
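
For anyone who wants to repeat the comparison as a script, here is a minimal sketch of the same idea: compile the pattern once and reuse its bound `split` method. The sample string, the helper name `split_plain`, and the small `number` value are assumptions for illustration; timings will differ by machine and Python version, so no expected numbers are shown.

```python
import re
from timeit import Timer

# Hypothetical sample; the gaps have both odd and even widths on purpose.
s = ("# 1" + " " * 2 + "Short offline" + " " * 3 +
     "Completed without error" + " " * 7 + "00%")

splitter = re.compile(r" {2,}").split  # compile once, reuse the bound method

def split_plain(text):
    """Non-regex version: split on two spaces, drop empties, trim leftovers."""
    return [x.strip() for x in text.split("  ") if x.strip()]

# Both approaches should agree before their timings are compared.
assert splitter(s) == split_plain(s)

# number is kept small so the sketch finishes quickly.
t_re = Timer("splitter(s)", globals=globals())
t_plain = Timer("split_plain(s)", globals=globals())
print(min(t_re.repeat(repeat=3, number=10000)))
print(min(t_plain.repeat(repeat=3, number=10000)))
```

The extra `x.strip()` on each kept item matters when a gap has an odd number of spaces: splitting on two spaces then leaves one leading space on the next field.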

Regards,
Pat
