From: Nobody on
On Wed, 07 Apr 2010 18:25:36 -0700, Patrick Maupin wrote:

>> Regular expressions != Parsers
>
> True, but lots of parsers *use* regular expressions in their
> tokenizers. In fact, if you have a pure Python parser, you can often
> get huge performance gains by rearranging your code slightly so that
> you can use regular expressions in your tokenizer, because that
> effectively gives you access to a fast, specialized C library that is
> built into practically every Python interpreter on the planet.

Unfortunately, a typical regexp library (including Python's) doesn't allow
you to match against a set of regexps, returning the index of which one
matched. Which is what you really want for a tokeniser.

>> Every time someone tries to parse nested structures using regular
>> expressions, Jamie Zawinski kills a puppy.
>
> And yet, if you are parsing stuff in Python, and your parser doesn't
> use some specialized C code for tokenization (which will probably be
> regular expressions unless you are using mxtexttools or some other
> specialized C tokenizer code), your nested structure parser will be
> dog slow.

The point is that you *cannot* match arbitrarily-nested expressions using
regexps. You could, in theory, write a regexp which will match any valid
syntax up to N levels of nesting, for any finite N. But in practice, the
regexp is going to look horrible (and is probably going to be quite
inefficient if the regexp library uses backtracking rather than a DFA).
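
To make that concrete, here is a sketch (the patterns are only illustrative)
of depth-limited matching of parenthesized groups; every extra level of
nesting means repeating the whole inner pattern again:

import re

# balanced parens with no nesting:  ( ... )
depth1 = re.compile(r'\([^()]*\)$')
# balanced parens with at most one inner level:  ( ... ( ... ) ... )
depth2 = re.compile(r'\((?:[^()]|\([^()]*\))*\)$')

print bool(depth1.match('(abc)'))        # True
print bool(depth1.match('(a(b)c)'))      # False - one level too deep
print bool(depth2.match('(a(b)c)'))      # True
print bool(depth2.match('(a(b(c))d)'))   # False - needs yet another copy of the pattern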

Even tokenising with Python's regexp interface is inefficient if the
number of token types is large, as you have to test against each regexp
sequentially.

Ultimately, if you want an efficient parser, you need something with a C
component, e.g. Plex.

From: Richard Lamboj on

At the moment I have little time, so it's painful to read about parsing, but it
is quite interesting.

I have taken a look at the different parsing modules and I'm reading the source
code to understand how they work. Since yesterday I've been writing my own
small engine - just for fun, and to understand how I can get what I need.

It seems that this is hard to code, because the logic is complex and sometimes
confusing. It's not easy to find a "perfect" solution.

If someone knows good links on this topic, or can explain how parsers
should/could work, please post them, or explain it.

Thanks for the information and the help!

Kind Regards

Richi
From: Charles on

"Nobody" <nobody(a)nowhere.com> wrote in message
news:pan.2010.04.08.10.12.59.594000(a)nowhere.com...
> On Wed, 07 Apr 2010 18:25:36 -0700, Patrick Maupin wrote:
>
>>> Regular expressions != Parsers
>>
>> True, but lots of parsers *use* regular expressions in their
>> tokenizers. In fact, if you have a pure Python parser, you can often
>> get huge performance gains by rearranging your code slightly so that
>> you can use regular expressions in your tokenizer, because that
>> effectively gives you access to a fast, specialized C library that is
>> built into practically every Python interpreter on the planet.
>
> Unfortunately, a typical regexp library (including Python's) doesn't allow
> you to match against a set of regexps, returning the index of which one
> matched. Which is what you really want for a tokeniser.
>
[snip]

Really!?
I am only a Python newbie, but what about ...

import re
rr = [
( "id", '([a-zA-Z][a-zA-Z0-9]*)' ),
( "int", '([+-]?[0-9]+)' ),
( "float", '([+-]?[0-9]+\.[0-9]*)' ),
( "float", '([+-]?[0-9]+\.[0-9]*[eE][+-]?[0-9]+)' )
]
tlist = [ t[0] for t in rr ]
pat = '^ *(' + '|'.join([ t[1] for t in rr ]) + ') *$'
p = re.compile(pat)

ss = [ ' annc', '1234', 'abcd', ' 234sz ', '-1.24e3', '5.' ]
for s in ss:
    m = p.match(s)
    if m:
        ix = [ i-2 for i in range(2,6) if m.group(i) ]
        print "'"+s+"' matches and has type", tlist[ix[0]]
    else:
        print "'"+s+"' does not match"

output:
' annc' matches and has type id
'1234' matches and has type int
'abcd' matches and has type id
' 234sz ' does not match
'-1.24e3' matches and has type float
'5.' matches and has type float

seems to me to match a (small) set of regular expressions and
indirectly return the index of the matched expression, without
doing a sequential loop over the regular expressions.

Of course there is a loop over the results of the match to determine
which sub-expression matched, but a good regexp library (which
I presume Python has) should match the sub-expressions without
looping over them. The techniques to do this were well known in
the 1970's when the first versions of lex were written.

Not that I would recommend tricks like this. The regular
expression would quickly get out of hand for any non-trivial
list of regular expressions to match.
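
For what it's worth, named groups make the bookkeeping less fragile than
counting group numbers by hand - something like this sketch, with the two
float patterns merged into one and the same caveat that it won't scale to a
serious token set:

import re

p = re.compile(r'^ *(?:(?P<float>[+-]?[0-9]+\.[0-9]*(?:[eE][+-]?[0-9]+)?)'
               r'|(?P<int>[+-]?[0-9]+)'
               r'|(?P<id>[a-zA-Z][a-zA-Z0-9]*)) *$')

for s in [ ' annc', '1234', 'abcd', ' 234sz ', '-1.24e3', '5.' ]:
    m = p.match(s)
    if m:
        print "'"+s+"' matches and has type", m.lastgroup
    else:
        print "'"+s+"' does not match"

The alternatives are still tried left to right by the backtracking engine,
but the classification falls out of m.lastgroup instead of group-index
arithmetic.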

Charles



From: Neil Cerutti on
On 2010-04-08, Richard Lamboj <richard.lamboj(a)bilcom.at> wrote:
> If someone knows good links on this topic, or can explain how
> parsers should/could work, please post them, or explain it.
>
> Thanks for the information and the help!

I liked Crenshaw's "Let's Build a Compiler!". It's pretty trivial
to convert his Pascal to Python, and you'll get to basic parsing
in no time. URL:http://compilers.iecc.com/crenshaw/

--
Neil Cerutti
From: Patrick Maupin on
On Apr 8, 5:13 am, Nobody <nob...(a)nowhere.com> wrote:
> On Wed, 07 Apr 2010 18:25:36 -0700, Patrick Maupin wrote:
> >> Regular expressions != Parsers
>
> > True, but lots of parsers *use* regular expressions in their
> > tokenizers.  In fact, if you have a pure Python parser, you can often
> > get huge performance gains by rearranging your code slightly so that
> > you can use regular expressions in your tokenizer, because that
> > effectively gives you access to a fast, specialized C library that is
> > built into practically every Python interpreter on the planet.

> Unfortunately, a typical regexp library (including Python's) doesn't allow
> you to match against a set of regexps, returning the index of which one
> matched. Which is what you really want for a tokeniser.

Actually, there is some not-very-well-documented code in the re module
that will let you do exactly that. But even without that code, a first
cut that uses re.split() or re.finditer() to break the string apart
into tokens (without yet classifying them) is usually a huge
performance win over anything else in the standard library (or that
you could write in pure Python) for this task.
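
(The undocumented re.Scanner class is one example of such code, whether or
not it is the one meant here. A rough usage sketch, with made-up token
names:)

import re

scanner = re.Scanner([
    (r'[0-9]+',        lambda s, tok: ('INT', tok)),
    (r'[A-Za-z_]\w*',  lambda s, tok: ('NAME', tok)),
    (r'[-+*/=()]',     lambda s, tok: ('OP', tok)),
    (r'\s+',           None),                # skip whitespace
])

tokens, remainder = scanner.scan('x = 40 + 2')
print tokens      # [('NAME', 'x'), ('OP', '='), ('INT', '40'), ('OP', '+'), ('INT', '2')]
print remainder   # '' - the whole string was consumed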

> >> Every time someone tries to parse nested structures using regular
> >> expressions, Jamie Zawinski kills a puppy.
>
> > And yet, if you are parsing stuff in Python, and your parser doesn't
> > use some specialized C code for tokenization (which will probably be
> > regular expressions unless you are using mxtexttools or some other
> > specialized C tokenizer code), your nested structure parser will be
> > dog slow.
>
> The point is that you *cannot* match arbitrarily-nested expressions using
> regexps. You could, in theory, write a regexp which will match any valid
> syntax up to N levels of nesting, for any finite N. But in practice, the
> regexp is going to look horrible (and is probably going to be quite
> inefficient if the regexp library uses backtracking rather than a DFA).

Trust me, I already knew that. But what you just wrote is a much more
useful thing to tell the OP than "Every time someone tries to parse
nested structures using regular expressions, Jamie Zawinski kills a
puppy" which is what I was responding to. And right after
regurgitating that inside joke, Chris Rebert then went on to say "Try
using an *actual* parser, such as Pyparsing". Which is all well and
good, except then the OP will download pyparsing, take a look, realize
that it uses regexps under the hood, and possibly be very confused.

> Even tokenising with Python's regexp interface is inefficient if the
> number of token types is large, as you have to test against each regexp
> sequentially.

It's not that bad if you do it right. You can first rip things apart,
then use a dict-based scheme to categorize them.
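
Roughly like this sketch (the splitting pattern and the categories are only
illustrative, not anything lifted from RSON):

import re

SPLITTER = re.compile(r'(\s+|[{}\[\]:,])')   # one coarse pass rips the text apart

FIXED = {'{': 'LBRACE', '}': 'RBRACE', '[': 'LBRACKET', ']': 'RBRACKET',
         ':': 'COLON', ',': 'COMMA',
         'true': 'BOOL', 'false': 'BOOL', 'null': 'NULL'}

def tokenize(text):
    for chunk in SPLITTER.split(text):
        if not chunk or chunk.isspace():
            continue                         # drop empty pieces and whitespace runs
        kind = FIXED.get(chunk)              # dict lookup classifies the fixed tokens
        if kind is None:                     # cheap tests cover what's left
            kind = 'NUMBER' if chunk[0] in '-0123456789' else 'STRING'
        yield kind, chunk

print list(tokenize('[1, {x: true}]'))
# [('LBRACKET', '['), ('NUMBER', '1'), ('COMMA', ','), ('LBRACE', '{'),
#  ('STRING', 'x'), ('COLON', ':'), ('BOOL', 'true'), ('RBRACE', '}'),
#  ('RBRACKET', ']')]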

> Ultimately, if you want an efficient parser, you need something with a C
> component, e.g. Plex.

There is no doubt that you can get better performance with C than with
Python. But, for a lot of tasks, the Python performance is
acceptable, and, as always, algorithm, algorithm, algorithm...

A case in point: my pure Python RSON parser is faster on my computer,
on a real-world dataset of JSON data, than the json decoder that comes
with Python 2.6, *even with* the json decoder's C speedups enabled.

Having said that, the pure Python simplejson parser currently in
subversion is slightly faster than my RSON parser, and the C
reimplementation of the parser in simplejson's current subversion tree
completely blows the doors off my RSON parser.

So, a naive translation to C, even by an experienced programmer, may
not do as much for you as an algorithm rework.

Regards,
Pat