From: Nobody on 8 Apr 2010 06:13

On Wed, 07 Apr 2010 18:25:36 -0700, Patrick Maupin wrote:

>> Regular expressions != Parsers
>
> True, but lots of parsers *use* regular expressions in their
> tokenizers. In fact, if you have a pure Python parser, you can often
> get huge performance gains by rearranging your code slightly so that
> you can use regular expressions in your tokenizer, because that
> effectively gives you access to a fast, specialized C library that is
> built into practically every Python interpreter on the planet.

Unfortunately, a typical regexp library (including Python's) doesn't allow
you to match against a set of regexps, returning the index of which one
matched. Which is what you really want for a tokeniser.

>> Every time someone tries to parse nested structures using regular
>> expressions, Jamie Zawinski kills a puppy.
>
> And yet, if you are parsing stuff in Python, and your parser doesn't
> use some specialized C code for tokenization (which will probably be
> regular expressions unless you are using mxtexttools or some other
> specialized C tokenizer code), your nested structure parser will be
> dog slow.

The point is that you *cannot* match arbitrarily-nested expressions using
regexps. You could, in theory, write a regexp which will match any valid
syntax up to N levels of nesting, for any finite N. But in practice, the
regexp is going to look horrible (and is probably going to be quite
inefficient if the regexp library uses backtracking rather than a DFA).

Even tokenising with Python's regexp interface is inefficient if the
number of token types is large, as you have to test against each regexp
sequentially.

Ultimately, if you want an efficient parser, you need something with a C
component, e.g. Plex.
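To make the "test against each regexp sequentially" complaint concrete, here
is a minimal sketch (not from the original thread) of the kind of tokenizer
Nobody is describing; the token names and patterns are purely illustrative:

import re

# A purely illustrative token set: each pattern is tried in turn at the
# current position, so the cost of matching grows with the number of
# token types.
TOKEN_SPECS = [
    ("NUMBER", re.compile(r"[0-9]+")),
    ("NAME",   re.compile(r"[A-Za-z_][A-Za-z0-9_]*")),
    ("OP",     re.compile(r"[-+*/=()]")),
    ("SKIP",   re.compile(r"[ \t]+")),
]

def tokenize(text):
    pos = 0
    while pos < len(text):
        for name, pattern in TOKEN_SPECS:   # one regexp attempt per token type
            m = pattern.match(text, pos)
            if m:
                if name != "SKIP":
                    yield name, m.group()
                pos = m.end()
                break
        else:
            raise SyntaxError("unexpected character %r at %d" % (text[pos], pos))

print(list(tokenize("x1 = (12 + y) * 3")))

Every failed alternative costs a separate C-level match attempt at every
position, which is the per-token-type overhead being complained about above.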
From: Richard Lamboj on 8 Apr 2010 07:56

At the moment I have little time, so it is painful to read about parsing,
but it is quite interesting. I have taken a look at the different parsing
modules and I am reading the source code to understand how they work.
Since yesterday I have been writing my own small engine - just for fun and
to understand how I can get what I need. It seems that this is hard to
code, because the logic is complex and sometimes confusing. It is not easy
to find a "perfect" solution.

If someone knows good links on this topic, or can explain how parsers
should/could work, please post or explain them.

Thanks for the information and the help!

Kind regards

Richi
From: Charles on 8 Apr 2010 08:02

"Nobody" <nobody(a)nowhere.com> wrote in message
news:pan.2010.04.08.10.12.59.594000(a)nowhere.com...
> On Wed, 07 Apr 2010 18:25:36 -0700, Patrick Maupin wrote:
>
>>> Regular expressions != Parsers
>>
>> True, but lots of parsers *use* regular expressions in their
>> tokenizers. In fact, if you have a pure Python parser, you can often
>> get huge performance gains by rearranging your code slightly so that
>> you can use regular expressions in your tokenizer, because that
>> effectively gives you access to a fast, specialized C library that is
>> built into practically every Python interpreter on the planet.
>
> Unfortunately, a typical regexp library (including Python's) doesn't allow
> you to match against a set of regexps, returning the index of which one
> matched. Which is what you really want for a tokeniser.
>
[snip]

Really!? I am only a Python newbie, but what about ...

import re

rr = [ ( "id",    '([a-zA-Z][a-zA-Z0-9]*)' ),
       ( "int",   '([+-]?[0-9]+)' ),
       ( "float", '([+-]?[0-9]+\.[0-9]*)' ),
       ( "float", '([+-]?[0-9]+\.[0-9]*[eE][+-]?[0-9]+)' ) ]

tlist = [ t[0] for t in rr ]
pat = '^ *(' + '|'.join([ t[1] for t in rr ]) + ') *$'
p = re.compile(pat)

ss = [ ' annc', '1234', 'abcd', ' 234sz ', '-1.24e3', '5.' ]
for s in ss:
    m = p.match(s)
    if m:
        ix = [ i-2 for i in range(2,6) if m.group(i) ]
        print "'"+s+"' matches and has type", tlist[ix[0]]
    else:
        print "'"+s+"' does not match"

output:

' annc' matches and has type id
'1234' matches and has type int
'abcd' matches and has type id
' 234sz ' does not match
'-1.24e3' matches and has type float
'5.' matches and has type float

seems to me to match a (small) set of regular expressions and indirectly
return the index of the matched expression, without doing a sequential
loop over the regular expressions.

Of course there is a loop over the results of the match to determine which
sub-expression matched, but a good regexp library (which I presume Python
has) should match the sub-expressions without looping over them. The
techniques to do this were well known in the 1970's when the first
versions of lex were written.

Not that I would recommend tricks like this. The regular expression would
quickly get out of hand for any non-trivial list of regular expressions to
match.

Charles
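A closely related variant (a sketch added here, not part of Charles's post):
if each alternative gets a named group, the match object's lastgroup
attribute reports which alternative matched, so the manual scan over group
indices disappears. The two "float" patterns are merged into one below, an
assumption made only so that the group names stay unique:

import re

# Same illustrative token patterns; the wrapping group is non-capturing so
# that m.lastgroup names the alternative that actually matched.
rr = [
    ("id",    r"[a-zA-Z][a-zA-Z0-9]*"),
    ("int",   r"[+-]?[0-9]+"),
    ("float", r"[+-]?[0-9]+\.[0-9]*(?:[eE][+-]?[0-9]+)?"),
]
pat = "^ *(?:" + "|".join("(?P<%s>%s)" % (name, rx) for name, rx in rr) + ") *$"
p = re.compile(pat)

for s in [" annc", "1234", "abcd", " 234sz ", "-1.24e3", "5."]:
    m = p.match(s)
    if m:
        print("%r matches and has type %s" % (s, m.lastgroup))
    else:
        print("%r does not match" % s)

For Charles's sample strings this reports the same classifications as his
version; the same named-group idea also underlies common tokenizer recipes
for the re module.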
From: Neil Cerutti on 10 Apr 2010 11:39

On 2010-04-08, Richard Lamboj <richard.lamboj(a)bilcom.at> wrote:
> If someone knows good links on this topic, or can explain how
> parsers should/could work, please post or explain them.
>
> Thanks for the information and the help!

I liked Crenshaw's "Let's Build a Compiler!". It's pretty trivial to
convert his Pascal to Python, and you'll get to basic parsing in no time.

http://compilers.iecc.com/crenshaw/

--
Neil Cerutti
From: Patrick Maupin on 10 Apr 2010 12:21

On Apr 8, 5:13 am, Nobody <nob...(a)nowhere.com> wrote:
> On Wed, 07 Apr 2010 18:25:36 -0700, Patrick Maupin wrote:
> >> Regular expressions != Parsers
>
> > True, but lots of parsers *use* regular expressions in their
> > tokenizers. In fact, if you have a pure Python parser, you can often
> > get huge performance gains by rearranging your code slightly so that
> > you can use regular expressions in your tokenizer, because that
> > effectively gives you access to a fast, specialized C library that is
> > built into practically every Python interpreter on the planet.
>
> Unfortunately, a typical regexp library (including Python's) doesn't allow
> you to match against a set of regexps, returning the index of which one
> matched. Which is what you really want for a tokeniser.

Actually, there is some not very-well-documented code in the re module
that will let you do exactly that. But even not using that code, using a
first cut of re.split() or re.finditer() to break the string apart into
tokens (without yet classifying them) is usually a huge performance win
over anything else in the standard library (or that you could write in
pure Python) for this task.

> >> Every time someone tries to parse nested structures using regular
> >> expressions, Jamie Zawinski kills a puppy.
>
> > And yet, if you are parsing stuff in Python, and your parser doesn't
> > use some specialized C code for tokenization (which will probably be
> > regular expressions unless you are using mxtexttools or some other
> > specialized C tokenizer code), your nested structure parser will be
> > dog slow.
>
> The point is that you *cannot* match arbitrarily-nested expressions using
> regexps. You could, in theory, write a regexp which will match any valid
> syntax up to N levels of nesting, for any finite N. But in practice, the
> regexp is going to look horrible (and is probably going to be quite
> inefficient if the regexp library uses backtracking rather than a DFA).

Trust me, I already knew that. But what you just wrote is a much more
useful thing to tell the OP than "Every time someone tries to parse nested
structures using regular expressions, Jamie Zawinski kills a puppy", which
is what I was responding to.

And right after regurgitating that inside joke, Chris Rebert then went on
to say "Try using an *actual* parser, such as Pyparsing". Which is all
well and good, except then the OP will download pyparsing, take a look,
realize that it uses regexps under the hood, and possibly be very
confused.

> Even tokenising with Python's regexp interface is inefficient if the
> number of token types is large, as you have to test against each regexp
> sequentially.

It's not that bad if you do it right. You can first rip things apart,
then use a dict-based scheme to categorize them.

> Ultimately, if you want an efficient parser, you need something with a C
> component, e.g. Plex.

There is no doubt that you can get better performance with C than with
Python. But, for a lot of tasks, the Python performance is acceptable,
and, as always, algorithm, algorithm, algorithm...

A case in point. My pure Python RSON parser is faster on my computer on a
real-world dataset of JSON data than the json decoder that comes with
Python 2.6, *even with* the json decoder's C speedups enabled.
Having said that, the current subversion pure Python simplejson parser is
slightly faster than my RSON parser, and the C reimplementation of the
parser in current subversion simplejson completely blows the doors off my
RSON parser. So, a naive translation to C, even by an experienced
programmer, may not do as much for you as an algorithm rework.

Regards,
Pat
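Patrick does not show code, but a rough sketch of the two-pass approach he
outlines (one combined regexp pass to rip the text into lexemes, then
dict/set lookups to classify them) might look like the following; the
pattern, keyword set, and token names are invented for illustration, and the
undocumented re.Scanner class may be the "not very-well-documented code in
the re module" he alludes to, though he does not name it:

import re

# One combined regexp rips the text into raw lexemes in a single C-level
# pass; nothing is classified at this point.
split_tokens = re.compile(
    r"[A-Za-z_][A-Za-z0-9_]*"               # identifiers and keywords
    r"|[0-9]+\.[0-9]*(?:[eE][+-]?[0-9]+)?"  # floats
    r"|[0-9]+"                              # integers
    r"|\S"                                  # any other single non-space character
).findall

KEYWORDS = set(["if", "else", "while", "return"])
OPERATORS = {"+": "PLUS", "-": "MINUS", "*": "STAR", "=": "EQUALS"}

def classify(lexeme):
    # dict/set lookups categorize most lexemes without further regexp work
    if lexeme in KEYWORDS:
        return "KEYWORD"
    if lexeme in OPERATORS:
        return OPERATORS[lexeme]
    if lexeme[0].isdigit():
        return "NUMBER"
    if lexeme[0].isalpha() or lexeme[0] == "_":
        return "NAME"
    return "OTHER"

tokens = [(classify(tok), tok) for tok in split_tokens("while x1 = 2.5e3 + y")]
print(tokens)

The point of the split is that the expensive character-by-character scanning
happens once, inside the regexp engine's C code, while the pure-Python
classification step is mostly constant-time dictionary lookups rather than
another round of regexp matching per token type.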