From: Nathan Harmston on
Hi everyone,

I have a slightly complicated/medium sized regular expression and I
want to generate all possible words that it can match (to compare
performance of regex against an acora based matcher). Using the
regular expression as a grammar to generate all words in its language.
I was wondering if this possible in Python or possible using anything.
Google doesnt seem to give any obvious answers.

Many thanks in advance,

Nathan
From: Gabriel Genellina on
En Wed, 31 Mar 2010 07:49:14 -0300, Nathan Harmston
<iwanttobeabadger(a)googlemail.com> escribi�:

> I have a slightly complicated/medium sized regular expression and I
> want to generate all possible words that it can match (to compare
> performance of regex against an acora based matcher). Using the
> regular expression as a grammar to generate all words in its language.
> I was wondering if this possible in Python or possible using anything.
> Google doesnt seem to give any obvious answers.

I've done this some time ago.
This recipe
http://code.activestate.com/recipes/577041-merge-multiple-potentially-infinite-sorted-inputs-/
provides an infinite merge operation, required for effectively enumerating
a regular language (with shorter strings first, else 'a*b' would get stuck
repeating "a"s and never yielding any "b").
The example at the end shows the basics of how to use it (by hand) in a
very simple case.
To enumerate all strings matching '(a|bc)*' one should invoke
closure(product(['a'],['b','c'])), which gives:

'', 'a', 'aa', 'bc',
'aaa', 'abc', 'bca',
'aaaa', 'aabc', 'abca', 'bcaa', 'bcbc',
'aaaaa', 'aaabc', 'aabca', 'abcaa', 'abcbc', 'bcaaa', 'bcabc', 'bcbca',
....

Converting the former expression into the later function calls requires a
parser (not shown -- and I won't be able to find it until Friday)

--
Gabriel Genellina

From: Tim Chase on
Nathan Harmston wrote:
> I have a slightly complicated/medium sized regular expression
> and I want to generate all possible words that it can match
> (to compare performance of regex against an acora based
> matcher). Using the regular expression as a grammar to
> generate all words in its language. I was wondering if this
> possible in Python or possible using anything. Google doesnt
> seem to give any obvious answers.

Unless you limit your regexp to bounded regular expressions (you
don't use the "*", "+" and unbounded-top-end "{...}" tokens), it
has an infinite number of possible words that can match. As a
simple example, the regexp "a*" matches "" (the empty string),
"a", "aa", "aaa", "aaaa"...and so on up to the limit of a string
length.

There was a discussion on this a while back:

http://mail.python.org/pipermail/python-list/2010-February/1235120.html

so you might chat with the person who started that thread. You
basically have to write a regexp parser that generates code to
generate the expressions, and even simple expressions such as
"[a-z]{1,30}" can create an astronomical number of possible inputs.

-tkc



From: Grant Edwards on
On 2010-03-31, Nathan Harmston <iwanttobeabadger(a)googlemail.com> wrote:

> I have a slightly complicated/medium sized regular expression and I
> want to generate all possible words that it can match
>
> I was wondering if this possible in Python or possible using
> anything. Google doesnt seem to give any obvious answers.

We did this one a couple weeks ago.

It's not possible in the general case (there are an infinite number of
matching words for many/most regular expressions).

--
Grant Edwards grant.b.edwards Yow! ... the MYSTERIANS are
at in here with my CORDUROY
gmail.com SOAP DISH!!
From: Paul McGuire on
On Mar 31, 5:49 am, Nathan Harmston <iwanttobeabad...(a)googlemail.com>
wrote:
> Hi everyone,
>
> I have a slightly complicated/medium sized regular expression and I
> want to generate all possible words that it can match (to compare
> performance of regex against an acora based matcher).

The pyparsing wiki Examples page includes this regex inverter:
http://pyparsing.wikispaces.com/file/view/invRegex.py

From the module header:
# Supports:
# - {n} and {m,n} repetition, but not unbounded + or * repetition
# - ? optional elements
# - [] character ranges
# - () grouping
# - | alternation

-- Paul