How to print all expressions that match a regular expression [Python]

Prev: intolerant HTML parser
Next: how to fix bugs (python)

From: hzhuo1 on 6 Feb 2010 23:16

>
> So it seems we both misunderstood the problem.
>
> I didn't read the top level article until now, and reading it, I can't make
> sense of it.
>

Seems that you should read the whole thing before making a post, or
else you cannot know what we are talking about.
Steven doesn't misunderstand me. We are talking about what I need, and
he tries to help.

>
> > "Given the function hashlib.sha256, enumerate all the possible inputs
> > that give the hexadecimal result
> > 0a2591aaf3340ad92faecbc5908e74d04b51ee5d2deee78f089f1607570e2e91."
>
> I tried some "parrot" variants but no dice. :-(
>
> [snip]
>

This is a hash collision problem. Nobody has proved that SHA-256 is
collision free, even not in the random oracle model, because people
always suppose that a random oracle exists, and make hash function its
substitution. That means it may be broken someday. And any provable
security based on random oracle model is not secure.

> > I'm suggesting that, in general, there's no way to tell in advance which
> > regexes will be easy and which will be hard, and even when they are easy,
> > the enumeration will often be infinite.

It is hard to tell in advance. However, we can add some timing limit
or counting limit, to make it an algorithm, which can halt. For
example, whenever the program outputs more than 1000000 expressions
that match the input regex, we can halt because that exceeds our
limit. But surely this is not efficient because of the post-decision.

>
> Essentially, any regexp that includes '+' or '*' (directly or via e.g. notation
> that denotes "digit sequence") yields an infinite number of strings.

Infinity is really relative, not absolute. It is relative to the
computing speed. For example, the regex '^[0|1]{2048}$' is rather
simple and doesn't contain '+' or '$', but trying to output all
expressions that match it has a complexity of 2^2048. If we can do
that, then we can break RSA-2048.
We must face the reality .

Zhuo

From: Alf P. Steinbach on 6 Feb 2010 23:38

* hzhuo1(a)gmail.com:
>> So it seems we both misunderstood the problem.
>>
>> I didn't read the top level article until now, and reading it, I can't make
>> sense of it.
>>
>
> [1] Seems that you should read the whole thing before making a post, or
> else you cannot know what we are talking about.
> Steven doesn't misunderstand me. We are talking about what I need, and
> he tries to help.

If you were not misunderstood then you've posted a question for which there's no
practical solution.

>>> "Given the function hashlib.sha256, enumerate all the possible inputs
>>> that give the hexadecimal result
>>> 0a2591aaf3340ad92faecbc5908e74d04b51ee5d2deee78f089f1607570e2e91."
>> I tried some "parrot" variants but no dice. :-(
>>
>> [snip]
>>
>
> This is a hash collision problem. Nobody has proved that SHA-256 is
> collision free, even not in the random oracle model, because people
> always suppose that a random oracle exists, and make hash function its
> substitution. That means it may be broken someday. And any provable
> security based on random oracle model is not secure.

Stephen's little challenge wasn't about breaking SHA-256 but about guessing his
secret phrase, given his clues.

>>> I'm suggesting that, in general, there's no way to tell in advance which
>>> regexes will be easy and which will be hard, and even when they are easy,
>>> the enumeration will often be infinite.
>
> It is hard to tell in advance.

No, it's trivial.

> [2] However, we can add some timing limit
> or counting limit, to make it an algorithm, which can halt. For
> example, whenever the program outputs more than 1000000 expressions
> that match the input regex, we can halt because that exceeds our
> limit. But surely this is not efficient because of the post-decision.

You don't need to wait for that output to complete. You can calculate the number
of strings up front. Like it appears that you do below:

>> Essentially, any regexp that includes '+' or '*' (directly or via e.g. notation
>> that denotes "digit sequence") yields an infinite number of strings.
>
> Infinity is really relative, not absolute. It is relative to the
> computing speed. For example, the regex '^[0|1]{2048}$' is rather
> simple and doesn't contain '+' or '$', but trying to output all
> expressions that match it has a complexity of 2^2048. If we can do
> that, then we can break RSA-2048.
> We must face the reality .

Here it seems that you have no problem calculating number of combinations, yet
above, at the paragraph marked [2], you talk about waiting for a million strings
to be output before seeing that it's too much, and earlier, at the paragraph
marked [1], you maintain that your original question about generating all such
strings (completely impractical) was what you wanted help with?

I'm sorry but I can't make sense of this; it appears to be meaningless.

Perhaps if you tried to clarify the requirements a bit.

Cheers & hth.,

- Alf

From: Nobody on 7 Feb 2010 00:09

On Sun, 07 Feb 2010 00:26:36 +0000, Steven D'Aprano wrote:

>> So there isn't such a routine just because some of the regular
>> expressions cannot be enumerated.

No. There isn't a routine because no-one has yet felt any need to write
one.

>> However, some of them can be
>> enumerated. I guess I have to write a function myself.
>
> How do you expect to tell the ones that can be enumerated apart from
> those that can't be?

The ones which use the '+', '*' and '{m,}' operators match an infinite
number of strings; the rest can only match a finite number (assuming POSIX
REs; Python also has +? and *?).

["Enumerate" isn't the correct word here. You can *enumerate* an
infinite set, in the sense that you could write a Python generator for
which any member will eventually be generated.]

The obvious implementation is to construct the NFA then "run" it.

If you know that the RE can only match finite strings (i.e. the graph is
acyclic), then you can enumerate them using depth-first traversal. If it
can match infinite strings (i.e. if there are any cycles in the graph),
then you would need to use either breadth-first traversal or
incrementally-bounded depth-first traversal.

> [Aside: Python regexes aren't Turing Complete. I'm not sure about Perl
> regexes. Either way, this might actually be less difficult than the
> Halting Problem, as in "amazingly difficult" rather than "impossible".]

"Regular expressions" aren't Turing complete; this is implicit in the
definition of "regular". The Chomsky hierarchy has four levels, with
higher levels require a more capable system to decide whether a string is
a member of the language defined by the grammar:

grammar decidable by

regular finite automaton
context-free pushdown automaton[1]
context-sensitive linear-bounded automaton[2]
recursively-enumerable Turing machine

However, any "regular expression" syntax which allows backreferences
(including the POSIX specification) isn't actually "regular" in the formal
sense (as it requires an infinite number of states), but context-free.

[1] pushdown automaton = finite automaton with a stack

[2] linear-bounded automaton = Turing machine, except that it's "tape" is
finite and proportional to the size of the input.

http://en.wikipedia.org/wiki/Chomsky_hierarchy

From: Steve Holden on 7 Feb 2010 08:07

hzhuo1(a)gmail.com wrote:
>> So it seems we both misunderstood the problem.
>>
>> I didn't read the top level article until now, and reading it, I can't make
>> sense of it.
>>
>
> Seems that you should read the whole thing before making a post, or
> else you cannot know what we are talking about.
> Steven doesn't misunderstand me. We are talking about what I need, and
> he tries to help.
>
>
>
>>> "Given the function hashlib.sha256, enumerate all the possible inputs
>>> that give the hexadecimal result
>>> 0a2591aaf3340ad92faecbc5908e74d04b51ee5d2deee78f089f1607570e2e91."
>> I tried some "parrot" variants but no dice. :-(
>>
>> [snip]
>>
>
> This is a hash collision problem. Nobody has proved that SHA-256 is
> collision free, even not in the random oracle model, because people
> always suppose that a random oracle exists, and make hash function its
> substitution. That means it may be broken someday. And any provable
> security based on random oracle model is not secure.
>
It's very easy to prove that no hash function is collision-free, since
the domain (all possible inputs) is much larger than the range (all
possible outputs). Hence there must be many inputs that map to the same
output.

A *good* hash function is unpredictable enough to make finding two
colliding strings impractical - and even the best hash functions that
cryptographers could devise at the time have been broken. We should
remember that "broken" to a cryptographer means something rather
different than it does in common usage, so a broken scheme need not
necessarily be dropped immediately - one would just stop using it in new
systems.

>
>>> I'm suggesting that, in general, there's no way to tell in advance which
>>> regexes will be easy and which will be hard, and even when they are easy,
>>> the enumeration will often be infinite.
>
> It is hard to tell in advance. However, we can add some timing limit
> or counting limit, to make it an algorithm, which can halt. For
> example, whenever the program outputs more than 1000000 expressions
> that match the input regex, we can halt because that exceeds our
> limit. But surely this is not efficient because of the post-decision.
>
>> Essentially, any regexp that includes '+' or '*' (directly or via e.g. notation
>> that denotes "digit sequence") yields an infinite number of strings.
>
> Infinity is really relative, not absolute. It is relative to the
> computing speed. For example, the regex '^[0|1]{2048}$' is rather
> simple and doesn't contain '+' or '$', but trying to output all
> expressions that match it has a complexity of 2^2048. If we can do
> that, then we can break RSA-2048.
> We must face the reality .
>
I have always understood that there's a pretty real distinction between
"finite" and "infinite". Are you telling me I am wrong, or are you
merely saying that some finite cases might just as well be infinite for
practical purposes?

And I really don't see how simple enumeration of range(2^2048) breaks
RSA-2048, since that problem requires you to find two factors which,
when multiplied together, give that specific value.

regards
Steve
--
Steve Holden +1 571 484 6266 +1 800 494 3119
PyCon is coming! Atlanta, Feb 2010 http://us.pycon.org/
Holden Web LLC http://www.holdenweb.com/
UPCOMING EVENTS: http://holdenweb.eventbrite.com/

From: Tim Chase on 7 Feb 2010 08:47

hzhuo1(a)gmail.com wrote:
>>> "Given the function hashlib.sha256, enumerate all the possible inputs
>>> that give the hexadecimal result
>>> 0a2591aaf3340ad92faecbc5908e74d04b51ee5d2deee78f089f1607570e2e91."
>
> This is a hash collision problem. Nobody has proved that SHA-256 is
> collision free

It's actually pretty easy to prove that it is *not* collision
free. The SHA-256 encodes 512 bits of data. So the the process
of encoding (2**512)+1 distinct inputs incurs a collision in
SHA-256 space as soon as you've hit (2**512)+1 if not earlier.

to start you off:

sha_backmap = {}
for i in xrange((2**512)+2):
hash = sha(str(i))
if hash in sha_backmap:
print "Collision found: %i and %i" % (
i, sha_backmap[hash])
break
sha_backmap[hash] = i

Though it might take a computer the size of the universe, so I'm
guessing that the first collision encountered is with "42". I
leave the actual calculation and hashing of all possible
combinations of 513 bits of data as an exercise to the reader
with a lot of time on their hands or a quantum computer under
their desk ;-)

> It is hard to tell in advance. However, we can add some timing limit
> or counting limit, to make it an algorithm, which can halt. For
> example, whenever the program outputs more than 1000000 expressions
> that match the input regex, we can halt because that exceeds our
> limit. But surely this is not efficient because of the post-decision.

As mentioned, it sounds like you either want a depth-first of the
solution space that raises exceptions on an infinite/unbounded
operator ("*", "+", and "{N,}" as mentioned in another email), or
if you want to handle those operators, do a breadth-first search
of the solution-space and track your depth (or time taken, or
previous number of multi-factor atoms if you desire) to ensure
you don't exceed a certain depth. But you're still talking a
combinatorial number of solutions for even simple regexps.

-tkc

First | Prev | Next | Last
Pages: 1 2 3 4 5
Prev: intolerant HTML parser
Next: how to fix bugs (python)