re.sub unexpected behaviour [Python]

From: Javier Collado on 6 Jul 2010 13:10

Hello,

Let's imagine that we have a simple function that generates a
replacement for a regular expression:

def process(match):
    return match.string

If we use that simple function with re.sub using a simple pattern and
a string we get the expected output:
re.sub('123', process, '123')
'123'

However, if the string passed to re.sub contains a trailing new line
character, then we get an extra new line character unexpectedly:
re.sub(r'123', process, '123\n')
'123\n\n'

If we try to get the same result using a replacement string, instead
of a function, the strange behaviour cannot be reproduced:
re.sub(r'123', '123', '123')
'123'

re.sub('123', '123', '123\n')
'123\n'

Is there any explanation for this? If I'm skipping something when
using a replacement function with re.sub, please let me know.

Best regards,
Javier

From: Steven D'Aprano on 6 Jul 2010 13:32

On Tue, 06 Jul 2010 19:10:17 +0200, Javier Collado wrote:

> Hello,
>
> Let's imagine that we have a simple function that generates a
> replacement for a regular expression:
>
> def process(match):
>     return match.string
>
> If we use that simple function with re.sub using a simple pattern and a
> string we get the expected output:
> re.sub('123', process, '123')
> '123'
>
> However, if the string passed to re.sub contains a trailing new line
> character, then we get an extra new line character unexpectedly:
> re.sub(r'123', process, '123\n')
> '123\n\n'

I don't know why you say it is unexpected. The regex "123" matched the
first three characters of "123\n". Those three characters are replaced by
a copy of the string you are searching, "123\n", which gives "123\n\n",
exactly as expected.

Perhaps these examples might help:

>>> re.sub('W', process, 'Hello World')
'Hello Hello Worldorld'
>>> re.sub('o', process, 'Hello World')
'HellHello World WHello Worldrld'

Here's a simplified pure-Python equivalent of what you are doing:

def replace_with_match_string(target, s):
    n = s.find(target)
    if n != -1:
        s = s[:n] + s + s[n+len(target):]
    return s
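
As a quick check of that equivalence, the sketch below runs the helper
alongside re.sub on inputs where the target occurs only once. The helper
replaces just the first occurrence, so it would diverge from re.sub on
repeated matches. The variable names are illustrative only.

```python
import re

def process(match):
    # match.string is the whole string that was searched, not the matched text
    return match.string

def replace_with_match_string(target, s):
    # Replace the first occurrence of target with a copy of the whole string
    n = s.find(target)
    if n != -1:
        s = s[:n] + s + s[n + len(target):]
    return s

# Single-occurrence inputs: both approaches give the same result
helper_newline = replace_with_match_string('123', '123\n')
resub_newline = re.sub('123', process, '123\n')

helper_hello = replace_with_match_string('W', 'Hello World')
resub_hello = re.sub('W', process, 'Hello World')
```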

> If we try to get the same result using a replacement string, instead of
> a function, the strange behaviour cannot be reproduced:
> re.sub(r'123', '123', '123')
> '123'
>
> re.sub('123', '123', '123\n')
> '123\n'

The regex "123" matches the first three characters of "123\n", which are
then replaced by "123", giving "123\n", exactly as expected.

>>> re.sub("o", "123", "Hello World")
'Hell123 W123rld'

--
Steven

From: Thomas Jollans on 6 Jul 2010 13:32

On 07/06/2010 07:10 PM, Javier Collado wrote:
> Hello,
>
> Let's imagine that we have a simple function that generates a
> replacement for a regular expression:
>
> def process(match):
>     return match.string
>
> If we use that simple function with re.sub using a simple pattern and
> a string we get the expected output:
> re.sub('123', process, '123')
> '123'
>
> However, if the string passed to re.sub contains a trailing new line
> character, then we get an extra new line character unexpectedly:
> re.sub(r'123', process, '123\n')
> '123\n\n'

process returns match.string, which is, according to the docs:

"""The string passed to match() or search()"""

You passed "123\n" to sub(), which may not be explicitly listed there,
but it makes no difference: process correctly returns "123\n", which is
inserted in place of the match. Let me demonstrate again with a longer
string:

>>> import re
>>> def process(match):
...     return match.string
...
>>> re.sub(r'\d+', process, "start,123,end")
'start,start,123,end,end'
>>>

>
> If we try to get the same result using a replacement string, instead
> of a function, the strange behaviour cannot be reproduced:
> re.sub(r'123', '123', '123')
> '123'
>
> re.sub('123', '123', '123\n')
> '123\n'

Again, the behaviour is correct: you're not asking for "whatever was
passed to sub()", but for '123', and that's what you're getting.

>
> Is there any explanation for this? If I'm skipping something when
> using a replacement function with re.sub, please let me know.

What you want is grouping:

>>> def process(match):
...     return "<<" + match.group(1) + ">>"
...
>>> re.sub(r'(\d+)', process, "start,123,end")
'start,<<123>>,end'
>>>

or better, without a function:

>>> re.sub(r'(\d+)', r'<<\1>>', "start,123,end")
'start,<<123>>,end'
>>>

Cheers,
Thomas

From: Javier Collado on 6 Jul 2010 13:58

Thanks for your answers. They helped me realize that I was
mistakenly using match.string (the whole string) when I should have been
using match.group(0) (the whole match).
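
For the record, a minimal sketch of the corrected version, assuming the
goal is simply to return the matched text unchanged. It also shows
\g<0>, the replacement-template reference to the whole match, which
avoids both the function and the capturing group. The variable names
are only for illustration.

```python
import re

def process(match):
    # group(0) is the text the pattern matched; match.string would be the
    # entire input string passed to re.sub
    return match.group(0)

# The trailing newline is no longer duplicated
with_function = re.sub(r'123', process, '123\n')

# Same idea with a template: \g<0> stands for the whole match
with_template = re.sub(r'\d+', r'<<\g<0>>>', 'start,123,end')
```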

Best regards,
Javier
