Raw string substitution problem [Python]


From: Alan G Isaac on 17 Dec 2009 10:08

> On Wed, 16 Dec 2009 11:09:32 -0300, Ed Keith <e_d_k(a)yahoo.com> wrote:
>
>> I am having a problem when substituting a raw string. When I do the
>> following:
>>
>> re.sub('abc', r'a\nb\nc', '123abcdefg')
>>
>> I get
>>
>> """
>> 123a
>> b
>> cdefg
>> """
>>
>> what I want is
>>
>> r'123a\nb\ncdefg'

On 12/16/2009 9:35 AM, Gabriel Genellina wrote:
> From http://docs.python.org/library/re.html#re.sub
>
> re.sub(pattern, repl, string[, count])
>
> ...repl can be a string or a function; if
> it is a string, any backslash escapes in
> it are processed. That is, \n is converted
> to a single newline character, \r is
> converted to a linefeed, and so forth.
>
> So you'll have to double your backslashes:

I'm not persuaded that the docs are clear. Consider:

>>> 'ab\\ncd' == r'ab\ncd'
True

Naturally enough. So I think the right answer is:

1. this is a documentation bug (i.e., the documentation
fails to specify unexpected behavior for raw strings), or
2. this is a bug (i.e., raw strings are not handled correctly
when used as replacements)

I vote for 2.

Peter's use of a function highlights just how odd this is:
getting the raw string via a function produces a different
result than providing it directly. If this is really the
way things ought to be, I'd appreciate a clear explanation
of why.

Alan Isaac
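
For readers following along, here is a minimal sketch of the two workarounds mentioned so far (doubling the backslashes, and passing a function as the replacement). It assumes Python 3 rather than the Python 2.6 used in the thread; the variable names are illustrative only.

```python
import re

text = '123abcdefg'

# A string replacement is treated as a template: each doubled backslash
# in the template becomes one literal backslash in the result.
doubled = re.sub('abc', r'a\\nb\\nc', text)

# The return value of a replacement function is inserted as-is,
# without any template processing.
via_function = re.sub('abc', lambda m: r'a\nb\nc', text)

# An arbitrary literal replacement can be escaped before use.
literal = r'a\nb\nc'
escaped = re.sub('abc', literal.replace('\\', '\\\\'), text)

print(doubled)       # 123a\nb\ncdefg (on one line, with literal backslashes)
print(doubled == via_function == escaped)  # True
```

The function form gives a different result from the plain string only because a string replacement goes through the template step and a function's return value does not.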

From: Richard Brodie on 17 Dec 2009 11:24

"Alan G Isaac" <alan.isaac(a)gmail.com> wrote in message
news:qemdnRUT0JvJ1LfWnZ2dnUVZ_vqdnZ2d(a)rcn.net...

> Naturally enough. So I think the right answer is:
>
> 1. this is a documentation bug (i.e., the documentation
> fails to specify unexpected behavior for raw strings), or
> 2. this is a bug (i.e., raw strings are not handled correctly
> when used as replacements)

<neo> There is no raw string. </neo>

A raw string is not a distinct type from an ordinary string
in the same way byte strings and Unicode strings are. It
is merely a notation for constants, like writing integers
in hexadecimal.

>>> (r'\n', u'a', 0x16)
('\\n', u'a', 22)
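
The same point can be checked directly. This is a small sketch assuming Python 3, where every string literal is a `str`; the names `raw`, `escaped` and `newline` are illustrative.

```python
raw = r'\n'       # raw notation: backslash followed by n
escaped = '\\n'   # ordinary notation for the same two characters
newline = '\n'    # a single newline character

# Both notations produce equal objects of the same type.
print(raw == escaped)                 # True
print(type(raw) is type(escaped))     # True
print(len(raw), len(newline))         # 2 1

# By the time re.sub sees its argument, nothing records which
# notation was used to write it in the source code.
print(repr(raw))                      # '\\n'
```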

From: Alan G Isaac on 17 Dec 2009 11:51

On 12/17/2009 11:24 AM, Richard Brodie wrote:
> A raw string is not a distinct type from an ordinary string
> in the same way byte strings and Unicode strings are. It
> is merely a notation for constants, like writing integers
> in hexadecimal.
>
>>>> (r'\n', u'a', 0x16)
> ('\\n', u'a', 22)

Yes, that was a mistake. But the problem remains::

>>> re.sub('abc', r'a\nb\n.c\a','123abcdefg') == re.sub('abc', 'a\\nb\\n.c\\a',' 123abcdefg') == re.sub('abc', 'a\nb\n.c\a','123abcdefg')
True
>>> r'a\nb\n.c\a' == 'a\\nb\\n.c\\a' == 'a\nb\n.c\a'
False

Why are the first two strings being treated as if they are the last one?
That is, why isn't '\\' being processed in the obvious way?
This still seems wrong. Why isn't it?

More simply, consider::

>>> re.sub('abc', '\\', '123abcdefg')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python26\lib\re.py", line 151, in sub
    return _compile(pattern, 0).sub(repl, string, count)
  File "C:\Python26\lib\re.py", line 273, in _subx
    template = _compile_repl(template, pattern)
  File "C:\Python26\lib\re.py", line 260, in _compile_repl
    raise error, v # invalid expression
sre_constants.error: bogus escape (end of line)

Why is this the proper handling of what one might think would be an
obvious substitution?

Thanks,
Alan Isaac

From: D'Arcy J.M. Cain on 17 Dec 2009 12:19

On Thu, 17 Dec 2009 11:51:26 -0500
Alan G Isaac <alan.isaac(a)gmail.com> wrote:
> >>> re.sub('abc', r'a\nb\n.c\a','123abcdefg') == re.sub('abc', 'a\\nb\\n.c\\a',' 123abcdefg') == re.sub('abc', 'a\nb\n.c\a','123abcdefg')
> True

Was this a straight cut and paste or did you make a manual change? Is
that leading space in the middle one a copying error? I get False for
what you actually have there for obvious reasons.

> >>> r'a\nb\n.c\a' == 'a\\nb\\n.c\\a' == 'a\nb\n.c\a'
> False
>
> Why are the first two strings being treated as if they are the last one?

They aren't. The last string is different.

>>> for x in (r'a\nb\n.c\a', 'a\\nb\\n.c\\a', 'a\nb\n.c\a'): print repr(x)
...
'a\\nb\\n.c\\a'
'a\\nb\\n.c\\a'
'a\nb\n.c\x07'

> That is, why isn't '\\' being processed in the obvious way?
> This still seems wrong. Why isn't it?

What do you think is wrong? What would the "obvious" way of handling
'\\' be?
>
> More simply, consider::
>
> >>> re.sub('abc', '\\', '123abcdefg')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "C:\Python26\lib\re.py", line 151, in sub
>     return _compile(pattern, 0).sub(repl, string, count)
>   File "C:\Python26\lib\re.py", line 273, in _subx
>     template = _compile_repl(template, pattern)
>   File "C:\Python26\lib\re.py", line 260, in _compile_repl
>     raise error, v # invalid expression
> sre_constants.error: bogus escape (end of line)
>
> Why is this the proper handling of what one might think would be an
> obvious substitution?

Is this what you want? What you have is a re expression consisting of
a single backslash that doesn't escape anything (EOL) so it barfs.

>>> re.sub('abc', r'\\', '123abcdefg')
'123\\defg'
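
A short sketch of both cases side by side, assuming Python 3 (where the exception is `re.error` and the message wording differs from the 2.6 traceback above):

```python
import re

# A replacement template that is a lone backslash has nothing to escape,
# so compiling the template fails.
try:
    re.sub('abc', '\\', '123abcdefg')
except re.error as exc:
    failed = True
    print('re.error:', exc)
else:
    failed = False

# Escaping the backslash inside the template yields one literal backslash.
fixed = re.sub('abc', r'\\', '123abcdefg')
print(fixed)         # 123\defg
print(repr(fixed))   # '123\\defg'
```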

--
D'Arcy J.M. Cain <darcy(a)druid.net> | Democracy is three wolves
http://www.druid.net/darcy/ | and a sheep voting on
+1 416 425 1212 (DoD#0082) (eNTP) | what's for dinner.

From: MRAB on 17 Dec 2009 12:38

Alan G Isaac wrote:
> On 12/17/2009 11:24 AM, Richard Brodie wrote:
>> A raw string is not a distinct type from an ordinary string
>> in the same way byte strings and Unicode strings are. It
>> is merely a notation for constants, like writing integers
>> in hexadecimal.
>>
>>>>> (r'\n', u'a', 0x16)
>> ('\\n', u'a', 22)
>
>
>
> Yes, that was a mistake. But the problem remains::
>
> >>> re.sub('abc', r'a\nb\n.c\a','123abcdefg') == re.sub('abc',
> 'a\\nb\\n.c\\a',' 123abcdefg') == re.sub('abc', 'a\nb\n.c\a','123abcdefg')
> True
> >>> r'a\nb\n.c\a' == 'a\\nb\\n.c\\a' == 'a\nb\n.c\a'
> False
>
> Why are the first two strings being treated as if they are the last one?
> That is, why isn't '\\' being processed in the obvious way?
> This still seems wrong. Why isn't it?
>
> More simply, consider::
>
> >>> re.sub('abc', '\\', '123abcdefg')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "C:\Python26\lib\re.py", line 151, in sub
>     return _compile(pattern, 0).sub(repl, string, count)
>   File "C:\Python26\lib\re.py", line 273, in _subx
>     template = _compile_repl(template, pattern)
>   File "C:\Python26\lib\re.py", line 260, in _compile_repl
>     raise error, v # invalid expression
> sre_constants.error: bogus escape (end of line)
>
> Why is this the proper handling of what one might think would be an
> obvious substitution?
>
Regular expressions and replacement strings have their own escaping
mechanism, which also uses backslashes.

Some of these regex escape sequences are the same as those of string
literals, eg \n represents a newline; others are different, eg \b in a
regex represents a word boundary and not a backspace as in a string
literal.

You can match a newline in a regex by either using an actual newline
character ('\n' in a string literal) or an escape sequence ('\\n' or
r'\n' in a string literal). If you want a regex to match an actual
backslash followed by a letter 'n' then you need to escape the backslash
in the regex and then either use a raw string literal or escape it again
in a non-raw string literal.

Match characters: <newline>
Regex: \n
Raw string literal: r'\n'
Non-raw string literal: '\\n'

Match characters: \n
Regex: \\n
Raw string literal: r'\\n'
Non-raw string literal: '\\\\n'

Replace with characters: <newline>
Replacement: \n
Raw string literal: r'\n'
Non-raw string literal: '\\n'

Replace with characters: \n
Replacement: \\n
Raw string literal: r'\\n'
Non-raw string literal: '\\\\n'
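
The four cases in the table above can be checked with a few assertions. This is a sketch assuming Python 3 (`re.fullmatch` needs 3.4 or later); `NEWLINE`, `BACKSLASH_N` and the `'aXb'` sample are illustrative names, not from the thread.

```python
import re

NEWLINE = '\n'        # one character: a real newline
BACKSLASH_N = r'\n'   # two characters: a backslash followed by n

# Matching: regex \n matches a newline, regex \\n matches backslash + n.
assert re.fullmatch(r'\n', NEWLINE)
assert re.fullmatch('\\n', NEWLINE)
assert re.fullmatch(r'\\n', BACKSLASH_N)
assert re.fullmatch('\\\\n', BACKSLASH_N)
assert not re.fullmatch(r'\\n', NEWLINE)

# Replacing: template \n inserts a newline, template \\n inserts backslash + n.
with_newline = re.sub('X', r'\n', 'aXb')
with_backslash_n = re.sub('X', r'\\n', 'aXb')
assert with_newline == 'a' + NEWLINE + 'b'
assert with_backslash_n == 'a' + BACKSLASH_N + 'b'

# The non-raw spellings are the same strings, so they behave identically.
assert re.sub('X', '\\n', 'aXb') == with_newline
assert re.sub('X', '\\\\n', 'aXb') == with_backslash_n

print(repr(with_newline), repr(with_backslash_n))
```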
