Raw string substitution problem [Python]


From: Alan G Isaac on 17 Dec 2009 12:54

> Alan G Isaac<alan.isaac(a)gmail.com> wrote:
>> >>> re.sub('abc', r'a\nb\n.c\a','123abcdefg') == re.sub('abc', 'a\\nb\\n.c\\a','123abcdefg') == re.sub('abc', 'a\nb\n.c\a','123abcdefg')
>> True
>> Why are the first two strings being treated as if they are the last one?

On 12/17/2009 12:19 PM, D'Arcy J.M. Cain wrote:
> They aren't. The last string is different.

Of course it is different.
That is the basis of my question.
Why is it being treated as if it is the same?
(See the end of this post.)

> Alan G Isaac<alan.isaac(a)gmail.com> wrote:
>> More simply, consider::
>>
>> >>> re.sub('abc', '\\', '123abcdefg')
>> Traceback (most recent call last):
>>   File "<stdin>", line 1, in <module>
>>   File "C:\Python26\lib\re.py", line 151, in sub
>>     return _compile(pattern, 0).sub(repl, string, count)
>>   File "C:\Python26\lib\re.py", line 273, in _subx
>>     template = _compile_repl(template, pattern)
>>   File "C:\Python26\lib\re.py", line 260, in _compile_repl
>>     raise error, v # invalid expression
>> sre_constants.error: bogus escape (end of line)
>>
>> Why is this the proper handling of what one might think would be an
>> obvious substitution?

On 12/17/2009 12:19 PM, D'Arcy J.M. Cain wrote:
> Is this what you want? What you have is a re expression consisting of
> a single backslash that doesn't escape anything (EOL) so it barfs.
>>>> re.sub('abc', r'\\', '123abcdefg')
> '123\\defg'

Turning again to the documentation:
"if it is a string, any backslash escapes in it are processed.
That is, \n is converted to a single newline character, \r is
converted to a linefeed, and so forth."
So why is '\n' converted to a newline but '\\' does not become a literal
backslash? OK, I don't do much string processing, so perhaps this is where
I am missing the point: how is the replacement being "converted"?
(As Peter's example shows, if you supply the replacement via
a function, this does not happen.) You suggest it is just a matter of
it being an re, but::

>>> re.sub('abc', 'a\\nc','1abcd') == re.sub('abc', 'a\nc','1abcd')
True
>>> re.compile('a\\nc') == re.compile('a\nc')
False

So I have two strings that are not the same, nor do they compile
equivalently, yet apparently they are "converted" to something
equivalent for the substitution. Why? Is my question clearer?
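
A minimal sketch of the behaviour in question, assuming a current CPython 3 rather than the Python 2.6 shown in the traceback. A string replacement goes through re's own escape processing, while the return value of a callable is inserted as-is. The variable names are illustrative only.

```python
import re

subject = '1abcd'

# Replacement holds four characters: a, backslash, n, c.
# re processes the backslash-n pair itself and emits a newline.
from_escaped = re.sub('abc', 'a\\nc', subject)

# Replacement holds three characters: a, newline, c.
# re finds no backslash in it, so nothing is changed.
from_literal = re.sub('abc', 'a\nc', subject)

# A callable's return value is inserted verbatim, backslash included.
from_function = re.sub('abc', lambda m: 'a\\nc', subject)

print(from_escaped == from_literal)
print(repr(from_escaped))
print(repr(from_function))
```

The first two results agree because they reach the same replacement text by different routes; the third keeps the backslash because no template processing is applied to it.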

If the answer looks too obvious to state, assume I'm missing it anyway
and please state it. As I said, I seldom use the re module.

Alan Isaac

From: MRAB on 17 Dec 2009 14:45

Alan G Isaac wrote:
>> Alan G Isaac<alan.isaac(a)gmail.com> wrote:
>>> >>> re.sub('abc', r'a\nb\n.c\a','123abcdefg') ==
>>> re.sub('abc', 'a\\nb\\n.c\\a','123abcdefg') == re.sub('abc',
>>> 'a\nb\n.c\a','123abcdefg')
>>> True
>>> Why are the first two strings being treated as if they are the last one?
>
>
> On 12/17/2009 12:19 PM, D'Arcy J.M. Cain wrote:
>> They aren't. The last string is different.
>
> Of course it is different.
> That is the basis of my question.
> Why is it being treated as if it is the same?
> (See the end of this post.)
>
>
>> Alan G Isaac<alan.isaac(a)gmail.com> wrote:
>>> More simply, consider::
>>>
>>> >>> re.sub('abc', '\\', '123abcdefg')
>>> Traceback (most recent call last):
>>>   File "<stdin>", line 1, in <module>
>>>   File "C:\Python26\lib\re.py", line 151, in sub
>>>     return _compile(pattern, 0).sub(repl, string, count)
>>>   File "C:\Python26\lib\re.py", line 273, in _subx
>>>     template = _compile_repl(template, pattern)
>>>   File "C:\Python26\lib\re.py", line 260, in _compile_repl
>>>     raise error, v # invalid expression
>>> sre_constants.error: bogus escape (end of line)
>>>
>>> Why is this the proper handling of what one might think would be an
>>> obvious substitution?
>
>
> On 12/17/2009 12:19 PM, D'Arcy J.M. Cain wrote:
>> Is this what you want? What you have is a re expression consisting of
>> a single backslash that doesn't escape anything (EOL) so it barfs.
> >>>> re.sub('abc', r'\\', '123abcdefg')
> > '123\\defg'
>
>
> Turning again to the documentation:
> "if it is a string, any backslash escapes in it are processed.
> That is, \n is converted to a single newline character, \r is
> converted to a linefeed, and so forth."
> So why is '\n' converted to a newline but '\\' does not become a literal
> backslash? OK, I don't do much string processing, so perhaps this is where
> I am missing the point: how is the replacement being "converted"?
> (As Peter's example shows, if you supply the replacement via
> a function, this does not happen.) You suggest it is just a matter of
> it being an re, but::
>
> >>> re.sub('abc', 'a\\nc','1abcd') == re.sub('abc', 'a\nc','1abcd')
> True
> >>> re.compile('a\\nc') == re.compile('a\nc')
> False
>
> So I have two strings that are not the same, nor do they compile
> equivalently, yet apparently they are "converted" to something
> equivalent for the substitution. Why? Is my question clearer?
>
re.compile('a\\nc') _does_ compile to the same regex as
re.compile('a\nc').

However, regex objects never compare equal to each other, so, strictly
speaking, re.compile('a\nc') != re.compile('a\nc').

However, having said that, the re module contains a cache (keyed on the
string and options supplied), so the first re.compile('a\nc') will put
the regex object in the cache and the second re.compile('a\nc') will
return that same regex object from the cache. If you clear the cache in
between the two calls (do re._cache.clear()) you'll get two different
regex objects which won't compare equal even though they are to all
intents identical.
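
A rough sketch of the caching effect described above, assuming CPython. It uses the public re.purge() in place of re._cache.clear(), since the cache attribute is an internal detail. Whether two distinct pattern objects compare equal with == depends on the Python version, so the sketch checks identity rather than equality. All names are illustrative.

```python
import re

first = re.compile('a\nc')
second = re.compile('a\nc')

# Same pattern string and flags: the second call is served from the cache,
# so both names refer to one object.
same_while_cached = first is second

# Empty the module's cache, then compile the same string again.
re.purge()
third = re.compile('a\nc')

# 'first' is still alive, so the freshly compiled object is a different one.
same_after_purge = first is third

print(same_while_cached, same_after_purge)
```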

> If the answer looks too obvious to state, assume I'm missing it anyway
> and please state it. As I said, I seldom use the re module.
>

From: Alan G Isaac on 17 Dec 2009 15:18

On 12/17/2009 2:45 PM, MRAB wrote:
> re.compile('a\\nc') _does_ compile to the same regex as
> re.compile('a\nc').
>
> However, regex objects never compare equal to each other, so, strictly
> speaking, re.compile('a\nc') != re.compile('a\nc').
>
> However, having said that, the re module contains a cache (keyed on the
> string and options supplied), so the first re.compile('a\nc') will put
> the regex object in the cache and the second re.compile('a\nc') will
> return that same regex object from the cache. If you clear the cache in
> between the two calls (do re._cache.clear()) you'll get two different
> regex objects which won't compare equal even though they are to all
> intents identical.

OK, this is helpful.
(I did check equality but did not understand
I got True only because re used caching.)
So is the bottom line the following?
A string replacement is not just "converted"
as described in the documentation, essentially
it is compiled?

But that cannot quite be right. E.g., \b will be a backspace,
not a word boundary. So then the question arises
again, why isn't '\\' a backslash? Just because?
Why does it not get the "obvious" conversion?

Thanks,
Alan Isaac

From: MRAB on 17 Dec 2009 15:51

Alan G Isaac wrote:
> On 12/17/2009 2:45 PM, MRAB wrote:
>> re.compile('a\\nc') _does_ compile to the same regex as
>> re.compile('a\nc').
>>
>> However, regex objects never compare equal to each other, so, strictly
>> speaking, re.compile('a\nc') != re.compile('a\nc').
>>
>> However, having said that, the re module contains a cache (keyed on the
>> string and options supplied), so the first re.compile('a\nc') will put
>> the regex object in the cache and the second re.compile('a\nc') will
>> return that same regex object from the cache. If you clear the cache in
>> between the two calls (do re._cache.clear()) you'll get two different
>> regex objects which won't compare equal even though they are to all
>> intents identical.
>
>
> OK, this is helpful.
> (I did check equality but did not understand
> I got True only because re used caching.)
> So is the bottom line the following?
> A string replacement is not just "converted"
> as described in the documentation, essentially
> it is compiled?
>
> But that cannot quite be right. E.g., \b will be a backspace,
> not a word boundary. So then the question arises
> again, why isn't '\\' a backslash? Just because?
> Why does it not get the "obvious" conversion?
>
If you give the re module a string containing \b, e.g. '\\b' or r'\b',
then it will compile it to a word boundary if it's in a regex string or
a backspace if it's in a replacement string. This is different from
giving the re module a string which actually contains a backspace, e.g.
'\b'.

Because the re module uses backslashes for escaping, you'll need to
escape a literal backslash with a backslash in the string you give it.
But string literals also use backslashes for escaping, so you'll need to
escape each of those backslashes with a backslash.
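
The two layers of escaping can be sketched as follows, assuming a current CPython 3. The sample strings are made up for illustration.

```python
import re

# In a pattern, the two characters backslash + b mean "word boundary".
boundary = re.sub(r'\bcat\b', 'dog', 'cat concat')

# In a replacement, the same two characters mean "backspace".
backspace = re.sub('cat', r'\b', 'a cat')

# A literal backslash in the output needs two backslashes in the string
# handed to re: four in an ordinary literal, or two in a raw literal.
same_replacement = ('\\\\' == r'\\')
one_backslash = re.sub('abc', r'\\', '123abcdefg')

print(repr(boundary))
print(repr(backspace))
print(same_replacement, len(one_backslash))
```

Raw string literals remove only the first layer (the Python literal); the second layer (re's own escapes) still applies to whatever string re receives.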

From: Rhodri James on 17 Dec 2009 19:59

On Thu, 17 Dec 2009 20:18:12 -0000, Alan G Isaac <alan.isaac(a)gmail.com>
wrote:

> So is the bottom line the following?
> A string replacement is not just "converted"
> as described in the documentation, essentially
> it is compiled?

That depends entirely on what you mean.

> But that cannot quite be right. E.g., \b will be a backspace,
> not a word boundary. So then the question arises
> again, why isn't '\\' a backslash? Just because?
> Why does it not get the "obvious" conversion?

'\\' *is* a backslash. That string containing a single backslash is then
processed by the re module which sees a backslash, tries to interpret it
as an escape, fails and barfs.

"re.compile('a\\nc')" passes a sequence of four characters to re.compile:
'a', '\', 'n' and 'c'. re.compile() then does its own interpretation:
'a' passes through as is, '\' flags an escape which combined with 'n'
produces the newline character (0x0a), and 'c' passes through as is.

"re.compile('a\nc')" by contrast passes a sequence of three characters to
re.compile: 'a', 0x0a and 'c'. re.compile() does its own interpretation,
which happens not to change any of the characters, resulting in the same
regular expression as before.

Your problem is that you are conflating the compile-time processing of
string literals with the run-time processing of strings specific to re.
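
The two stages can be checked separately. This is a small sketch assuming a current CPython 3, where the error raised is re.error rather than the sre_constants.error in the 2.6 traceback. Variable names are illustrative.

```python
import re

# Stage one: Python parses the string literals.
four_chars = 'a\\nc'   # a, backslash, n, c
three_chars = 'a\nc'   # a, newline, c
print(len(four_chars), len(three_chars))

# Stage two: re interprets whatever string it was handed.
# Both patterns end up matching the text a, newline, c.
target = 'a\nc'
both_match = (re.fullmatch(four_chars, target) is not None
              and re.fullmatch(three_chars, target) is not None)

# A lone backslash is a perfectly good Python string,
# but an incomplete escape as far as re is concerned.
try:
    re.sub('abc', '\\', '123abcdefg')
    lone_backslash_rejected = False
except re.error:
    lone_backslash_rejected = True

print(both_match, lone_backslash_rejected)
```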

--
Rhodri James *-* Wildebeeste Herder to the Masses
