From: Alan G Isaac on 17 Dec 2009 10:08

> On Wed, 16 Dec 2009 11:09:32 -0300, Ed Keith <e_d_k(a)yahoo.com> wrote:
>
>> I am having a problem when substituting a raw string. When I do the
>> following:
>>
>> re.sub('abc', r'a\nb\nc', '123abcdefg')
>>
>> I get
>>
>> """
>> 123a
>> b
>> cdefg
>> """
>>
>> what I want is
>>
>> r'123a\nb\ncdefg'

On 12/16/2009 9:35 AM, Gabriel Genellina wrote:
> From http://docs.python.org/library/re.html#re.sub
>
> re.sub(pattern, repl, string[, count])
>
> ...repl can be a string or a function; if
> it is a string, any backslash escapes in
> it are processed. That is, \n is converted
> to a single newline character, \r is
> converted to a linefeed, and so forth.
>
> So you'll have to double your backslashes:

I'm not persuaded that the docs are clear. Consider:

>>> 'ab\\ncd' == r'ab\ncd'
True

Naturally enough. So I think the right answer is:

1. this is a documentation bug (i.e., the documentation
   fails to specify unexpected behavior for raw strings), or
2. this is a bug (i.e., raw strings are not handled correctly
   when used as replacements)

I vote for 2.

Peter's use of a function highlights just how odd this is:
getting the raw string via a function produces a different
result than providing it directly. If this is really the
way things ought to be, I'd appreciate a clear explanation
of why.

Alan Isaac

From: Richard Brodie on 17 Dec 2009 11:24

"Alan G Isaac" <alan.isaac(a)gmail.com> wrote in message
news:qemdnRUT0JvJ1LfWnZ2dnUVZ_vqdnZ2d(a)rcn.net...

> Naturally enough. So I think the right answer is:
>
> 1. this is a documentation bug (i.e., the documentation
> fails to specify unexpected behavior for raw strings), or
> 2. this is a bug (i.e., raw strings are not handled correctly
> when used as replacements)

<neo> There is no raw string. </neo>

A raw string is not a distinct type from an ordinary string
in the same way byte strings and Unicode strings are. It
is merely a notation for constants, like writing integers
in hexadecimal.

>>> (r'\n', u'a', 0x16)
('\\n', u'a', 22)

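The point about notation can be checked directly. This is only a minimal sketch, assuming Python 3 (the thread itself uses Python 2.6); the variable names are illustrative and not from the original posts:

```python
# A raw literal and a non-raw literal that spell the same characters
# produce ordinary, equal str objects; "rawness" does not survive parsing.
raw = r'a\nb'      # characters: a, backslash, n, b
plain = 'a\\nb'    # the same four characters, written with an escaped backslash

same_type = type(raw) is type(plain)
same_value = raw == plain
length = len(raw)  # the backslash and the 'n' are two separate characters

print(same_type, same_value, length)
```

Since both spellings yield the same object value, a function such as re.sub has no way to tell which notation was used in the source.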
From: Alan G Isaac on 17 Dec 2009 11:51

On 12/17/2009 11:24 AM, Richard Brodie wrote:
> A raw string is not a distinct type from an ordinary string
> in the same way byte strings and Unicode strings are. It
> is merely a notation for constants, like writing integers
> in hexadecimal.
>
> >>> (r'\n', u'a', 0x16)
> ('\\n', u'a', 22)

Yes, that was a mistake. But the problem remains::

    >>> re.sub('abc', r'a\nb\n.c\a','123abcdefg') == re.sub('abc', 'a\\nb\\n.c\\a',' 123abcdefg') == re.sub('abc', 'a\nb\n.c\a','123abcdefg')
    True
    >>> r'a\nb\n.c\a' == 'a\\nb\\n.c\\a' == 'a\nb\n.c\a'
    False

Why are the first two strings being treated as if they are the last one?
That is, why isn't '\\' being processed in the obvious way?
This still seems wrong. Why isn't it?

More simply, consider::

    >>> re.sub('abc', '\\', '123abcdefg')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "C:\Python26\lib\re.py", line 151, in sub
        return _compile(pattern, 0).sub(repl, string, count)
      File "C:\Python26\lib\re.py", line 273, in _subx
        template = _compile_repl(template, pattern)
      File "C:\Python26\lib\re.py", line 260, in _compile_repl
        raise error, v # invalid expression
    sre_constants.error: bogus escape (end of line)

Why is this the proper handling of what one might think would be an
obvious substitution?

Thanks,
Alan Isaac

From: D'Arcy J.M. Cain on 17 Dec 2009 12:19

On Thu, 17 Dec 2009 11:51:26 -0500
Alan G Isaac <alan.isaac(a)gmail.com> wrote:
> >>> re.sub('abc', r'a\nb\n.c\a','123abcdefg') == re.sub('abc', 'a\\nb\\n.c\\a',' 123abcdefg') == re.sub('abc', 'a\nb\n.c\a','123abcdefg')
> True

Was this a straight cut and paste or did you make a manual change? Is
that leading space in the middle one a copying error? I get False for
what you actually have there, for obvious reasons.

> >>> r'a\nb\n.c\a' == 'a\\nb\\n.c\\a' == 'a\nb\n.c\a'
> False
>
> Why are the first two strings being treated as if they are the last one?

They aren't. The last string is different.

>>> for x in (r'a\nb\n.c\a', 'a\\nb\\n.c\\a', 'a\nb\n.c\a'): print repr(x)
...
'a\\nb\\n.c\\a'
'a\\nb\\n.c\\a'
'a\nb\n.c\x07'

> That is, why isn't '\\' being processed in the obvious way?
> This still seems wrong. Why isn't it?

What do you think is wrong? What would the "obvious" way of handling
'\\' be?

> More simply, consider::
>
> >>> re.sub('abc', '\\', '123abcdefg')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "C:\Python26\lib\re.py", line 151, in sub
>     return _compile(pattern, 0).sub(repl, string, count)
>   File "C:\Python26\lib\re.py", line 273, in _subx
>     template = _compile_repl(template, pattern)
>   File "C:\Python26\lib\re.py", line 260, in _compile_repl
>     raise error, v # invalid expression
> sre_constants.error: bogus escape (end of line)
>
> Why is this the proper handling of what one might think would be an
> obvious substitution?

Is this what you want? What you have is a replacement template consisting
of a single backslash that doesn't escape anything (EOL), so it barfs.

>>> re.sub('abc', r'\\', '123abcdefg')
'123\\defg'

--
D'Arcy J.M. Cain <darcy(a)druid.net>         |  Democracy is three wolves
http://www.druid.net/darcy/                |  and a sheep voting on
+1 416 425 1212     (DoD#0082)    (eNTP)   |  what's for dinner.

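The two cases discussed in this post can be reproduced side by side. A rough sketch, assuming Python 3, where the exception is exposed as re.error and the message wording differs from the 2.6 traceback quoted above; the variable names are made up for illustration:

```python
import re

# r'\\' is two backslash characters; the replacement template engine
# reads them as one escaped backslash and inserts a single backslash.
result = re.sub('abc', r'\\', '123abcdefg')

# '\\' is one backslash character. As a template it is an escape with
# nothing after it, so compiling the template is expected to fail.
try:
    re.sub('abc', '\\', '123abcdefg')
    lone_backslash_failed = False
except re.error:
    lone_backslash_failed = True

print(repr(result), lone_backslash_failed)
```

The repr of the result shows a doubled backslash only because repr escapes it; the string itself contains one backslash between '123' and 'defg'.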
From: MRAB on 17 Dec 2009 12:38

Alan G Isaac wrote:
> On 12/17/2009 11:24 AM, Richard Brodie wrote:
>> A raw string is not a distinct type from an ordinary string
>> in the same way byte strings and Unicode strings are. It
>> is merely a notation for constants, like writing integers
>> in hexadecimal.
>>
>> >>> (r'\n', u'a', 0x16)
>> ('\\n', u'a', 22)
>
> Yes, that was a mistake. But the problem remains::
>
>  >>> re.sub('abc', r'a\nb\n.c\a','123abcdefg') == re.sub('abc',
> 'a\\nb\\n.c\\a',' 123abcdefg') == re.sub('abc', 'a\nb\n.c\a','123abcdefg')
>  True
>  >>> r'a\nb\n.c\a' == 'a\\nb\\n.c\\a' == 'a\nb\n.c\a'
>  False
>
> Why are the first two strings being treated as if they are the last one?
> That is, why isn't '\\' being processed in the obvious way?
> This still seems wrong. Why isn't it?
>
> More simply, consider::
>
>  >>> re.sub('abc', '\\', '123abcdefg')
>  Traceback (most recent call last):
>    File "<stdin>", line 1, in <module>
>    File "C:\Python26\lib\re.py", line 151, in sub
>      return _compile(pattern, 0).sub(repl, string, count)
>    File "C:\Python26\lib\re.py", line 273, in _subx
>      template = _compile_repl(template, pattern)
>    File "C:\Python26\lib\re.py", line 260, in _compile_repl
>      raise error, v # invalid expression
>  sre_constants.error: bogus escape (end of line)
>
> Why is this the proper handling of what one might think would be an
> obvious substitution?
>
Regular expressions and replacement strings have their own escaping
mechanism, which also uses backslashes.

Some of these regex escape sequences are the same as those of string
literals, e.g. \n represents a newline; others are different, e.g. \b in
a regex represents a word boundary and not a backspace as in a string
literal.

You can match a newline in a regex by either using an actual newline
character ('\n' in a string literal) or an escape sequence ('\\n' or
r'\n' in a string literal).

If you want a regex to match an actual backslash followed by a letter
'n' then you need to escape the backslash in the regex and then either
use a raw string literal or escape it again in a non-raw string literal.

Match characters: <newline>
Regex: \n
Raw string literal: r'\n'
Non-raw string literal: '\\n'

Match characters: \n
Regex: \\n
Raw string literal: r'\\n'
Non-raw string literal: '\\\\n'

Replace with characters: <newline>
Replacement: \n
Raw string literal: r'\n'
Non-raw string literal: '\\n'

Replace with characters: \n
Replacement: \\n
Raw string literal: r'\\n'
Non-raw string literal: '\\\\n'
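
The four rows above can be exercised in a short script. This is a sketch under the assumption of Python 3 rather than the 2.6 used in the thread, and the variable names are illustrative only. The last call uses a function as the replacement, the approach attributed to Peter earlier in the thread, whose return value is inserted without template escape processing:

```python
import re

# Matching: regex \n matches a newline; regex \\n matches backslash + 'n'.
matches_newline = re.search(r'\n', 'a\nb') is not None
matches_backslash_n = re.search(r'\\n', r'a\nb') is not None

# Replacing: template \n inserts a newline; template \\n inserts backslash + 'n'.
with_newlines = re.sub('abc', r'a\nb\nc', '123abcdefg')
with_backslashes = re.sub('abc', r'a\\nb\\nc', '123abcdefg')

# A callable replacement: its return value is used as-is, so the
# backslashes do not need to be doubled.
via_function = re.sub('abc', lambda m: r'a\nb\nc', '123abcdefg')

print(repr(with_newlines))
print(repr(with_backslashes))
print(repr(via_function))
```

Doubling the backslashes and using a function both give the literal backslash-n text that the original question asked for; which one reads better probably depends on whether the replacement text is fixed or computed.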