From: Greg Willits on
Daniel DeLorme wrote:
> Greg Willits wrote:

>> But... this fails /^[a-zA-Z\xE4-\xE6]*?&/u
>> But... this works /^[a-zA-Z\xE4\xE5\xE6]*?&/u
>>
>> I've boiled the experiments down to realizing I can't define a range
>> with \x

> Let me try to explain that in order to redeem myself from my previous
> angry post.

:-)

> Basically, \xE4 is counted as the byte value 0xE4, not the unicode
> character U+00E4. And in a range expression, each escaped value is taken
> as one character within the range. Which results in not-immediately
> obvious situations:
>
> >> 'aébvHögtåwHÅFuG'.scan(/[\303\251]/u)
> => []
> >> 'aébvHögtåwHÅFuG'.scan(/[#{"\303\251"}]/u)
> => ["é"]

OK, I see the oniguruma docs refer to \x as an encoded byte value and
\x{} as a character code point -- with your explanation I can finally
tie together what that means.

Took me a second to recognize the #{} as Ruby and not some new regex I'd
never seen :-P

And I realize now too I wasn't picking up on the use of octal vs
decimal.

Seems like Ruby doesn't support the hex \x{7HHHHHHH} variant?
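For what it's worth, code-point escapes did arrive later: Ruby 1.9+ supports \u{...} in regexps, which makes the original range work directly. A quick sketch, assuming a UTF-8 source file (the \A/\z anchors here are mine, not from the original pattern):

```ruby
# Ruby 1.9+ only: \u{...} code-point escapes inside a regexp character
# class, so a range over accented characters works as intended.
pattern = /\A[a-zA-Z\u{E4}-\u{E6}]*\z/  # U+00E4 (ä) .. U+00E6 (æ)

"zäåæ" =~ pattern   # => 0   (every character is inside the class)
"zç"   =~ pattern   # => nil (ç is U+00E7, just outside the range)
```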


> What is happening in the first case is that the string does not contain
> characters \303 or \251 because those are invalid utf8 sequences. But
> when the value "\303\251" is *inlined* into the regex, that is
> recognized as the utf8 character "é" and a match is found.
>
> So ranges *do* work in utf8 but you have to be careful:
>
> >> "àâäçèéêîïôü".scan(/[ä-î]/u)
> => ["ä", "ç", "è", "é", "ê", "î"]
> >> "àâäçèéêîïôü".scan(/[\303\244-\303\256]/u)
> => ["\303", "\303", "\303", "\244", "\303", "\247", "\303", "\250",
> "\303", "\251", "\303", "\252", "\303", "\256", "\303", "\257", "\303",
> "\264", "\303", "\274"]
> >> "àâäçèéêîïôü".scan(/[#{"\303\244-\303\256"}]/u)
> => ["ä", "ç", "è", "é", "ê", "î"]
>
> Hope this helps.

Yes!

-- gw
--
Posted via http://www.ruby-forum.com/.

From: Greg Willits on
>> Basically, \xE4 is counted as the byte value 0xE4, not the unicode
>> character U+00E4. And in a range expression, each escaped value is taken
>> as one character within the range. Which results in not-immediately
>> obvious situations:
>>
>> >> 'aébvHögtåwHÅFuG'.scan(/[\303\251]/u)
>> => []
>> >> 'aébvHögtåwHÅFuG'.scan(/[#{"\303\251"}]/u)
>> => ["é"]


OK, one thing I'm still confused about -- when I look up é in any table,
its DEC value is 233, which converts to OCT 351, yet you're using 251
(and indeed it seems like reducing the OCTs I come up with by 100 is
what actually works).

Where is this 100 difference coming from?
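Checking the numbers directly clears this up: 351 octal is the code point U+00E9 itself (DEC 233), while 303 and 251 are the octal values of the two bytes of é's UTF-8 encoding, 0xC3 0xA9. A quick sketch (runs the same on later Rubies):

```ruby
# "é" is code point U+00E9: decimal 233, octal 351.
# In UTF-8 it is *encoded* as two bytes, 0xC3 0xA9 (octal 303 251),
# and those bytes are what the \303\251 escapes refer to.
e_acute = "\303\251"                       # UTF-8 bytes for "é"
e_acute.unpack("C*")                       # => [195, 169]
e_acute.unpack("C*").map { |b| "%o" % b }  # => ["303", "251"]
"%o" % 233                                 # => "351" (the code point, in octal)
```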

-- gw


From: Phrogz on
On Dec 3, 12:47 pm, Greg Willits <li...(a)gregwillits.ws> wrote:

From: Charles Oliver Nutter on
Daniel DeLorme wrote:
> Usually the complaint about the lack of unicode support is that
> something like "日本語".length returns 9 instead of 3, or that "日本語
> ".index("語") returns 6 instead of 2. It's nice that people want to
> completely redefine the API to return character positions and all that,
> but please don't complain that it's broken just because you happen to be
> using it incorrectly. Use the right tool for the job. SQL for database
> queries, non-home-brewed crypto libraries for security, regular
> expressions for string manipulation.
>
> I'm terribly sorry for the rant but I had to get it off my chest.

Regular expressions for all character work would be a *terribly* slow
way to get things done. If you want to get the nth character, should you
do a match for n-1 characters and a group to grab the nth? Or would it
be better if you could just index into the string and have it do the
right thing? How about if you want to iterate over all characters in a
string? Should the iterating code have to know about the encoding?
Should you use a regex to peel off one character at a time? Absurd.

Regex for string access goes a long way, but it's just about the heaviest
way to do it. Strings should be aware of their encoding and should be
able to provide you access to characters as easily as bytes. That's what
1.9 (and upcoming changes in JRuby) fixes.
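In 1.9, for example (a quick illustration of the encoding-aware behavior, assuming a UTF-8 source file):

```ruby
# Ruby 1.9+ strings know their encoding, so character-level
# operations work without regex workarounds.
s = "日本語"
s.length          # => 3  (characters, not bytes)
s.bytesize        # => 9  (UTF-8 bytes)
s.index("語")     # => 2  (character index)
s[1]              # => "本"
s.each_char.to_a  # => ["日", "本", "語"]
```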

- Charlie

From: MonkeeSage on
On Dec 2, 7:40 pm, Daniel DeLorme <dan...(a)dan42.com> wrote:
> Greg Willits wrote:
> > Greg Willits wrote:
>
> >> I'm expecting a validate_format_of with a regex like this
> >> /^[a-zA-Z\xC0-\xD6\xD9-\xF6\xF9-\xFF\.\'\-\ ]*?$/
> >> to allow many of the normal characters like ö é å to be submitted via
> >> web form. However, the extended characters are being rejected.
>
> > So, I've been pounding the web for info on UTF8 in Ruby and Rails the
> > past couple days to concoct some validations that allow UTF8
> > characters. I have discovered that I can get a little further by doing
> > the
> > following:
> > - declaring $KCODE = 'UTF8'
> > - adding /u to regex expressions.
>
> > The only thing not working now is the ability to define a range of \x
> > characters in a regex.
>
> > So, this /^[a-zA-Z\xE4]*?&/u will validate that a string is allowed
> > to have an ä in it. Perfect.
>
> > But... this fails /^[a-zA-Z\xE4-\xE6]*?&/u
>
> > But... this works /^[a-zA-Z\xE4\xE5\xE6]*?&/u
>
> > I've boiled the experiments down to realizing I can't define a range
> > with \x
>
> > Is this just one of those things that just doesn't work yet WRT Ruby/
> > Rails/UTF8, or is there another syntax? I've scoured all the regex
> > docs I can find, and they seem to indicate a range should work.
>
> Let me try to explain that in order to redeem myself from my previous
> angry post.
>
> Basically, \xE4 is counted as the byte value 0xE4, not the unicode
> character U+00E4. And in a range expression, each escaped value is taken
> as one character within the range. Which results in not-immediately
> obvious situations:
>
> >> 'aébvHögtåwHÅFuG'.scan(/[\303\251]/u)
> => []
> >> 'aébvHögtåwHÅFuG'.scan(/[#{"\303\251"}]/u)
> => ["é"]
>
> What is happening in the first case is that the string does not contain
> characters \303 or \251 because those are invalid utf8 sequences. But
> when the value "\303\251" is *inlined* into the regex, that is
> recognized as the utf8 character "é" and a match is found.
>
> So ranges *do* work in utf8 but you have to be careful:
>
> >> "àâäçèéêîïôü".scan(/[ä-î]/u)
> => ["ä", "ç", "è", "é", "ê", "î"]
> >> "àâäçèéêîïôü".scan(/[\303\244-\303\256]/u)
> => ["\303", "\303", "\303", "\244", "\303", "\247", "\303", "\250",
> "\303", "\251", "\303", "\252", "\303", "\256", "\303", "\257", "\303",
> "\264", "\303", "\274"]
> >> "àâäçèéêîïôü".scan(/[#{"\303\244-\303\256"}]/u)
> => ["ä", "ç", "è", "é", "ê", "î"]
>
> Hope this helps.
>
> Dan

I missed your ranting.

Firstly, ruby doesn't have unicode support in 1.8, since unicode *IS*
a standard mapping of bytes to *characters*. That's what unicode is.
I'm sorry you don't like that, but don't lie and say ruby 1.8 supports
unicode when it knows nothing about that standard mapping and treats
everything as individual bytes (and any byte with a value greater than
126 just prints an octal escape); and please don't accuse others of
spreading FUD when they state the facts.

Secondly, as I said in my first post to this thread, the characters
trying to be matched are composite characters, which requires you to
match both bytes. You can try using a unicode regexp, but then you
run into the problem you mention--the regexp engine expects the pre-
composed, one-byte form...

"ò".scan(/[\303\262]/u) # => []
"ò".scan(/[\xf2]/u) # => ["\303\262"]

...which is why I said it's more robust to use something like the
regexp that Jimmy linked to and I reposted, instead of a unicode
regexp.
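For comparison, inlining the literal characters (as Daniel showed above) keeps the class matching whole UTF-8 characters rather than bytes -- a quick check, assuming a UTF-8 source file:

```ruby
# Literal characters in a class are matched as whole UTF-8
# characters; octal byte escapes like \303\262 are not.
"ò".scan(/[ò]/u)              # => ["ò"]
"àâäçèéêîïôü".scan(/[ä-î]/u)  # => ["ä", "ç", "è", "é", "ê", "î"]
```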

Regards,
Jordan