Unicode in Regex [Ruby]

From: Greg Willits on 30 Nov 2007 15:18

This is mostly a Ruby thing, and partly a Rails thing.

I'm expecting a validates_format_of with a regex like this

/^[a-zA-Z\xC0-\xD6\xD9-\xF6\xF9-\xFF\.\'\-\ ]*?$/

to allow many of the normal characters like ö é å to be submitted via
web form.

However, the extended characters are being rejected.

This works just fine though (which is just a-zA-Z)

/^[\x41-\x5A\x61-\x7A\.\'\-\ ]*?$/

It also seems to fail with four-digit escapes like \x0000. Is there a
limit at \xFF?

Some plain Ruby tests seem to suggest that Unicode characters don't work
at all:

p 'abvHgtwHFuG'.scan(/[a-z]/)
p 'abvHgtwHFuG'.scan(/[A-Z]/)
p 'abvHgtwHFuG'.scan(/[\x41-\x5A]/)
p 'abvHgtwHFuG'.scan(/[\x61-\x7A]/)
p 'aébvHögtåwHÅFuG'.scan(/[\xC0-\xD6\xD9-\xF6\xF9-\xFF]/)

["a", "b", "v", "g", "t", "w", "u"]
["H", "H", "F", "G"]
["H", "H", "F", "G"]
["a", "b", "v", "g", "t", "w", "u"]
["\303", "\303", "\303", "\303"]

So, what's the secret to using unicode character ranges in Ruby regex
(or Rails validations)?

--
def gw
acts_as_n00b
writes_at(www.railsdev.ws)
end
--
Posted via http://www.ruby-forum.com/.

From: Dale Martenson on 30 Nov 2007 16:05

On Nov 30, 2:18 pm, Greg Willits <li...(a)gregwillits.ws> wrote:

> So, what's the secret to using unicode character ranges in Ruby regex
> (or Rails validations)?

Tim Bray gave a great talk about I18N, M17N and Unicode at the 2006
Ruby Conference. His presentation can be found at:

http://www.tbray.org/talks/rubyconf2006.pdf

He described how many member functions have trouble dealing with these
character sets. He made special reference to regular expressions.

--Dale

From: Greg Willits on 30 Nov 2007 17:00

Dale Martenson wrote:
> On Nov 30, 2:18 pm, Greg Willits <li...(a)gregwillits.ws> wrote:
>
>> So, what's the secret to using unicode character ranges in Ruby regex
>> (or Rails validations)?
>
> Tim Bray gave a great talk about I18N, M17N and Unicode at the 2006
> Ruby Conference. His presentation can be found at:
>
> http://www.tbray.org/talks/rubyconf2006.pdf
>
> He described how many member functions have trouble dealing with these
> character sets. He made special reference to regular expressions.

That's just beyond sad.

I've been using Lasso for several years now, and since *2003* it has
provided complete support for Unicode. I know there are some esoteric
cases it may not deal with, but for all practical purposes we can
round-trip data in western and eastern languages with Lasso quite easily.

How can all these other languages be so far behind?

Pretty bad if I can't even allow Mr. Muños or Göran to enter their names
in a web form with proper server side validations. Aargh.

-- gw

From: MonkeeSage on 1 Dec 2007 00:24

On Nov 30, 4:00 pm, Greg Willits <li...(a)gregwillits.ws> wrote:
> That's just beyond sad.
>
> I've been using Lasso for several years now, and since *2003* it has
> provided complete support for Unicode. I know there are some esoteric
> cases it may not deal with, but for all practical purposes we can
> round-trip data in western and eastern languages with Lasso quite easily.
>
> How can all these other languages be so far behind?
>
> Pretty bad if I can't even allow Mr. Muños or Göran to enter their names
> in a web form with proper server side validations. Aargh.

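Ruby 1.8 doesn't have real Unicode support (1.9 is starting to get it).
In 1.8 every string is just a sequence of bytes, so a regex sees the
individual bytes of a UTF-8 character rather than the character itself: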

irb(main):001:0> 'aébvHögtåwHÅFuG'.scan(/./)
=> ["a", "\303", "\251", "b", "v", "H", "\303", "\266", "g", "t",
"\303", "\245", "w", "H", "\303", "\205", "F", "u", "G"]
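
To make the byte view above concrete, here is a small sketch that lists the UTF-8 bytes of each non-ASCII character in the test string, in the same octal notation irb prints. It assumes Ruby 1.9 or later (where strings carry an encoding), which goes beyond the 1.8 setup discussed in this thread; the helper name utf8_octal_bytes is made up for illustration.

```ruby
# Assumes Ruby 1.9 or later, where strings carry an encoding.
# \u escapes keep this file pure ASCII; the string is the thread's test string.
sample = "a\u00E9bvH\u00F6gt\u00E5wH\u00C5FuG"

# Hypothetical helper: pair each non-ASCII character with its UTF-8 bytes,
# written as octal escapes like the ones shown by irb in Ruby 1.8.
def utf8_octal_bytes(str)
  str.each_char.reject(&:ascii_only?).map do |ch|
    [ch, ch.bytes.map { |b| format("\\%o", b) }]
  end
end

utf8_octal_bytes(sample).each do |ch, octal|
  # Print the code point and its bytes, e.g. "U+00E9 -> \303 \251".
  puts "U+%04X -> %s" % [ch.ord, octal.join(" ")]
end
```

Each of the four accented characters comes out as two bytes, the first of which is always \303 (hex C3).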

So your character class matches only the first byte of each multi-byte
character (\303 in octal, \xC3 in hex) and skips the second byte, since
that byte falls below the ranges in the class. You probably want
something like...

reg = /[\xc0-\xd6\xd9-\xf6\xf9-\xff][\x80-\xbc]/
'aébvHögtåwHÅFuG'.scan(reg)

irb(main):006:0* reg = /[\xc0-\xd6\xd9-\xf6\xf9-\xff][\x80-\xbc]/
=> /[\xc0-\xd6\xd9-\xf6\xf9-\xff][\x80-\xbc]/
irb(main):007:0> 'aébvHögtåwHÅFuG'.scan(reg)
=> ["\303\251", "\303\266", "\303\245", "\303\205"]
irb(main):008:0> "å" == "\303\245"
=> true

P.S. I'm not entirely sure the range in the second character class is
right.
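
On the doubt about the second character class: in UTF-8 the second byte of a two-byte sequence can be anything from \x80 to \xBF, and every character from U+00C0 to U+00FF starts with the lead byte \xC3. A byte-level pattern for exactly the ranges in the original question could therefore look like the sketch below. It assumes Ruby 2.0 or later (for String#b; in 1.8 strings are already raw bytes), and the names LATIN1_LETTER_BYTES and latin1_letters are made up for illustration.

```ruby
# Assumes Ruby 2.0+ (String#b). The /n flag makes this a byte-oriented
# (ASCII-8BIT) pattern, close to how Ruby 1.8 treats every regex.
# U+00C0-U+00D6 -> \xC3 \x80-\x96
# U+00D9-U+00F6 -> \xC3 \x99-\xB6
# U+00F9-U+00FF -> \xC3 \xB9-\xBF
LATIN1_LETTER_BYTES = /\xC3[\x80-\x96\x99-\xB6\xB9-\xBF]/n

# Hypothetical helper: scan the raw bytes, then relabel each two-byte
# match as UTF-8 so it compares equal to ordinary string literals.
def latin1_letters(utf8_string)
  utf8_string.b.scan(LATIN1_LETTER_BYTES).map { |m| m.force_encoding("UTF-8") }
end

sample = "a\u00E9bvH\u00F6gt\u00E5wH\u00C5FuG"  # the thread's test string
found = latin1_letters(sample)

# Print the matched code points in ASCII-safe form.
puts found.map { |ch| "U+%04X" % ch.ord }.join(" ")
```

Pinning the lead byte to \xC3 keeps the pattern from matching two-byte sequences outside this block, which a lead-byte class such as [\xc0-\xff] would also accept.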

Regards,
Jordan

From: Jimmy Kofler on 1 Dec 2007 05:16

> Unicode in Regex
> Posted by Greg Willits (-gw-) on 30.11.2007 21:18
> This is mostly a Ruby thing, and partly a Rails thing.
>
> I'm expecting a validates_format_of with a regex like this
>
> /^[a-zA-Z\xC0-\xD6\xD9-\xF6\xF9-\xFF\.\'\-\ ]*?$/
>
> to allow many of the normal characters like ö é å to be submitted via
> web form.

How about the utf8 validation regex here:
http://snippets.dzone.com/posts/show/4527 ?
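
The linked snippet is not reproduced here. As a separate sketch, and assuming Ruby 1.9 or later rather than the 1.8 discussed in this thread, a check along these lines would cover both concerns: String#valid_encoding? rejects malformed UTF-8, and \u escapes express the original ranges as code points. This also bears on the \xFF question, since \x takes at most two hex digits and so names a single byte, while \u takes four. The names NAME_FORMAT and valid_name? are hypothetical.

```ruby
# Assumes Ruby 1.9+; NAME_FORMAT and valid_name? are hypothetical names.
# \A and \z anchor the whole string (^ and $ only anchor lines in Ruby).
NAME_FORMAT = /\A[a-zA-Z\u00C0-\u00D6\u00D9-\u00F6\u00F9-\u00FF.'\- ]*\z/

def valid_name?(input)
  # Reject non-UTF-8 or malformed input first; matching a UTF-8 regex
  # against a string with broken byte sequences would raise an error.
  return false unless input.encoding == Encoding::UTF_8 && input.valid_encoding?
  !!(input =~ NAME_FORMAT)
end

puts valid_name?("G\u00F6ran")                 # name with o-umlaut
puts valid_name?("Mu\u00F1os")                 # name with n-tilde
puts valid_name?("x\u00D7y")                   # multiplication sign, outside the ranges
puts valid_name?("\xC3".dup.force_encoding(Encoding::UTF_8))  # truncated sequence
```

In a Rails model the same pattern could presumably be passed to validates_format_of through its :with option, though that has not been tested against any particular Rails version.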
