Unicode in Regex [Ruby]


From: MonkeeSage on 6 Dec 2007 02:02

On Dec 5, 11:29 pm, Daniel DeLorme <dan...(a)dan42.com> wrote:
> MonkeeSage wrote:
> > Heh, if the topic at hand is only that indexing into a string is
> > slower with native utf-8 strings (don't disagree), then I guess it's
> > irrelevant. ;) Regarding the idea that you can do everything just as
> > efficiently with regexps that you can do with native utf-8
> > encoding...it seems relevant.
>
> How so? These methods work just as well in ruby1.8 which does *not* have
> native utf8 encoding embedded in the strings. Of course, comparing a
> string with a string is more efficient than comparing a string with a
> regexp, but that is independent of whether the string has "native" utf8
> encoding or not:
>
> $ ruby1.8 -rbenchmark -KU
> puts Benchmark.measure{100000.times{ "日本語".index("本") }}.real
> puts Benchmark.measure{100000.times{ "日本語".index(/[本]/) }}.real
> puts Benchmark.measure{100000.times{ "日本語".index(/[本]/u) }}.real
> ^D
> 0.225839138031006
> 0.304145097732544
> 0.313494920730591
>
> $ ruby1.9 -rbenchmark -KU
> puts Benchmark.measure{100000.times{ "日本語".index("本") }}.real
> puts Benchmark.measure{100000.times{ "日本語".index(/[本]/) }}.real
> puts Benchmark.measure{100000.times{ "日本語".index(/[本]/u) }}.real
> ^D
> 0.183344841003418
> 0.255104064941406
> 0.263553857803345
>
> 1.9 is more performant (one would hope so!) but the performance ratio
> between string comparison and regex comparison does not seem affected by
> the encoding at all.

Ok, I wasn't being clear. What I was trying to say is, yes the methods
perform the same on bytestrings -- whether using regex or standard
string operations. The problem is in their behavior, not performance
considered in the abstract. In 1.9, using ascii default encoding, this
bytestring acts just like 1.8:

"日本語".index("本") #=> 3

That's fine! Faster than a regexp, no problems. That is, unless I want
to know the character offset of the match rather than the byte offset
(for whatever reason -- say, work-necessitated interoperability with
some software that requires it). With bytestrings I'd have to do
something hackish and likely fragile for that. It's possible, but not
desirable. Being able to do it natively is faster, and ruby already
does all the work for you:

"日本語".force_encoding("utf-8").index("本".force_encoding("utf-8")) #=> 1

It's obviously not nicer to type! But that's only because I'm using
ascii default encoding. There is, as I understand it, going to be a
way to specify default encoding from the command-line, and probably
from within ruby, rather than just the magic comments and
String#force_encoding; so this extra typing is incidental and will go
away. Actually, it goes away right now if you use utf-8 default and
use the byte api to get at the underlying bytestrings.
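
For anyone following along, here is a minimal sketch of the byte-offset versus character-offset difference described above. It assumes a ruby where "\u" escapes produce UTF-8 strings and where String#b (a binary copy) is available; the variable names are only for illustration.

```ruby
text   = "\u65E5\u672C\u8A9E"   # "日本語", UTF-8 encoded
needle = "\u672C"               # "本"

# With an encoding-aware string, index counts characters.
char_index = text.index(needle)

# On binary copies of the same data, index counts bytes.
byte_index = text.b.index(needle.b)

# A character offset can be turned into a byte offset by measuring
# the prefix that precedes the match.
prefix_bytes = text[0, char_index].bytesize

puts char_index    # 1
puts byte_index    # 3
puts prefix_bytes  # 3
```

Each of the three characters takes three bytes in UTF-8, which is why the byte offset of the second character is 3.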

> > Someone just posted a question today about how to printf("%20s ...",
> > a, ...) when "a" contains unicode (it screws up the alignment since
> > printf only counts byte width, not character width). There is no
> > *elegant* solution in 1.8., regexps or otherwise.
>
> It's not perfect in 1.9 either. "%20s" % "日本語" results in a string of
> 20 characters... that uses 23 columns of terminal space because the font
> for Japanese uses double width. In other words neither bytes nor
> characters have an intrinsic "width" :-/
>
> Daniel

It works as expected in 1.9; you just have to set the right encoding:

printf("%20s\n".force_encoding("utf-8"),
"ni\xc3\xb1o".force_encoding("utf-8"))
#=>                 niño

printf("%20s\n", "nino")
#=> niño
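
To make Daniel's point about double-width glyphs concrete, here is a rough sketch of padding by terminal columns instead of by characters. The helpers display_width and rjust_display are hypothetical (ruby has no built-in for this), and treating Han and kana characters as two columns wide is only an approximation of the Unicode East Asian Width property.

```ruby
# Assumption: these scripts render two columns wide in the terminal font.
WIDE = /[\p{Han}\p{Hiragana}\p{Katakana}]/

# Hypothetical helper: approximate number of terminal columns used.
def display_width(str)
  str.each_char.sum { |ch| WIDE.match?(ch) ? 2 : 1 }
end

# Hypothetical helper: right-justify to a number of columns.
def rjust_display(str, columns)
  padding = [columns - display_width(str), 0].max
  " " * padding + str
end

nihongo    = "\u65E5\u672C\u8A9E"         # "日本語"
by_chars   = "%20s" % nihongo             # padded to 20 characters
by_columns = rjust_display(nihongo, 20)   # padded to 20 columns

puts by_chars.length            # 20
puts display_width(by_chars)    # 23
puts display_width(by_columns)  # 20
```

The 23 matches the figure quoted above: 17 spaces plus three double-width characters.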

In any case, I just don't think there is any reason to dislike the new
string api. It adds another tool to the toolbox. It doesn't make sense
to use it always, everywhere (like trying to make data that naturally
has the shape of an array, fit into a hash); but I see no reason to
try and cobble it together ourselves either (like building a hash api
from arrays ourselves). And with that, I'm going to sleep. Have to
think more on it tomorrow.

Peace,
Jordan

From: Jimmy Kofler on 7 Dec 2007 05:06

> Re: Unicode in Regex
> Posted by Jordan Callicoat (monkeesage) on 03.12.2007 02:50
>
> This seems to work...
>
> $KCODE = "UTF8"
> p /^[a-zA-Z\xC0-\xD6\xD9-\xF6\xF9-\xFF\.\'\-\ ]*?/u =~ "J�sp...it works"
> # => 0
> ...
> However, it looks to me like it would be more robust to use a slightly
> modified version of UTF8REGEX (found in the link Jimmy posted
> above)...
>
> UTF8REGEX = /\A(?:
> [a-zA-Z\.\-\'\ ]
> | [\xC2-\xDF][\x80-\xBF]
> | \xE0[\xA0-\xBF][\x80-\xBF]
> | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
> | \xED[\x80-\x9F][\x80-\xBF]
> | \xF0[\x90-\xBF][\x80-\xBF]{2}
> | [\xF1-\xF3][\x80-\xBF]{3}
> | \xF4[\x80-\x8F][\x80-\xBF]{2}
> )*\z/mnx

Just to avoid confusion over the meaning of 'UTF8' in UTF8REGEX: the n
option sets the encoding of UTF8REGEX to none!
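
A small sketch of what that means in practice on an encoding-aware ruby: a byte-level pattern compiled with the n option is matched against a binary copy of the string, and the result can be compared with String#valid_encoding?. The names UTF8_BYTES and utf8_bytes? are made up for this example, and the ASCII branch is widened to [\x00-\x7F] so the pattern checks well-formedness in general rather than the restricted character set quoted above.

```ruby
# Byte-level pattern for well-formed UTF-8. The n option makes it a
# binary regexp, so it is matched against bytes, not characters.
UTF8_BYTES = /\A(?:
    [\x00-\x7F]
  | [\xC2-\xDF][\x80-\xBF]
  | \xE0[\xA0-\xBF][\x80-\xBF]
  | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
  | \xED[\x80-\x9F][\x80-\xBF]
  | \xF0[\x90-\xBF][\x80-\xBF]{2}
  | [\xF1-\xF3][\x80-\xBF]{3}
  | \xF4[\x80-\x8F][\x80-\xBF]{2}
)*\z/nx

# Hypothetical helper: true if the bytes of str form well-formed UTF-8.
def utf8_bytes?(str)
  !(UTF8_BYTES =~ str.b).nil?
end

samples = [
  "ni\u00F1o",      # valid two-byte sequence
  "\xC3",           # truncated sequence
  "\xC0\x80",       # overlong encoding
  "\xED\xA0\x80"    # UTF-16 surrogate
]

samples.each do |s|
  builtin = s.dup.force_encoding(Encoding::UTF_8).valid_encoding?
  puts "#{s.bytes.inspect}: regexp=#{utf8_bytes?(s)} builtin=#{builtin}"
end
```

On these samples the hand-written pattern and the built-in check should agree, which suggests the built-in is the simpler choice when it is available.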

Cheers,

j. k.
--
Posted via http://www.ruby-forum.com/.
