From: Daniel DeLorme on
marc wrote:
> Daniel DeLorme said...
>> MonkeeSage wrote:
>>> Everything in ruby is a bytestring.
>> YES! And that's exactly how it should be. Who is it that spread the
>> flawed idea that strings are fundamentally made of characters?
>
> Are you being ironic?

Not at all. By "fundamentally" I mean the fundamental, lowest level of
representation. If strings were fundamentally made of characters then we
wouldn't be able to access individual bytes because that's a lower level
than the fundamental level, which is by definition impossible.

If you are using UCS2 it makes sense to consider strings as arrays of
characters because that's what they are. But UTF8 strings do not follow
the characteristics of arrays at all. Each access into the "array" is
O(n) rather than O(1). So IMHO treating it as an array of characters is
a *very* leaky abstraction.
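
For example, in 1.8 integer indexing hands you raw bytes, and getting
the nth character means scanning from the start (a small illustration;
the byte value assumes a utf-8 source):

str = "日本語"        # 3 characters, 9 bytes in utf-8
str[1]                # => 151, a raw byte (Fixnum), not a character
str.scan(/./u)[1]     # => "本", found only by walking the string: O(n)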

I agree that 99.9% of the time you want to deal with characters, and I
believe that in 99% of those cases you would be better served with regex
than this pretend "array" disguise.
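
For instance (assuming $KCODE=u), the usual character-oriented
operations fall out of regexes directly:

str = "日本語abc"
str[/\A./u]           # => "日", the first character, no index needed
str.scan(/./u).size   # => 6, the character count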

Daniel

From: MonkeeSage on
On Dec 5, 6:15 pm, Daniel DeLorme <dan...(a)dan42.com> wrote:
> marc wrote:
> > Daniel DeLorme said...
> >> MonkeeSage wrote:
> >>> Everything in ruby is a bytestring.
> >> YES! And that's exactly how it should be. Who is it that spread the
> >> flawed idea that strings are fundamentally made of characters?
>
> > Are you being ironic?
>
> Not at all. By "fundamentally" I mean the fundamental, lowest level of
> representation. If strings were fundamentally made of characters then we
> wouldn't be able to access individual bytes because that's a lower level
> than the fundamental level, which is by definition impossible.
>
> If you are using UCS2 it makes sense to consider strings as arrays of
> characters because that's what they are. But UTF8 strings do not follow
> the characteristics of arrays at all. Each access into the "array" is
> O(n) rather than O(1). So IMHO treating it as an array of characters is
> a *very* leaky abstraction.
>
> I agree that 99.9% of the time you want to deal with characters, and I
> believe that in 99% of those cases you would be better served with regex
> than this pretend "array" disguise.
>
> Daniel

Here is a micro-benchmark on three common string operations (split,
index, length), using bytestrings and unicode regexps versus native
utf-8 strings in 1.9.0 (release).


$ ruby19 -v
ruby 1.9.0 (2007-10-15 patchlevel 0) [i686-linux]

$ echo && cat bench.rb
#!/usr/bin/ruby19
# -*- coding: ascii -*-

require "benchmark"
require "test/unit/assertions"
include Test::Unit::Assertions

$KCODE = "u"  # no effect in 1.9; left over from the 1.8 version

$target = "!日本語!" * 100
$unichr = "本".force_encoding('utf-8')
$regchr = /[本]/u

def uni_split
  $target.split($unichr)
end
def reg_split
  $target.split($regchr)
end

def uni_index
  $target.index($unichr)
end
def reg_index
  $target =~ $regchr
end

def uni_chars
  $target.length
end
def reg_chars
  $target.unpack("U*").length
  # this is *a lot* slower:
  # $target.scan(/./u).length
end

# sanity checks: the bytestring/regexp versions must agree with the
# native utf-8 versions before we time them
$target.force_encoding("ascii")
a = reg_split
$target.force_encoding("utf-8")
b = uni_split
assert_equal(a.length, b.length)

$target.force_encoding("ascii")
a = reg_index
$target.force_encoding("utf-8")
b = uni_index
assert_equal(a - 2, b)  # a is a byte offset, b a character offset

$target.force_encoding("ascii")
a = reg_chars
$target.force_encoding("utf-8")
b = uni_chars
assert_equal(a, b)

n = 10_000
Benchmark.bm(12) { |x|
  $target.force_encoding("ascii")
  x.report("reg_split") { n.times { reg_split } }
  $target.force_encoding("utf-8")
  x.report("uni_split") { n.times { uni_split } }
  puts
  $target.force_encoding("ascii")
  x.report("reg_index") { n.times { reg_index } }
  $target.force_encoding("utf-8")
  x.report("uni_index") { n.times { uni_index } }
  puts
  $target.force_encoding("ascii")
  x.report("reg_chars") { n.times { reg_chars } }
  $target.force_encoding("utf-8")
  x.report("uni_chars") { n.times { uni_chars } }
}

====

With caches initialized and 5 prior warm-up runs, I got these numbers:

$ ruby19 bench.rb
user system total real
reg_split 2.550000 0.010000 2.560000 ( 2.799292)
uni_split 1.820000 0.020000 1.840000 ( 2.026265)

reg_index 0.040000 0.000000 0.040000 ( 0.097672)
uni_index 0.150000 0.000000 0.150000 ( 0.202700)

reg_chars 0.790000 0.010000 0.800000 ( 0.919995)
uni_chars 0.130000 0.000000 0.130000 ( 0.193307)

====

So String#=~ with a bytestring and unicode regexp takes roughly half
the time of String#index on a native utf-8 string. In the other two
cases, the opposite is true.

P.S. In case there is any confusion, bytestrings aren't going away;
as you see above, you can specify a magic encoding comment to ensure
that you have bytestrings by default. You can also explicitly decode
from utf-8 back to ascii, and you can get a byte enumerator (or an
array, by calling to_a on the enumerator) from String#bytes, and an
iterator from #each_byte, regardless of the encoding.
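
For example (byte values assuming a utf-8 source):

"日本".bytes.to_a                       # => [230, 151, 165, 230, 156, 172]
"日本".each_byte { |b| print b, " " }   # same six bytes, one at a time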

Regards,
Jordan
From: Daniel DeLorme on
MonkeeSage wrote:
> Here is a micro-benchmark on three common string operations (split,
> index, length), using bytestrings and unicode regexps versus native
> utf-8 strings in 1.9.0 (release).

That's nice, but split and index do not operate using integer indexing
into the string, so they are rather irrelevant to the topic at hand.
They produce the same results in ruby1.8, i.e. uni_split==reg_split and
uni_index==reg_index.
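
For example, in 1.8 (with $KCODE=u) the string and regexp versions
return the same byte offsets:

"!日本語!".index("本")      # => 4 (a byte offset in 1.8)
"!日本語!".index(/[本]/u)   # => 4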

I also stated that the point of regex manipulation is to *obviate* the
need for methods like index and length. So a more accurate benchmark
might be something like:
reg_chars N/A N/A N/A ( N/A )
uni_chars 0.130000 0.000000 0.130000 ( 0.193307)
;-)

> P.S. In case there is any confusion, bytestrings aren't going
> away; as you see above, you can specify a magic encoding comment to
> ensure that you have bytestrings by default.

Yes, it's still possible to access bytes but it's not possible to run a
utf8 regex on a bytestring if it contains extended characters:

$ ruby1.9 -ve '"abc" =~ /b/u'
ruby 1.9.0 (2007-12-03 patchlevel 0) [i686-linux]
$ ruby1.9 -ve '"日本語" =~ /本/u'
ruby 1.9.0 (2007-12-03 patchlevel 0) [i686-linux]
-e:1:in `<main>': character encodings differ (ArgumentError)

And that kinda kills my whole approach.

Daniel

From: MonkeeSage on
On Dec 5, 8:31 pm, Daniel DeLorme <dan...(a)dan42.com> wrote:
> MonkeeSage wrote:
> > Here is a micro-benchmark on three common string operations (split,
> > index, length), using bytestrings and unicode regexps versus native
> > utf-8 strings in 1.9.0 (release).
>
> That's nice, but split and index do not operate using integer indexing
> into the string, so they are rather irrelevant to the topic at hand.

Heh, if the topic at hand is only that indexing into a string is
slower with native utf-8 strings (don't disagree), then I guess it's
irrelevant. ;) Regarding the idea that you can do everything just as
efficiently with regexps as you can with native utf-8
encoding...it seems relevant. In other words, it shows a general
class of behavior that benefits from a native implementation (the
same reason we use native hashes rather than building our own
implementations out of arrays of pairs).

> They produce the same results in ruby1.8, i.e. uni_split==reg_split and
> uni_index==reg_index.

Yes. My point was to show how a native implementation of unicode
strings affects performance compared to using regular expressions on
bytestrings. The behavior should be the same (hence the asserts).

> I also stated that the point of regex manipulation is to *obviate* the
> need for methods like index and length. So a more accurate benchmark
> might be something like:
> reg_chars N/A N/A N/A ( N/A )
> uni_chars 0.130000 0.000000 0.130000 ( 0.193307)
> ;-)

Someone just posted a question today about how to printf("%20s ...",
a, ...) when "a" contains unicode (it screws up the alignment, since
printf only counts byte width, not character width). There is no
*elegant* solution in 1.8, regexps or otherwise. There are hackish
solutions (I provided one in that thread), but the need is still
there. Another example is GtkTextView widgets from ruby-gtk2. They
deal with utf-8 in their C backend, so all the cursor functions that
deal with characters mean utf-8 characters, not bytestrings. Without
kludges, stuff doesn't always work right.
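
To illustrate the kind of hackish workaround I mean (a minimal
sketch, not the one from that thread; upad is a made-up helper), you
can pad by character count instead of byte count:

def upad(str, width)
  # count characters, not bytes, in a utf-8 string
  chars = str.unpack("U*").length
  str + (" " * [width - chars, 0].max)
end

puts upad("日本語", 20) + "|"  # still assumes one column per character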

> > P.S. In case there is any confusion, bytestrings aren't going
> > away; as you see above, you can specify a magic encoding comment to
> > ensure that you have bytestrings by default.
>
> Yes, it's still possible to access bytes but it's not possible to run a
> utf8 regex on a bytestring if it contains extended characters:
>
> $ ruby1.9 -ve '"abc" =~ /b/u'
> ruby 1.9.0 (2007-12-03 patchlevel 0) [i686-linux]
> $ ruby1.9 -ve '"日本語" =~ /本/u'
> ruby 1.9.0 (2007-12-03 patchlevel 0) [i686-linux]
> -e:1:in `<main>': character encodings differ (ArgumentError)
>
> And that kinda kills my whole approach.

You can't mix encodings (not just in regexps, not anywhere). You'd
have to use a command-line switch (proposed, but not implemented in
the 1.9.0 release) to set your default encoding to ascii (or
whatever), or else use a magic comment [1] like I did above. That, or
explicitly put both objects in the same encoding.
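
For example, with a utf-8 magic comment both sides come out tagged
utf-8 and Daniel's failing case works (a minimal sketch):

# -*- coding: utf-8 -*-
"日本語" =~ /本/u   # => 1 (a character offset), no ArgumentError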

> Daniel

Regards,
Jordan

[1] http://www.ruby-forum.com/topic/127831
From: Daniel DeLorme on
MonkeeSage wrote:
> Heh, if the topic at hand is only that indexing into a string is
> slower with native utf-8 strings (don't disagree), then I guess it's
> irrelevant. ;) Regarding the idea that you can do everything just as
> efficiently with regexps as you can with native utf-8
> encoding...it seems relevant.

How so? These methods work just as well in ruby1.8, which does *not*
have native utf8 encoding embedded in the strings. Of course, comparing
a string with a string is more efficient than comparing a string with a
regexp, but that holds regardless of whether the string has "native"
utf8 encoding or not:

$ ruby1.8 -rbenchmark -KU
puts Benchmark.measure{100000.times{ "日本語".index("本") }}.real
puts Benchmark.measure{100000.times{ "日本語".index(/[本]/) }}.real
puts Benchmark.measure{100000.times{ "日本語".index(/[本]/u) }}.real
^D
0.225839138031006
0.304145097732544
0.313494920730591

$ ruby1.9 -rbenchmark -KU
puts Benchmark.measure{100000.times{ "日本語".index("本") }}.real
puts Benchmark.measure{100000.times{ "日本語".index(/[本]/) }}.real
puts Benchmark.measure{100000.times{ "日本語".index(/[本]/u) }}.real
^D
0.183344841003418
0.255104064941406
0.263553857803345

1.9 is more performant (one would hope so!) but the performance ratio
between string comparison and regex comparison does not seem affected by
the encoding at all.

> Someone just posted a question today about how to printf("%20s ...",
> a, ...) when "a" contains unicode (it screws up the alignment since
> printf only counts byte width, not character width). There is no
> *elegant* solution in 1.8., regexps or otherwise.

It's not perfect in 1.9 either. "%20s" % "日本語" results in a string of
20 characters... that uses 23 columns of terminal space because the font
for Japanese uses double width. In other words, neither bytes nor
characters have an intrinsic "width" :-/

Daniel