From: Daniel DeLorme on
MonkeeSage wrote:
> I guess we were talking about different things then. I never meant to
> imply that the regexp engine can't match unicode characters

Since regular expressions are embedded in the very syntax of ruby, just
as arrays and hashes are, IMHO that qualifies as unicode support. So
yeah, it seems like we have a semantic disagreement. :-(

> I, like Charles (and I think most people), was referring to the
> ability to index into strings by characters, find their lengths in
> characters

That is certainly *one* way of supporting unicode but by no means the
only way. My belief is that you can do most string manipulations in a
way that obviates the need for char indexing & char length, if only you
change your mindset from "operating on individual characters" to
"operating on the string as a whole". And since regex are a specialized
language for string manipulation, they're also a lot faster. It's a
little like imperative vs functional programming; if I told you about a
programming language that has no variable assignments you might think
it's completely broken, and yet that's how functional languages work.
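To make that mindset concrete, here is a minimal sketch (modern Ruby
syntax, UTF-8 source; the string and variable names are just
illustrative) of getting character-level results from whole-string regex
operations, without ever indexing by character position:

```ruby
# A multibyte UTF-8 string ("café au lait"; \u00E9 is the two-byte 'é')
str = "caf\u00E9 au lait"

first = str[/\A./]          # first *character*, not first byte
count = str.scan(/./).size  # character count via a regex pass
words = str.scan(/\S+/)     # split on whitespace, multibyte chars intact
```

Each operation hands the whole string to the regex engine once, instead
of the program walking it character by character.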

> to compose and decompose composite characters, to
> normalize characters, convert them to other encodings like shift-jis,
> and other such things.

Converting encodings is a worthy goal but unrelated to unicode support.
As for character [de]composition that would be a very nice thing to have
if it was handled automatically (e.g. "a\314\200"=="\303\240") but if
the programmer has to worry about it then you might as well leave it to
a specialized library. Well, it's not like ruby lets us abstract away
composite characters either in 1.8 or 1.9... I never claimed unicode
support was 100%, just good enough for most needs.
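(A later note for readers: Ruby 2.2 eventually added
String#unicode_normalize, which handles exactly this [de]composition
case. A sketch using the same "à" example as above:)

```ruby
# "a\314\200" is 'a' + U+0300 combining grave accent (decomposed);
# "\303\240" is the precomposed U+00E0 'à'. Same character, different bytes.
decomposed = "a\u0300"
composed   = "\u00E0"

decomposed == composed                          # => false (byte comparison)
decomposed.unicode_normalize(:nfc) == composed  # => true after NFC
```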

> just a difference of opinion. I don't mind being wrong (happens a
> lot! ;) I just don't like being accused of spreading FUD about ruby,
> which to my mind implies malice aforethought rather than simply a
> mistake.

Yes, that was too harsh on my part. My apologies.

Daniel


From: Daniel DeLorme on
Daniel DeLorme wrote:
> Heavy compared to what? Once compiled, regex are orders of magnitude
> faster than jumping in and out of ruby interpreted code.

Sorry to beat a dead horse, but I just did an interesting little
experiment with 1.9:

>> str = "abcde"*1000
>> str.encoding
=> #<Encoding:ASCII-8BIT>
>> Benchmark.measure{10.times{ 1000.times{|i|str[i]} }}.real
=> 0.010282039642334
>> str.force_encoding 'utf-8'
>> Benchmark.measure{10.times{ 1000.times{|i|str[i]} }}.real
=> 1.29934501647949
>> arr = str.scan(/./u)
>> Benchmark.measure{10.times{ 1000.times{|i|arr[i]} }}.real
=> 0.00343608856201172

indexing into UTF-8 strings is *expensive*
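The last case above also suggests the workaround: when repeated
character access is unavoidable, pay the O(n) scan once and index into
the resulting array. A sketch of the same idea using String#chars (which
returns an array in Ruby 2.0+; the benchmark used scan(/./u)):

```ruby
str = "abcde" * 1000
str.force_encoding("UTF-8")

chars = str.chars  # one linear pass over the bytes
last  = chars[4_999]  # array indexing is O(1) per access from here on
```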

Daniel

From: Charles Oliver Nutter on
Daniel DeLorme wrote:
> Daniel DeLorme wrote:
>> Heavy compared to what? Once compiled, regex are orders of magnitude
>> faster than jumping in and out of ruby interpreted code.
>
> Sorry to beat a dead horse, but I just did an interesting little
> experiment with 1.9:
>
> >> str = "abcde"*1000
> >> str.encoding
> => #<Encoding:ASCII-8BIT>
> >> Benchmark.measure{10.times{ 1000.times{|i|str[i]} }}.real
> => 0.010282039642334
> >> str.force_encoding 'utf-8'
> >> Benchmark.measure{10.times{ 1000.times{|i|str[i]} }}.real
> => 1.29934501647949
> >> arr = str.scan(/./u)
> >> Benchmark.measure{10.times{ 1000.times{|i|arr[i]} }}.real
> => 0.00343608856201172
>
> indexing into UTF-8 strings is *expensive*

...but correct. I'd rather have correct than broken.

- Charlie

From: MonkeeSage on
On Dec 4, 7:58 pm, Daniel DeLorme <dan...(a)dan42.com> wrote:
> MonkeeSage wrote:
> > I guess we were talking about different things then. I never meant to
> > imply that the regexp engine can't match unicode characters
>
> Since regular expressions are embedded in the very syntax of ruby, just
> as arrays and hashes are, IMHO that qualifies as unicode support. So
> yeah, it seems like we have a semantic disagreement. :-(
>
> > I, like Charles (and I think most people), was referring to the
> > ability to index into strings by characters, find their lengths in
> > characters
>
> That is certainly *one* way of supporting unicode but by no means the
> only way. My belief is that you can do most string manipulations in a
> way that obviates the need for char indexing & char length, if only you
> change your mindset from "operating on individual characters" to
> "operating on the string as a whole". And since regex are a specialized
> language for string manipulation, they're also a lot faster. It's a
> little like imperative vs functional programming; if I told you about a
> programming language that has no variable assignments you might think
> it's completely broken, and yet that's how functional languages work.

I think we'll just have to agree to disagree. But there is one
point...

main = do
  let i_like = "I like "
  putStrLn $ i_like ++ haskell
  where haskell = "a functional language"

;)

> > to compose and decompose composite characters, to
> > normalize characters, convert them to other encodings like shift-jis,
> > and other such things.
>
> Converting encodings is a worthy goal but unrelated to unicode support.
> As for character [de]composition that would be a very nice thing to have
> if it was handled automatically (e.g. "a\314\200"=="\303\240") but if
> the programmer has to worry about it then you might as well leave it to
> a specialized library. Well, it's not like ruby lets us abstract away
> composite characters either in 1.8 or 1.9... I never claimed unicode
> support was 100%, just good enough for most needs.
>
> > just a difference of opinion. I don't mind being wrong (happens a
> > lot! ;) I just don't like being accused of spreading FUD about ruby,
> > which to my mind implies malice aforethought rather than simply a
> > mistake.
>
> Yes, that was too harsh on my part. My apologies.

No worries. :) I apologize as well for responding by saying you were
lying about unicode support; I see that we just have a difference of
opinion and were talking past each other.

> Daniel

Regards,
Jordan
From: marc on
Daniel DeLorme said...
> MonkeeSage wrote:
> > Ruby 1.8 doesn't have unicode support (1.9 is starting to get it).
>
> It enrages me to see this kind of FUD. Through regular expressions, ruby
> 1.8 has 80-90% complete utf8 support. And oniguruma makes utf8 support
> well-nigh 100% complete.
>
>
> > Everything in ruby is a bytestring.
>
> YES! And that's exactly how it should be. Who is it that spread the
> flawed idea that strings are fundamentally made of characters?

Are you being ironic?

> I'd like
> to slap him around a little. Fundamentally, ever since the word "string"
> was applied to computing, strings were made of 8-BIT CHARS, not n-bit
> characters. If only the creators of C has called that datatype "byte"
> instead of "char" it would have saved us so many misunderstandings.

And look at the trouble we're having ditching the waterfall method, all
because someone misread a paper in the 1700s or thereabouts.

You might want to spar with Tim Bray from Sun who presented at RubyConf
2006, where his slides state:

"99.99999% of the time, programmers want to deal with characters not
bytes. I know of one exception: running a state machine on UTF8-encoded
text. This is done by the Expat XML parser."

"In 2006, programmers around the world expect that, in modern languages,
strings are Unicode and string APIs provide Unicode semantics correctly
& efficiently, by default. Otherwise, they perceive this as an offense
against their language and their culture. Humanities/computing academics
often need to work outside Unicode. Few others do."

He reviews his chat here:

http://www.tbray.org/ongoing/When/200x/2006/10/22/Unicode-and-Ruby

and the slides are here:

http://www.tbray.org/talks/rubyconf2006.pdf

--
Cheers,
Marc