How do I set the encoding on a regexp? [Ruby]

From: Perry Smith on 23 Feb 2010 12:05

Title pretty much says it all. Here is a small sample program:

#!/usr/bin/env ruby
# -*- coding: utf-8 -*-

s = "string"
puts s.encoding
r = Regexp.new(s)
puts r.encoding

Here is the output:

UTF-8
US-ASCII

I was expecting both to be set to UTF-8. There is no force_encoding
method for Regexp.

If I later try to use it on strings of type UTF-8, it can throw an
exception.

How is this supposed to be handled?

Thanks,
Perry
--
Posted via http://www.ruby-forum.com/.

From: Roger Pack on 24 Feb 2010 12:56

> I was expecting both to be set to UTF-8. There is no force_encoding
> method for Regexp.
>
> If I later try to use it on strings of type UTF-8, it can throw an
> exception.

Do you have an example of this? It might be a bug.

I did notice that

Regexp.new("Café").encoding

keeps it in UTF-8

so maybe it's optimizing it and when it doesn't "have to be" UTF-8 it is
leaving it as ASCII?
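
A minimal sketch of that behaviour, assuming Ruby 1.9 or later. The variable names are made up for illustration, and the "\u00e9" escape is used only so the string is UTF-8 regardless of the source file's encoding. It compares a regexp built from an ASCII-only string with one built from a non-ASCII string, and shows that the US-ASCII one still matches UTF-8 text:

```ruby
# Sketch: how Regexp.new appears to pick an encoding from its source string.
ascii_source     = "string".encode(Encoding::UTF_8)  # ASCII-only characters, UTF-8 label
non_ascii_source = "Caf\u00e9"                       # contains a non-ASCII character

ascii_re     = Regexp.new(ascii_source)
non_ascii_re = Regexp.new(non_ascii_source)

puts ascii_re.encoding             # US-ASCII
puts ascii_re.fixed_encoding?      # false: it adapts to the string it is matched against
puts non_ascii_re.encoding         # UTF-8
puts non_ascii_re.fixed_encoding?  # true: pinned to UTF-8

# The US-ASCII regexp is not fixed, so matching UTF-8 text works.
utf8_text = "a string in a Caf\u00e9"
puts(ascii_re =~ utf8_text)        # 2
puts(non_ascii_re =~ utf8_text)    # 14
```

Regexp#fixed_encoding? is the relevant check here: when it returns false, the reported US-ASCII encoding does not by itself prevent matches against UTF-8 strings.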

-r

From: Perry Smith on 24 Feb 2010 13:24

Roger Pack wrote:
>
>> I was expecting both to be set to UTF-8. There is no force_encoding
>> method for Regexp.
>>
>> If I later try to use it on strings of type UTF-8, it can throw an
>> exception.
>
> Do you have an example of this? It might be a bug.
>
> I did notice that
>
> Regexp.new("Café").encoding
>
> keeps it in UTF-8
>
> so maybe it's optimizing it and when it doesn't "have to be" UTF-8 it is
> leaving it as ASCII?

I'm not clear what you mean by an example other than what I put in the
original note.

I think I'm going to open a bug report -- it might not be a bug but I
sure am confused. The "Pick Axe" book describes a third argument but I
can't get that to work either. "ri" for Ruby 1.9.1 does not describe
the third argument at all -- but it does seem to exist at least.

It appears, as you pointed out, that if the input string happens to be
ASCII, then the regexp encoding is US-ASCII, and there doesn't seem to be
anything you can do about it.
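
One option that may help, offered as a sketch rather than a confirmed answer for 1.9.1: later Ruby versions define a Regexp::FIXEDENCODING option flag, and passing it as the second argument keeps the source string's encoding even when the pattern is ASCII-only. Whether the constant exists on 1.9.1 p243 is an assumption not verified here, and the variable names are illustrative only.

```ruby
# Sketch: pin an ASCII-only pattern to its source string's encoding.
# Assumes a Ruby where Regexp::FIXEDENCODING is defined.
source = "string".encode(Encoding::UTF_8)

plain = Regexp.new(source)
fixed = Regexp.new(source, Regexp::FIXEDENCODING)

puts plain.encoding         # US-ASCII
puts fixed.encoding         # UTF-8
puts fixed.fixed_encoding?  # true

# The trade-off: a fixed-encoding regexp refuses non-ASCII text in another encoding.
euc_text = "\xA1\xA2".dup.force_encoding(Encoding::EUC_JP)
begin
  fixed =~ euc_text
rescue Encoding::CompatibilityError => e
  puts "incompatible: #{e.class}"
end
```

Pinning the encoding is what makes the exception possible, so the default (a non-fixed US-ASCII regexp) is the more permissive choice when the pattern itself is plain ASCII.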

I'm testing on 1.9.1 p243.

But, due to another discussion thread, I think I want to be in 8-bit
binary anyway in my case. I'm not 100% positive my input is UTF-8. It's
supposed to be, but I can't really trust it.
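
A rough sketch of the binary approach, assuming Ruby 1.9 or later. The sample bytes and variable names are invented for illustration: a Latin-1 byte that is not valid UTF-8 stands in for the untrusted input. Matching against the string while it is labelled UTF-8 raises, whereas the same bytes relabelled as binary match fine:

```ruby
# Sketch: matching untrusted bytes by treating them as binary.
pattern = Regexp.new("string")

# Latin-1 bytes mislabelled as UTF-8: a lone \xE9 is not a valid UTF-8 sequence.
suspect = "caf\xE9 string".dup.force_encoding(Encoding::UTF_8)
puts suspect.valid_encoding?            # false

begin
  pattern =~ suspect
rescue ArgumentError => e
  puts "UTF-8 match failed: #{e.class}" # invalid byte sequence
end

# Relabel the same bytes as binary; no bytes are changed, only the encoding tag.
binary = suspect.dup.force_encoding(Encoding::BINARY)
puts binary.valid_encoding?             # true
puts(pattern =~ binary)                 # 5 (a byte offset, since each byte is one character)
```

Note that offsets and captures from a binary match are in bytes, so they only line up with character positions when the input turns out to be plain ASCII.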

Thanks
Perry

From: Robert Gleeson on 24 Feb 2010 13:36

Typo fix:
> Regexp.new(/foo/u).encoding # => UTF-8


From: Roger Pack on 24 Feb 2010 13:58

>>> If I later try to use it on strings of type UTF-8, it can throw an
>>> exception.
> I'm not clear what you mean by an example other than what I put in the
> original note.

Do you have a small example (like your original) that actually throws
an exception when you "use it on strings later of type UTF-8"?

-r
