From: David Springer on
Perry Smith wrote:

> r = Regexp.new(s)

Try this:

r = Regexp.new(s,16)

-David

--
Posted via http://www.ruby-forum.com/.

From: Perry Smith on
Roger Pack wrote:
>>>> If I later try to use it on strings of type UTF-8, it can throw an
>>>> exception.
>> I'm not clear what you mean by an example other than what I put in the
>> original note.
>
>
> Do you have a small example (like your original) that throws an
> exception where you "use it on strings later of type UTF-8" and it
> throws an exception?

No, I don't. I *think* that I might have had a string that was not
valid utf-8. I was fetching strings from a file and just doing a
force_encoding because they were supposed to be utf-8, but maybe they
were not.

I'm not sure. Let me see if I can make an example. My trivial examples
so far don't throw an exception.
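
For anyone following along, here is a rough sketch of the situation I suspect I was in. The bytes are made up for illustration (they are not from my real file), and the guess that the data was really Latin-1 is only an assumption. force_encoding just relabels the bytes without checking them, so valid_encoding? is the way to find out whether the label is true:

```ruby
# Hypothetical bytes as they might come out of a file: 0xE9 is "é" in
# Latin-1, but on its own it is not a valid UTF-8 sequence.
raw = "caf\xE9".b

# force_encoding only changes the label; no bytes are converted or checked.
s = raw.dup.force_encoding("UTF-8")
puts s.encoding          # the string now claims to be UTF-8
puts s.valid_encoding?   # but the bytes do not form valid UTF-8

# If the data is really Latin-1 (an assumption here), label it as such
# and transcode, which yields a genuinely valid UTF-8 string.
fixed = raw.dup.force_encoding("ISO-8859-1").encode("UTF-8")
puts fixed.valid_encoding?
```

Checking valid_encoding? right after the force_encoding would at least tell me whether the file contents are what I assumed before any regexp gets involved.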

From: Brian Candler on
Perry Smith wrote:
> I think I'm going to open a bug report -- it might not be a bug but I
> sure am confused.

It's not a bug(*), and it sure is confusing. My own attempt to document
Ruby 1.9's encoding rules, which is woefully incomplete but covers about
200 different cases, is at
http://github.com/candlerb/string19/blob/master/string19.rb

What you've observed is described in section 3.3.

Basically, a Regexp which contains only ASCII characters is given an
encoding of US-ASCII regardless of the original string's encoding (this
is different to Strings, which might have an encoding of say UTF-8 but
have the ascii_only? property true if they contain only ASCII
characters).
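
A minimal sketch of that difference, with variable names chosen only for illustration. The explicit encode call is there so the example does not depend on the source file's encoding:

```ruby
s = "string".encode("UTF-8")   # ASCII-only content, tagged as UTF-8
r = Regexp.new(s)              # Regexp built from that same string

puts s.encoding      # the String keeps its UTF-8 tag
puts s.ascii_only?   # and separately reports that it holds only ASCII
puts r.encoding      # the Regexp is tagged US-ASCII instead
```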

However there is a hidden "fixed_encoding" property you can set on a
Regexp:

>> r1 = Regexp.new("string")
=> /string/
>> r2 = Regexp.new("string", Regexp::FIXEDENCODING)
=> /string/
>> r1.encoding
=> #<Encoding:US-ASCII>
>> r2.encoding
=> #<Encoding:UTF-8>
>> r1.fixed_encoding?
=> false
>> r2.fixed_encoding?
=> true

I say it's a "hidden" property because the flag isn't revealed if you
use inspect or to_s (unlike the //m, //i and //x properties):

>> r1.to_s
=> "(?-mix:string)"
>> r2.to_s
=> "(?-mix:string)"
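
Although the flag doesn't show up in to_s, it does change matching behaviour. The sketch below is based on my reading of the Regexp documentation rather than on anything in the session above, and the Latin-1 test string is just an illustrative input: a regexp without the flag adapts to the string's encoding, while one with the flag refuses a non-ASCII string in a different encoding.

```ruby
r1 = Regexp.new("string")                                          # encoding not fixed
r2 = Regexp.new("string".encode("UTF-8"), Regexp::FIXEDENCODING)   # fixed to UTF-8

# Illustrative input: a Latin-1 string containing a non-ASCII character.
latin1 = "str\u00EFng".encode("ISO-8859-1")

# The non-fixed regexp matches against the string in the string's own
# encoding; the pattern is simply not found, so the result is nil.
r1_result = (r1 =~ latin1)

# The fixed-encoding regexp raises because UTF-8 and ISO-8859-1 are
# incompatible once non-ASCII characters are present.
r2_error = begin
  r2 =~ latin1
  nil
rescue Encoding::CompatibilityError => e
  e.class
end

puts r1_result.inspect
puts r2_error.inspect
```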

HTH,

Brian.

(*) Except inasmuch as the entire Encoding nonsense in ruby 1.9 is one
enormous bug.

From: David Springer on
Perry,

In 1.9 there is only one optional parameter.

You can force the encoding of the string parameter (if needed)
AND also pass the options parameter.

Try this:

#!/usr/bin/env ruby

s = "string"
puts s.encoding
r = Regexp.new(s.encode("utf-8"), Regexp::ENC_UTF8)
puts r.encoding

Here is the output:

US-ASCII
UTF-8

-David

From: Perry Smith on
Hi Brian and David,

Thanks. I'm doing more experimenting and I'm also looking at the source
code. I need to drag down the latest. I'm looking at 1.9.1 p243 right
now.

Regexp.new has a third optional argument -- it is sorta described in the
Pick Axe book but the code looks wrong. It can be either 'n' or 'xN'
where x can be anything. Perhaps that is gone in the latest code.

But the "fixed encoding" is a key part of the puzzle I was missing.
Also, David, I had not bumped into the ENC_UTF8 constant yet. There are
quite a few constants, and one of them (the 16 that David pointed out) is
the flag that makes the encoding "fixed".
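
A quick way to check that connection, assuming a Ruby that defines Regexp::FIXEDENCODING (the constant Brian used above). The variable names are just for illustration:

```ruby
# The bare 16 and the named constant should be the same option bit.
puts Regexp::FIXEDENCODING

r_num   = Regexp.new("string", 16)
r_const = Regexp.new("string", Regexp::FIXEDENCODING)

# Both ways of passing the option yield a fixed-encoding Regexp.
puts r_num.fixed_encoding?
puts r_const.fixed_encoding?
```

Using the named constant is probably the safer habit, since it documents the intent and doesn't rely on the numeric value.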

The latest code that David posted answers exactly what my original
question was. Thanks!