UTF8 hell [Ruby]


From: Perry Smith on 23 Feb 2010 12:20

> A general hint for debugging encoding troubles: the UTF-8 encoding
> *guarantees* that every Unicode codepoint is *either* encoded into a
> *single* octet with its most significant bit cleared to 0 (i.e. a
> decimal value between 0 and 127) *or* into a *sequence* of 2 to 6
> octets, *all* of which have their MSB set to 1 (i.e. a decimal value
> between 128 and 255).

Question: The sequence of 2 to 6 octets: is it always even? i.e. 2, 4,
or 6 but not 3 nor 5 octets?


From: Jörg W Mittag on 23 Feb 2010 16:39

Perry Smith wrote:
>> A general hint for debugging encoding troubles: the UTF-8 encoding
>> *guarantees* that every Unicode codepoint is *either* encoded into a
>> *single* octet with its most significant bit cleared to 0 (i.e. a
>> decimal value between 0 and 127) *or* into a *sequence* of 2 to 6
>> octets, *all* of which have their MSB set to 1 (i.e. a decimal value
>> between 128 and 255).
> Question: The sequence of 2 to 6 octets: is it always even? i.e. 2, 4,
> or 6 but not 3 nor 5 octets?

Nope.

First off: I was wrong, the longest encoding is actually 4 octets,
not 6. (I was confused by the algorithm: the algorithm actually allows
for up to 8 bytes, but because of the way Unicode characters are
allocated, and UTF-8 is defined, it is guaranteed that there will
never be more than 4.)

The encodings look like this:

0xxxxxxx for ASCII
110xxxxx 10xxxxxx for U+80 to U+7FF
1110xxxx 10xxxxxx 10xxxxxx for U+800 to U+FFFF and
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx for U+10000 to U+10FFFF
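
To see these patterns from Ruby, here is a minimal sketch that prints the bits of one code point from each length class. The helper name utf8_bits and the sample code points are illustrative choices, not something from this thread.

```ruby
# Return the UTF-8 encoding of a code point as space-separated bit strings.
# `utf8_bits` is a hypothetical helper name used only for this illustration.
def utf8_bits(codepoint)
  [codepoint].pack('U').bytes.map { |b| format('%08b', b) }.join(' ')
end

# One sample from each length class: 'A', 'é', the euro sign, an emoji.
[0x41, 0xE9, 0x20AC, 0x1F600].each do |cp|
  puts format('U+%04X  %s', cp, utf8_bits(cp))
end
```

The leading bits of each printed byte should match the patterns in the table above.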

This is actually pretty clever:

* you can always tell whether you are inside a multibyte sequence or
not because of the high bit,
* you can always tell whether a byte in the sequence is the first one
or a later one, because the first one always starts with 11 and the
other ones always start with 10 and
* you can always tell how long a sequence is by the number of 1 bits
in the start byte: two-byte sequences start with two 1s, three-byte
sequences start with three 1s and four-byte sequences start with
four 1s.

This means that you can usually re-synchronize pretty easily from the
middle of a corrupted network transmission, for example. You can also
jump over bytes if you are counting the length.
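
The three properties above can be sketched in a few lines of Ruby. The method names are hypothetical, and the sketch only looks at the leading bits: it does not reject overlong forms or code points outside the Unicode range, so it is not a validator.

```ruby
# Classify a single byte (an Integer 0..255) by its leading bits.
def utf8_byte_kind(byte)
  if byte < 0x80 then :ascii            # 0xxxxxxx
  elsif byte < 0xC0 then :continuation  # 10xxxxxx
  elsif byte < 0xE0 then :lead2         # 110xxxxx
  elsif byte < 0xF0 then :lead3         # 1110xxxx
  elsif byte < 0xF8 then :lead4         # 11110xxx
  else :invalid                         # 11111xxx is not used by UTF-8
  end
end

# Count characters by counting every byte that is not a continuation byte.
def utf8_char_count(bytes)
  bytes.count { |b| utf8_byte_kind(b) != :continuation }
end

# Re-synchronize: return the index of the first character start at or
# after `pos` by skipping over continuation bytes.
def resync(bytes, pos)
  pos += 1 while pos < bytes.size && utf8_byte_kind(bytes[pos]) == :continuation
  pos
end
```

For example, given the bytes of "méd" (109 195 169 100), starting in the middle of the two-byte sequence at index 2 moves forward to index 3, the start of the next character.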

jwm

From: Robert Klemme on 23 Feb 2010 17:05

On 23.02.2010 12:10, Xavier Noëlle wrote:
> 2010/2/2 Robert Klemme <shortcutter(a)googlemail.com>:
>> You probably first want to find out whether the byte sequence is valid
>> UTF-8 or not. For that you would need to look at the bytes in the
>> String. I guess chances are that your String's byte sequence is NOT
>> valid UTF-8 OR you have a character in the string that has no
>> lowercase representation.

> I dug into the problem and ended up with this line: self.force_encoding('UTF-8')
> Believing that the string #encoding was right was a wrong choice, then
> I assumed the database provided valid UTF8 strings.

The string you show below does not look UTF-8 encoded; it is more likely
ISO-8859-1 or similar. Forcing an encoding leaves the byte sequence
untouched and only changes how those bytes are interpreted. This leads
to the kind of error you describe below.

> BUT (because, there's a but...), for some reason I don't understand,
> some strings are unwilling to work:
>
> Example:
> puts self => médicals
> self.each_byte {|b| print "#{b} "} => 109 233 100 105 99 97 108 115
>
> 233 is, AFAIK, a valid UTF8 character, but calling gsub(anything) (eg.
> self.gsub('ruby', 'zorglub')) on this string leads to: `gsub': invalid
> byte sequence in UTF-8 (ArgumentError).
>
> Where am I wrong ?

As far as I can see, 233 (0xE9, binary 11101001) starts a three-byte
sequence, so it must be followed by two continuation bytes:

http://en.wikipedia.org/wiki/UTF-8#Description

I did not dig deeper, but it may be that forcing UTF-8 onto a string
that is actually encoded in some ISO-8859 variant is what broke it.
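
A small sketch of what happens with the bytes from the example, assuming Ruby 1.9 or later. The variable names are illustrative only.

```ruby
# The raw bytes printed by each_byte in the example above.
raw = [109, 233, 100, 105, 99, 97, 108, 115].pack('C*')

# Relabel a copy as UTF-8. No bytes are converted.
as_utf8 = raw.dup.force_encoding('UTF-8')

# 233 announces a three-byte sequence, but the next byte (100) is plain
# ASCII rather than a continuation byte, so the string is not valid UTF-8.
utf8_ok = as_utf8.valid_encoding?

# force_encoding changed only the label, not the bytes.
same_bytes = as_utf8.bytes.to_a == raw.bytes.to_a

puts "valid UTF-8: #{utf8_ok}, bytes unchanged: #{same_bytes}"
```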

Kind regards

robert

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

From: Michael Fellinger on 23 Feb 2010 23:12

On Wed, Feb 24, 2010 at 12:18 AM, Xavier Noëlle <xavier.noelle(a)gmail.com> wrote:
> 2010/2/23 Yukihiro Matsumoto <matz(a)ruby-lang.org>:
>> 233 is not a valid UTF-8 character. The byte sequence for médicals is
>> <109 195 169 100 105 99 97 108 115>.
>
> Indeed. In the meantime, I changed the code with this one:
> def isUTF8()
>   begin
>     self.unpack('U*')
>   rescue
>     return false
>   end
>   return true
> end
>
> if isUTF8()
>   self.force_encoding('UTF-8')
> else
>   self.force_encoding('ISO-8859-1')
>   self.encode!('UTF-8')
> end

string = "\xE8te pour luth"
# "\xE8te pour luth"
string.encoding
# #<Encoding:UTF-8>
string.valid_encoding?
# false
string.force_encoding('ISO-8859-1')
# "ète pour luth"
string.valid_encoding?
# true
string.upcase
# "èTE POUR LUTH"
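
Building on valid_encoding?, the quoted quickfix could be sketched as below. The name to_utf8 is hypothetical, and the fallback to ISO-8859-1 is an assumption: the thread only says the column's encoding is unknown, and any byte sequence is valid ISO-8859-1, so a wrong guess cannot be detected this way.

```ruby
# Hypothetical helper: return a UTF-8 copy of `str`.
# Assumption: anything that is not valid UTF-8 is treated as ISO-8859-1.
def to_utf8(str)
  candidate = str.dup.force_encoding('UTF-8')
  return candidate if candidate.valid_encoding?

  # Reinterpret the original bytes as ISO-8859-1, then transcode.
  str.dup.force_encoding('ISO-8859-1').encode('UTF-8')
end

puts to_utf8("\xE8te pour luth")
```

Working on a copy rather than mutating the receiver keeps the original bytes available in case the ISO-8859-1 guess turns out to be wrong.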

> This (ugly) quickfix works for what I need, but I don't know if this
> problem can be somehow resolved in another way. The problem being that
> my SQL database has a VARBINARY column with an unknown encoding. Is
> there a way to deal with the various possible encoding or to ask MySQL
> to return UTF8 converted data, or is it necessary to clean data before
> inserting them ?
>
> --
> Xavier NOELLE
>
>

--
Michael Fellinger
CTO, The Rubyists, LLC
972-996-5199
