unicode string alteration [Python]

Prev: Announcing: python-ghostscript 0.3
Next: How to parse a sentence using grammars provided by nltk?

From: BAvant Garde on 12 Aug 2010 11:44

HELP!!!
I need help with a Unicode issue that has me stumped. I must be doing something wrong, because I don't believe this condition would have slipped through testing.

Wherever the string u'\udbff\udc00' occurs, u'\U0010fc00' (that is, unichr(1113088)) is substituted and the file loses one character, so all trailing characters are shifted out of position. No other corrupt strings have been detected.

The condition was noticed while testing in Python 2.6.5 on Ubuntu 10.04, where the maximum ord() value is 1114111 (wide Python build).

Using Python 2.5.4 on Windows ME, where the maximum ord() value is 65535 (narrow Python build), the string u'\U0010fc00' also occurs and it "seems" that the substitution takes place, but no characters are lost and file sizes are OK. Note that ord(u'\U0010fc00') causes the following error:
"TypeError: ord() expected a character, but string of length 2 found"
The condition is otherwise invisible in 2.5.4 and is handled internally without any apparent effect on processing, with the characters u'\udbff' and u'\udc00' each being separately accessible.

The first part of the attachment repeats this email but also has examples and illustrates other related oddities.

Any help would be greatly appreciated.
Bruce

From: MRAB on 12 Aug 2010 13:31

BAvant Garde wrote:
> HELP!!!
> I need help with a unicode issue that has me stumped. I must be doing
> something wrong because I don't believe this condition would have
> slipped thru testing.
>
> Wherever the string u'\udbff\udc00' occurs u'\U0010fc00' or
> unichr(1113088) is substituted and the file loses 1 character resulting
> in all trailing characters being shifted out of position. No other
> corrupt strings have been detected.
>
> The condition was noticed while testing in Python 2.6.5 on Ubuntu 10.04
> where the maximum ord # is 1114111 (wide Python build).
>
> Using Python 2.5.4 on Windows-ME where the maximum ord # is 65535
> (narrow Python build) the string u'\U0010fc00' also occurs and it
> "seems" that the substitution takes place but no characters are lost and
> file sizes are ok. Note that ord(u'\U0010fc00') causes the following error:
> "TypeError: ord() expected a character, but string of
> length 2 found"
> The condition is otherwise invisible in 2.5.4 and is handled internally
> without any apparent effect on processing with characters u'\udbff' and
> u'\udc00' each being separately accessible.
>
> The first part of the attachment repeats this email but also has
> examples and illustrates other related oddities.
>
> Any help would be greatly appreciated.
>
It's not an error; it's a "surrogate pair". Surrogate pairs are part of
the Unicode specification.

Unicode codepoints go up to U+0010FFFF.

If you're using 16 bits per codepoint, like in a narrow build of Python,
then the codepoints above U+FFFF _can't_ be represented directly, so
each one is represented by a pair of 16-bit values from the reserved
range U+D800 to U+DFFF, called a "surrogate pair".
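
As a rough illustration of how a codepoint above U+FFFF is split into
such a pair, here is a minimal sketch using plain integer arithmetic.
The function name to_surrogate_pair is made up for this example and is
not part of Python.

```python
def to_surrogate_pair(codepoint):
    """Split a codepoint in U+10000..U+10FFFF into (high, low) surrogates."""
    if not 0x10000 <= codepoint <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF need a surrogate pair")
    offset = codepoint - 0x10000        # a 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits -> low surrogate
    return high, low

high, low = to_surrogate_pair(0x10FC00)
print("%#x %#x" % (high, low))  # 0xdbff 0xdc00
```

So U+10FC00 comes out as the two values seen in the original string,
u'\udbff' followed by u'\udc00'.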

If, on the other hand, you're using 32 bits per codepoint, like in a
wide build of Python, then the codepoints above U+FFFF _can_ be
represented directly, so surrogate pairs aren't needed and, indeed,
shouldn't be there.
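
If it helps, one way to tell at run time which kind of build you have
is to look at sys.maxunicode, which is where the 65535 and 1114111
figures come from. The helper name is_wide_build is only for
illustration. (On Python 3.3 and later the value is always 1114111.)

```python
import sys

def is_wide_build():
    """Return True if this interpreter can hold codepoints above U+FFFF directly."""
    # sys.maxunicode is 0xFFFF on a narrow build and 0x10FFFF on a wide one.
    return sys.maxunicode > 0xFFFF

print("%d %s" % (sys.maxunicode, is_wide_build()))
```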

What you're seeing in the wide build is Python replacing a surrogate
pair with the codepoint that it represents, which is actually the right
thing to do because, as I said, the surrogate pairs really shouldn't be
there.
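
To check the arithmetic for the pair in question, here is a small
sketch of the combining step. It is only an illustration of the UTF-16
formula, not Python's actual internal code, and the name
from_surrogate_pair is made up.

```python
def from_surrogate_pair(high, low):
    """Combine a high and a low surrogate into the codepoint they encode."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    # Each surrogate carries 10 bits of the 20-bit offset above U+10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

codepoint = from_surrogate_pair(0xDBFF, 0xDC00)
print("%d U+%06X" % (codepoint, codepoint))  # 1113088 U+10FC00
```

That matches unichr(1113088), and since two 16-bit values become one
codepoint, it would also account for the string being one character
shorter on the wide build.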
