CGI and UTF-8 [Perl]

From: Ben Morrow on 28 Sep 2009 18:01

Quoth "Jochen Lehmeier" <OJZGSRPBZVCX(a)spammotel.com>:
> On Mon, 28 Sep 2009 15:41:49 +0200, Helmut Richter <hhr-m(a)web.de> wrote:
>
> > Dealing with UTF-8 requires that byte strings and text strings are
> > meticulously kept apart.
>
> Uhm. What are byte strings, what are text strings? Perl does not use these
> words in the context of utf8.

Perl doesn't maintain the distinction itself (this is one of the flaws
in Perl's Unicode implementation) but it is nevertheless important to
keep in mind whether a given string is meant to be Unicode characters or
some encoding into bytes for IO. Proper use of binmode will let you make
all your strings character strings in simple situations.
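
A minimal sketch of that distinction, assuming the Encode module that ships
with perl. The variable names are only illustrative. The same two-byte input
has length 2 as a byte string and length 1 once it has been decoded into a
character string; the encoding back to bytes is left to a layer on the
output handle.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Bytes as they would arrive from a socket or a file: the UTF-8
# encoding of U+00E4 (a with diaeresis).
my $bytes = "\xC3\xA4";

# Decode once, at the input boundary, to get a character string.
my $chars = decode('UTF-8', $bytes);

printf "bytes: length %d\n", length $bytes;
printf "chars: length %d\n", length $chars;

# Encode once, at the output boundary, by putting a layer on the handle.
binmode(STDOUT, ':encoding(UTF-8)');
print $chars, "\n";
```

Everything between the two boundaries then only ever sees character strings.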

> > else, STDOUT is UTF-8 (that is, binmode (STDOUT, ":utf8"); has been
> > done),
>
> This should not be done. The correct line would be
>
> binmode STDOUT,":encoding(utf8)";
>
> This activates error checking etc., while your version treats strings as
> utf8 while not checking them at all, which could lead to bad_things[tm]
> (some docs hinted at segmentation faults even, though I do not know if
> that is true).

It makes no difference on output. What is important to avoid is feeding
invalid UTF8 to a handle with a :utf8 *input* layer: this certainly used
to cause segfaults in the past, though they may have been replaced with
fatal errors by now.

Ben

From: Jochen Lehmeier on 28 Sep 2009 18:15

On Tue, 29 Sep 2009 00:01:37 +0200, Ben Morrow <ben(a)morrow.me.uk> wrote:

> [:utf8 vs. :encoding(utf8)]
> It makes no difference on output.

:encoding(utf8) does validate on output though, or doesn't it?

From: Ben Morrow on 28 Sep 2009 18:46

Quoth "Jochen Lehmeier" <OJZGSRPBZVCX(a)spammotel.com>:
> On Tue, 29 Sep 2009 00:01:37 +0200, Ben Morrow <ben(a)morrow.me.uk> wrote:
>
> > [:utf8 vs. :encoding(utf8)]
> > It makes no difference on output.
>
> :encoding(utf8) does validate on output though, or doesn't it?

What do you mean, 'validate'? Perl strings are (logically) sequences of
Unicode characters, and any sequence of Unicode characters can be
represented in utf8. If you end up with a perl string with a corrupted
internal representation you've got bigger problems than invalid output
encoding.

Of course, perl's definition of 'utf8' is different from the Unicode
Consortium's 'UTF-8': the standard forbids representations of surrogates
and unassigned codepoints (and possibly other things I've forgotten). If
you want perl to enforce these restrictions you need to ask for it with
:encoding(UTF-8) (this appears to only be documented in perldoc Encode).
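
A small sketch of that difference, assuming a reasonably recent Encode. The
helper name encodes_ok is hypothetical. It asks Encode to croak instead of
substituting a replacement character, then tries to encode a lone surrogate
under both names.

```perl
use strict;
use warnings;
no warnings 'utf8';   # keep surrogate warnings quiet on perls that emit them
use Encode ();

# A lone surrogate: fine for perl's lax 'utf8', forbidden by the standard.
my $surrogate = "\x{D800}";

# Hypothetical helper: true if $string can be encoded under $encoding
# without Encode having to croak.
sub encodes_ok {
    my ($encoding, $string) = @_;
    my $ok = eval {
        Encode::encode($encoding, $string,
                       Encode::FB_CROAK | Encode::LEAVE_SRC);
        1;
    };
    return $ok ? 1 : 0;
}

my $lax_ok    = encodes_ok('utf8',  $surrogate);
my $strict_ok = encodes_ok('UTF-8', $surrogate);

print "utf8:  ", ($lax_ok    ? "accepted" : "rejected"), "\n";
print "UTF-8: ", ($strict_ok ? "accepted" : "rejected"), "\n";
```

Without a CHECK argument Encode does not croak on such input, so the strict
behaviour only shows up as an error when it is requested this way.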

Ben

From: Helmut Richter on 29 Sep 2009 06:35

On Mon, 28 Sep 2009, wrote:

> I assume you did set the META charset of the HTML page to UTF-8? Or did
> you let the browser guess about the encoding and then it returned the
> wrong encoding in the form response?

The problem is that the correct bytes arrive but are interpreted the wrong
way. Two answers in this thread show two different ways to control that
interpretation. I'll try them out.

Thanks to all who have responded.

--
Helmut Richter

From: Jochen Lehmeier on 29 Sep 2009 15:32

On Tue, 29 Sep 2009 00:46:54 +0200, Ben Morrow <ben(a)morrow.me.uk> wrote:

>> :encoding(utf8) does validate on output though, or doesn't it?
>
> What do you mean, 'validate'?

To raise an error, or at least display some warning, when it encounters
invalid bytes in a utf8-flagged string, like perl does in many other places.

> Perl strings are (logically) sequences of
> Unicode characters, and any sequence of Unicode characters can be
> represented in utf8. If you end up with a perl string with a corrupted
> internal representation you've got bigger problems than invalid output
> encoding.

Or I might simply be using some code which raises the utf8 flag on strings
that are not.

Example:

> cat test.pl
#!/usr/bin/perl -w

use strict;
use Encode;

my $validUTF8 = "\x{1010}";
my $bytes = encode("utf8",$validUTF8); # e1 80 90
my $invalidUTF8 = substr($bytes,0,length($bytes)-1); # e1 80

open(TMP,">invalidutf8.txt") or die;
print TMP $invalidUTF8;
close TMP;

open(TMP,"<invalidutf8.txt") or die;
binmode TMP,":utf8";
my $invalid2 = <TMP>;
close TMP;

print STDERR "is_utf8: '".utf8::is_utf8($invalid2).
"' valid: '".utf8::valid($invalid2)."'\n".
"length: ".length($invalid2)."\n";

print $invalid2;

binmode STDOUT,":utf8";
print $invalid2;

binmode STDOUT,":encoding(utf8)";
print $invalid2;

> perl --version

This is perl, v5.8.8 built for i486-linux-gnu-thread-multi
....

> perl test.pl | hexdump -C
utf8 "\xE1" does not map to Unicode at test.pl line 16, <TMP> line 1.
Malformed UTF-8 character (unexpected end of string) in length at test.pl
line 19.
is_utf8: '1' valid: ''
length: 0
Wide character in print at test.pl line 23.
00000000 e1 80 e1 80 e1 80 |......|
00000006

> hexdump -C invalidutf8.txt
00000000 e1 80 |..|
00000002

The first line ("...does not map...") comes from reading the "binmode :utf8"
handle. Note that $invalid2 contains exactly those two broken bytes from
invalidutf8.txt, anyway.

The next validation message comes from length($invalid2) ("...Malformed
UTF-8..."). Note that the string is indeed utf8 flagged, though perl has
noticed that it is invalid.

This example is just a quick test case for invalid utf8 in a utf8-flagged
string in perl. My point in the previous post was that I assumed
:encoding(utf8) on the output handle would at least give another
"...malformed..." message or, better yet, would not output anything. It does
not, though: it silently and happily outputs the broken utf8, just like
printing to a non-utf8 handle (hence the "Wide character in print") or a
":utf8" handle.

You are right then: ":encoding(utf8)" seems to differ from ":utf8" only when
used on an input handle. If 'binmode TMP,":encoding(utf8)";' is used when
reading the broken bytes in, then everything works fine ($invalid2==undef,
in that case).
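
For bytes that do not come through a filehandle, a comparable check can be
sketched with Encode directly. This is only an illustration under the
assumption that the data is still available as raw bytes; decode_or_undef is
a hypothetical helper, not something from the test script above.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: return the decoded character string, or undef if
# the bytes are not well-formed UTF-8.
sub decode_or_undef {
    my ($bytes) = @_;
    my $text = eval {
        Encode::decode('UTF-8', $bytes,
                       Encode::FB_CROAK | Encode::LEAVE_SRC);
    };
    return $text;
}

my $good = decode_or_undef("\xE1\x80\x90");   # complete sequence for U+1010
my $bad  = decode_or_undef("\xE1\x80");       # truncated, as in invalidutf8.txt

print "good: ", (defined $good ? "decoded" : "rejected"), "\n";
print "bad:  ", (defined $bad  ? "decoded" : "rejected"), "\n";
```

This does not repair a string that already carries the utf8 flag with a
broken internal representation; it only keeps such strings from being
created at the point where bytes are decoded.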

BTW, this is not purely hypothetical for me; I have to work with some broken
modules which I cannot easily change, and which in some weird cases produce
such invalid utf8 strings.

It would be interesting to see whether newer perls behave the same. Is there
someone who would like to run the test script through 5.10?
