Prev: decimal round off issue
Next: Simple question about CGI response after form data has been processed.
From: Ben Morrow on 28 Sep 2009 18:01

Quoth "Jochen Lehmeier" <OJZGSRPBZVCX(a)spammotel.com>:
> On Mon, 28 Sep 2009 15:41:49 +0200, Helmut Richter <hhr-m(a)web.de> wrote:
>
> > Dealing with UTF-8 requires that byte strings and texts strings are
> > meticulously kept apart.
>
> Uhm. What are byte strings, what are text strings? Perl does not use these
> words in the context of utf8.

Perl doesn't maintain the distinction itself (this is one of the flaws in Perl's Unicode implementation) but it is nevertheless important to keep in mind whether a given string is meant to be Unicode characters or some encoding into bytes for IO. Proper use of binmode will let you make all your strings character strings in simple situations.

> > else, STDOUT is UTF-8 (that is, binmode (STDOUT, ":utf8"); has been
> > done),
>
> This should not be done. The correct line would be
>
> binmode STDOUT,":encoding(utf8)";
>
> This activates error checking etc., while your version treats string as
> utf8 while not checking them at all, which could lead to bad_things[tm]
> (some docs hinted at segmentation faults even, though I do not know if
> that is true).

It makes no difference on output. What is important to avoid is feeding invalid UTF8 to a handle with a :utf8 *input* layer: this certainly used to cause segfaults in the past, though they may have been replaced with fatal errors by now.

Ben
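As an illustration of the byte string versus character string distinction discussed above, here is a minimal sketch using the Encode module. The variable names are made up for the example, and the sample input is assumed to be the two UTF-8 bytes of "ä"; it is not code from any poster in the thread.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Bytes as they might arrive from a file or socket: "ä" encoded as UTF-8.
my $bytes = "\xC3\xA4";

# Decode once at the input boundary to obtain a character string.
my $chars = decode('UTF-8', $bytes);

# length() counts bytes for the byte string and characters for the decoded one.
my $byte_len = length $bytes;
my $char_len = length $chars;

# Encode once at the output boundary (a binmode layer can do this instead).
my $out = encode('UTF-8', $chars);

print "bytes: $byte_len, characters: $char_len\n";
```

The idea is to decode as early as possible and encode as late as possible, so that everything in between deals only with character strings.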
From: Jochen Lehmeier on 28 Sep 2009 18:15

On Tue, 29 Sep 2009 00:01:37 +0200, Ben Morrow <ben(a)morrow.me.uk> wrote:

> [:utf8 vs. :encoding(utf8)]
> It makes no difference on output.

:encoding(utf8) does validate on output though, or doesn't it?
From: Ben Morrow on 28 Sep 2009 18:46

Quoth "Jochen Lehmeier" <OJZGSRPBZVCX(a)spammotel.com>:
> On Tue, 29 Sep 2009 00:01:37 +0200, Ben Morrow <ben(a)morrow.me.uk> wrote:
>
> > [:utf8 vs. :encoding(utf8)]
> > It makes no difference on output.
>
> :encoding(utf8) does validate on output though, or doesn't it?

What do you mean, 'validate'? Perl strings are (logically) sequences of Unicode characters, and any sequence of Unicode characters can be represented in utf8. If you end up with a perl string with a corrupted internal representation you've got bigger problems than invalid output encoding.

Of course, perl's definition of 'utf8' is different from the Unicode Consortium's 'UTF-8': the standard forbids representations of surrogates and unassigned codepoints (and possibly other things I've forgotten). If you want perl to enforce these restrictions you need to ask for it with :encoding(UTF-8) (this appears to only be documented in perldoc Encode).

Ben
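The difference between lax 'utf8' and strict 'UTF-8' mentioned above can be sketched with Encode directly. This is only an illustration under assumptions: the lone surrogate U+D800 is picked as the test character, the variable names are invented for the example, and older perls may print an extra warning when the surrogate is created.

```perl
use strict;
use warnings;
use Encode qw(encode FB_CROAK LEAVE_SRC);

# A lone surrogate: perl can hold it in a string, the UTF-8 standard forbids it.
my $surrogate = chr(0xD800);

# Lax "utf8" passes perl's internal representation through as three bytes.
my $lax = encode('utf8', $surrogate, LEAVE_SRC);

# Strict "UTF-8" with FB_CROAK refuses to encode the surrogate and dies.
my $strict = eval { encode('UTF-8', $surrogate, FB_CROAK | LEAVE_SRC) };
my $strict_error = $@;

printf "lax:    %s\n", join ' ', map { sprintf '%02x', ord } split //, $lax;
printf "strict: %s\n", defined $strict ? 'encoded' : 'rejected';
```

Without a check argument such as FB_CROAK, the strict encoder substitutes a replacement character rather than dying, so the check has to be requested explicitly if a hard failure is wanted.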
From: Helmut Richter on 29 Sep 2009 06:35

On Mon, 28 Sep 2009, wrote:

> I assume you did set the META charset of the HTML page to UTF-8? Or did
> you let the browser guess about the encoding and then it returned the
> wrong encoding in the form response?

The problem is that the correct bytes arrive but are interpreted in the wrong way. Two answers in this thread show two different ways to control that interpretation. I'll try them out.

Thanks to all who have responded.

--
Helmut Richter
From: Jochen Lehmeier on 29 Sep 2009 15:32

On Tue, 29 Sep 2009 00:46:54 +0200, Ben Morrow <ben(a)morrow.me.uk> wrote:

>> :encoding(utf8) does validate on output though, or doesn't it?
>
> What do you mean, 'validate'?

To raise an error or at least display some warning about it when it encounters invalid bytes in an utf8 flagged string, like perl does at many other places.

> Perl strings are (logically) sequences of
> Unicode characters, and any sequence of Unicode characters can be
> represented in utf8. If you end up with a perl string with a corrupted
> internal representation you've got bigger problems than invalid output
> encoding.

Or I might simply be using some code which raises the utf8 flag on strings that are not. Example:

> cat test.pl
#!/usr/bin/perl -w

use strict;
use Encode;

my $validUTF8 = "\x{1010}";
my $bytes = encode("utf8",$validUTF8); # e1 80 90
my $invalidUTF8 = substr($bytes,0,length($bytes)-1); # e1 80

open(TMP,">invalidutf8.txt") or die;
print TMP $invalidUTF8;
close TMP;

open(TMP,"<invalidutf8.txt") or die;
binmode TMP,":utf8";
my $invalid2 = <TMP>;
close TMP;

print STDERR "is_utf8: '".utf8::is_utf8($invalid2).
    "' valid: '".utf8::valid($invalid2)."'\n".
    "length: ".length($invalid2)."\n";

print $invalid2;
binmode STDOUT,":utf8";
print $invalid2;
binmode STDOUT,":encoding(utf8)";
print $invalid2;

> perl --version
This is perl, v5.8.8 built for i486-linux-gnu-thread-multi
....

> perl test.pl | hexdump -C
utf8 "\xE1" does not map to Unicode at test.pl line 16, <TMP> line 1.
Malformed UTF-8 character (unexpected end of string) in length at test.pl line 19.
is_utf8: '1' valid: ''
length: 0
Wide character in print at test.pl line 23.
00000000  e1 80 e1 80 e1 80                                 |......|
00000006

> hexdump -C invalidutf8.txt
00000000  e1 80                                             |..|
00000002

The first line ("...does not map...") comes from reading the "binmode :utf8" handle. Note that $invalid2 contains exactly those two broken bytes from invalidutf8.txt, anyway.
The next validation message comes from length($invalid2) ("...Malformed UTF-8..."). Note that the string is indeed utf8 flagged, though perl has noticed that it is invalid.

This example is just to provide a quick test case for invalid utf8 in a utf8-flagged string in perl.

My point from the previous post was that I assumed :encoding(utf8) on the output handle would at least give another "...malformed..." message, or, better yet, would not output anything. It does not, though: it silently and happily outputs the broken utf8, just like printing to a non-utf8 handle (hence the "Wide character in print") or a ":utf8" handle.

You are right then - "encoding(utf8)" seems only to differ from "utf8" when used on an input handle. If 'binmode TMP,":encoding(utf8)";' is used when reading the broken bytes in, then everything works fine ($invalid2==undef, in that case).

BTW, this is not purely hypothetical for me; I have to work with some broken modules which I cannot easily change, and which in some weird cases produce such invalid utf8 strings.

It would be interesting to see whether newer perls behave the same. Is there someone who would like to run the test script through 5.10?
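For strings that come from code one cannot change, the input-side validation described above can also be done in memory. The following is a rough sketch, assuming the same two broken bytes as in the test case; the variable names are only for illustration, and this is not a fix for the modules mentioned.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# The two broken bytes from the test case: a truncated three-byte sequence.
my $broken = "\xE1\x80";

# Strict decoding with FB_CROAK dies instead of returning a damaged string.
my $decoded = eval { decode('UTF-8', $broken, FB_CROAK | LEAVE_SRC) };
my $error   = $@;

# utf8::decode() is a core alternative: it returns false and leaves the
# bytes untouched when they are not well-formed UTF-8.
my $copy = $broken;
my $ok   = utf8::decode($copy);

print defined $decoded ? "decoded\n" : "rejected by decode\n";
print $ok ? "utf8::decode succeeded\n" : "rejected by utf8::decode\n";
```

Either check rejects the data before it can become a utf8-flagged string with a malformed internal representation, which is the situation the test script creates through the ":utf8" input layer.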