CGI and UTF-8 [Perl]


From: Ben Morrow on 29 Sep 2009 17:28

Quoth "Jochen Lehmeier" <OJZGSRPBZVCX(a)spammotel.com>:
> On Tue, 29 Sep 2009 00:46:54 +0200, Ben Morrow <ben(a)morrow.me.uk> wrote:
>
> >> :encoding(utf8) does validate on output though, or doesn't it?
> >
> > What do you mean, 'validate'?
>
> To raise an error or at least display some warning when it encounters
> invalid bytes in a utf8-flagged string, like perl does in many other
> places.

Hmmm. IMHO that such strings can be created at all (without the help of a
misbehaving XS module) is a bug in perl. If you find yourself getting
such strings from somewhere you should check all strings and correct or
discard the invalid ones as soon as possible. As I said before, I don't
know whether there are still any seg faults/memory corruption bugs
related to invalid utf8 in a SvUTF8 string, but I wouldn't be surprised.

> The first line ("...does not map...") comes from reading the "binmode
> :utf8" handle. Note that $invalid2 contains exactly those two broken
> bytes from invalidutf8.txt, anyway.

Yes. The :utf8 layer (alone) is always a mistake on input. That's the
real bug in this example: if it weren't for that, you would never have
created an incorrectly-marked string in the first place.
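
A minimal sketch of the strict alternative, for illustration only: decode
the raw bytes with Encode and let it croak on malformed input, so a badly
marked string never comes into existence. The helper name decode_strict is
hypothetical; the same idea applies to opening a handle with
:encoding(UTF-8) instead of bare :utf8.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode raw bytes as strict UTF-8. Returns undef
# (rather than an incorrectly flagged string) when the bytes are malformed.
sub decode_strict {
    my ($bytes) = @_;
    # FB_CROAK makes decode() die on malformed input;
    # LEAVE_SRC keeps decode() from modifying $bytes in place.
    my $text = eval {
        decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
    };
    return $text;    # undef if decode() croaked
}

my $good = decode_strict("caf\xC3\xA9");    # valid UTF-8 bytes
my $bad  = decode_strict("caf\xE9\xE9");    # two stray bytes, not valid UTF-8

print defined $good ? "good: accepted\n" : "good: rejected\n";
print defined $bad  ? "bad: accepted\n"  : "bad: rejected\n";
```

The point of the design is that validation happens once, at the boundary
where bytes become characters, instead of in every later consumer.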

> The next validation message comes from length($invalid2) ("...Malformed
> UTF-8..."). Note that the string is indeed utf8 flagged, though perl has
> noticed that it is invalid.
>
> This example is just to provide a quick test case for invalid utf8 in a
> utf8 string in perl. My point from the previous post was that I assumed
> :encoding(utf8) on the output handle would at least give another
> "...malformed..." message, or, better yet, would not output anything.

I believe there are *lots* of places where perl will fail to notice if
you have an invalidly marked string. Basically, perl only checks for
correct utf8 if it needs to decode the string, so anything like output
that just involves copying the bytes across (assuming they are valid, as
they ought to be) won't check.

> BTW, this is not purely hypothetical for me; I have to work with some
> broken modules which I cannot easily change, which in some weird cases
> produce such invalid utf8 strings.

Make sure you check every string that might potentially be corrupted
with utf8::valid before using it for anything else.
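
For illustration, a small sketch of such a check. The helper name
consistent_or_undef is hypothetical, and the broken string is simulated
here by forcing the UTF8 flag onto non-UTF-8 bytes with Encode::_utf8_on,
which stands in for whatever the misbehaving module does.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: pass a string through only if its UTF8 flag and its
# bytes agree with each other; otherwise return undef so the caller can
# correct or discard it.
sub consistent_or_undef {
    my ($string) = @_;
    return utf8::valid($string) ? $string : undef;
}

# Simulate the output of a broken module: bytes that are not UTF-8,
# marked as if they were.
my $broken = "\xE9\xE9";
Encode::_utf8_on($broken);

for my $candidate ("plain text", $broken) {
    print defined consistent_or_undef($candidate)
        ? "string is consistent\n"
        : "string is corrupt, discarding\n";
}
```

Note that utf8::valid also returns true for any string without the UTF8
flag, so it only detects the "flagged but malformed" case discussed here.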

> It would be interesting to see whether newer perls behave the same. Is
> there someone who would like to run the test script through 5.10?

I get the same output as you with

This is perl, v5.10.1 (*) built for i386-freebsd-64int

(I'm not sure what that star's doing there. AFAIK FreeBSD don't patch
their perl any more...)

Ben

From: Jochen Lehmeier on 29 Sep 2009 18:21

On Tue, 29 Sep 2009 23:28:28 +0200, Ben Morrow <ben(a)morrow.me.uk> wrote:

>> BTW, this is not purely hypothetical for me; I have to work with some
>> broken modules which I cannot easily change, which in some weird cases
>> produce such invalid utf8 strings.
>
> Make sure you check every string that might potentially be corrupted
> with utf8::valid before using it for anything else.

Yes, of course. It would just be nice if :encoding(utf8) on an output layer
would catch those, kind of as a last resort, just the same as
:encoding(utf8) catching them on input. Or, to put it the other way round,
I find it funny that perl warns in something like the length() function,
but not while actually printing. Well, maybe in Perl 5.12. ;-)

From: Ben Morrow on 29 Sep 2009 18:47

Quoth "Jochen Lehmeier" <OJZGSRPBZVCX(a)spammotel.com>:
> On Tue, 29 Sep 2009 23:28:28 +0200, Ben Morrow <ben(a)morrow.me.uk> wrote:
>
> >> BTW, this is not purely hypothetical for me; I have to work with some
> >> broken modules which I cannot easily change, which in some weird cases
> >> produce such invalid utf8 strings.
> >
> > Make sure you check every string that might potentially be corrupted
> > with utf8::valid before using it for anything else.
>
> Yes, of course. It would just be nice if :encoding(utf8) on an output layer
> would catch those, kind of as a last resort, just the same as
> :encoding(utf8) catching them on input.

If you want, it would be fairly easy to write a :via(ValidUTF8) layer
that checks everything with utf8::valid.

> Or, to put it the other way round,
> I find it funny that perl warns in something like the length() function,
> but not while actually printing.

As I said, perl only warns if it actually needs to decode the
characters. Printing (in utf8) doesn't; it just copies the bytes onto
the output stream.

> Well, maybe in Perl 5.12. ;-)

5.12 is supposed to be fixing many of the conceptual bugs in perl's
Unicode handling. I don't know whether that will extend this far.

Ben

From: Helmut Richter on 2 Oct 2009 11:04

On Mon, 28 Sep 2009, sln(a)netherlands.com wrote:

> From the docs:
> " PRAGMAS
> -utf8
> This makes CGI.pm treat all parameters as UTF-8 strings.
> Use this with care, as it will interfere with the processing of binary uploads.

This is the same problem for *both* solutions offered in this thread:
the utf8 pragma, and setting binmode on both STDIN and STDOUT. In fact, I
have the suspicion that the effect of the pragma is not much more than
such a setting.

> It is better to manually select which fields are expected to return utf-8 strings
> and convert them using code like this:
> use Encode;
> my $arg = decode utf8=>param('foo');

This is much less than half of the story. Getting a single parameter is a
fairly easy thing to do, with or without the CGI module. Using the CGI
module for producing HTML is only a very cumbersome way of writing
something in a complicated syntax that is much easier written directly in
HTML. For which task does the CGI module offer significant help, compared
with simply outputting HTML and analysing the input?

One of the (relatively few) things that are easier with the CGI module
than without is reusing the input values as defaults for the same form
when it must be output again because of incompletely or wrongly filled-in
values. Now, if I have to touch every single value, decode it, and store
it back into the structure, I could have hand-programmed that reuse with
no more effort.
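
To make the "touch every single value" step concrete, here is a rough
sketch. It assumes the raw parameter values have already been collected
into a plain hash (the names decode_params and %form are hypothetical); it
does not use or reproduce anything from the CGI module itself.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode every value of a parameter hash in place.
# A value is either a plain scalar or an array ref (multi-valued field).
sub decode_params {
    my ($params) = @_;
    for my $name (keys %$params) {
        my $value = $params->{$name};
        $params->{$name} = ref($value) eq 'ARRAY'
            ? [ map { decode('UTF-8', $_) } @$value ]
            : decode('UTF-8', $value);
    }
    return $params;
}

# Raw UTF-8 bytes, as they would arrive from a form submission.
my %form = (
    name   => "Z\xC3\xBCrich",
    colour => [ "gr\xC3\xBCn", "blau" ],
);

decode_params(\%form);

# After decoding, length() counts characters rather than bytes.
print "name has ", length($form{name}), " characters\n";
```

Binary upload fields would have to be excluded from such a loop, which is
the same caveat the quoted documentation raises.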

--
Helmut Richter

From: Ben Morrow on 2 Oct 2009 13:32

Quoth Helmut Richter <hhr-m(a)web.de>:
>
> This is the same problem for *both* solutions offered in this thread:
> the utf8 pragma, and setting binmode on both STDIN and STDOUT. In fact, I
> have the suspicion that the effect of the pragma is not much more than
> such a setting.

There's no need to suspect: read the docs. The only effect of the utf8
pragma is to tell perl that your source code is written in UTF-8. It has
no effect on the STDIO handles. (It *does* affect the DATA filehandle,
since that's what's used to read your source code.)

Ben
