CGI and UTF-8 [Perl]

From: Helmut Richter on 28 Sep 2009 09:41

I have the task of describing for authors how to prepare forms with CGI scripts
in Perl, in particular, how to modify existing scripts to conform to a new
CMS. By now the CGI-generated pages are all encoded in UTF-8.

If I have understood everything correctly, the cooperation of the standard CGI
module and the Encode module is utterly tedious, as explained below. Perhaps
I have not seen the obvious.

Dealing with UTF-8 requires that byte strings and text strings are
meticulously kept apart. Now, one of the functions of the CGI module is the
reuse of the last input as default for the next time. But the input is a byte
string, so the default value must be a byte string as well. An example:

We want to ask for a location and provide "München" (Munich's German name)
as the default answer in the form. The obvious, but wrong, way
would be

$cgi->textfield(-name =>'ort', -value => 'München', -size => 40)

but that would interpret the string 'München' as a text string. This is always
wrong: Either STDOUT is binary, then the wide character will hurt. Or else,
STDOUT is UTF-8 (that is, binmode (STDOUT, ":utf8"); has been done), then the
value, if not modified by the user of the form, comes back as something else,
in this case as 'München' with the two bytes of the one UTF-8 character
interpreted as two characters. After all, there is no way to do the equivalent
of binmode for the post method of CGI.
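
The effect described above can be reproduced without a browser. The following is a minimal sketch using only the Encode module; the helper name misread_as_latin1 is made up for illustration, and the umlaut is written as an escape so the result does not depend on how the script file itself is encoded.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Text string: 7 characters. The umlaut is written as an escape so that
# the example does not depend on the encoding of this source file.
my $text = "M\x{FC}nchen";

# What travels over the wire on a UTF-8 page: 8 octets,
# because the umlaut becomes the two bytes 0xC3 0xBC.
my $bytes = encode('UTF-8', $text);

# Hypothetical helper: misread UTF-8 octets as Latin-1,
# that is, one character per byte.
sub misread_as_latin1 {
    my ($octets) = @_;
    return decode('ISO-8859-1', $octets);
}

my $garbled = misread_as_latin1($bytes);

printf "text:    %d characters\n", length $text;
printf "bytes:   %d octets\n",     length $bytes;
printf "garbled: %d characters\n", length $garbled;
```

The garbled string has one character more than the original, which is exactly the 'München' symptom: two bytes of one character read as two characters.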

The only work-around which I have found is to use byte strings consistently:

$Muenchen = encode ('utf8', 'München');
$cgi->textfield(-name =>'ort', -value => $Muenchen, -size => 40)

This works but has the drawback that an extra step of decoding all input
values to text strings is required when the interaction with the user of
the form is over.
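
As a rough illustration of this work-around, the round trip of a default value might look like the sketch below. The helper names to_wire and from_wire are hypothetical, and the step where the browser returns the value unchanged is simulated by a plain assignment.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helpers, named here only for illustration.
sub to_wire   { return encode('UTF-8', $_[0]) }   # text string -> byte string
sub from_wire { return decode('UTF-8', $_[0]) }   # byte string -> text string

my $default = "M\x{FC}nchen";        # text string, 7 characters

# Byte string that would be handed to textfield(-value => ...).
my $sent = to_wire($default);

# Assumption: the user does not touch the field, so the browser
# posts back the very same octets.
my $received = $sent;

# The extra decoding step once the form has been submitted.
my $value = from_wire($received);

print "round trip ok\n" if $value eq $default;
```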

I have the suspicion that I am thinking too complicated and that there is a
simple -- and simple to explain -- method for dealing with CGI forms when the
encoding used is UTF-8.

--
Helmut Richter

From: Jürgen Exner on 28 Sep 2009 11:58

Helmut Richter <hhr-m(a)web.de> wrote:
>We want to ask for a location and provide the default answer "München"[...]
>
> $cgi->textfield(-name =>'ort', -value => 'München', -size => 40)
>
>but that would interpret the string 'München' as a text string. This is always
>wrong: Either STDOUT is binary, then the wide character will hurt. Or else,
>STDOUT is UTF-8 (that is, binmode (STDOUT, ":utf8"); has been done), then the
>value, if not modified by the user of the form, comes back as something else,
>in this case as 'München' with the two bytes of the one UTF-8 character
>interpreted as two characters. After all, there is no way to do the equivalent
>of binmode for the post method of CGI.

I assume you did set the META charset of the HTML page to UTF-8? Or did
you let the browser guess about the encoding and then it returned the
wrong encoding in the form response?

jue

From: Peter J. Holzer on 28 Sep 2009 14:04

On 2009-09-28 13:41, Helmut Richter <hhr-m(a)web.de> wrote:
[the usual problems with CGI and UTF-8]
> I have the suspicion that I am thinking too complicated and that there is a
> simple -- and simple to explain -- method for dealing with CGI forms when the
> code used is UTF-8.
>

AFAICT no. Newer versions of CGI have some UTF-8 support, but it isn't
documented at all. In previous threads I've poked around a bit in it
and posted what I found:

* news:slrng4ln1q.h0v.hjp-usenet2(a)hrunkner.hjp.at
http://groups.google.at/groups/search?as_umsgid=slrng4ln1q.h0v.hjp-usenet2%40hrunkner.hjp.at&hl=en

* news:slrnghu894.1qq.hjp-usenet2(a)hrunkner.hjp.at
http://groups.google.at/groups/search?as_umsgid=slrnghu894.1qq.hjp-usenet2%40hrunkner.hjp.at&hl=en

Hope that gives you a starting point.

hp

From: Jochen Lehmeier on 28 Sep 2009 15:39

On Mon, 28 Sep 2009 15:41:49 +0200, Helmut Richter <hhr-m(a)web.de> wrote:

> If I have understood everything correctly, the cooperation of the
> standard CGI module and the Encode module is utterly tedious, as
> explained below. Perhaps I have not seen the obvious.

Perhaps. I don't exactly know what's going on with your code. I have only had
good results when using existing CGI scripts with utf8. That is, scripts that
used to run with latin1 were deployed "as is" in a utf8 setting.

The biggest issues I ran into were with DBD::Oracle, which has some very ugly
problems in the utf8 world indeed (which, to be honest, are documented as
"features"), but that is a different story, not related to CGI.

> Dealing with UTF-8 requires that byte strings and text strings are
> meticulously kept apart.

Uhm. What are byte strings, what are text strings? Perl does not use these
words in the context of utf8.

> else, STDOUT is UTF-8 (that is, binmode (STDOUT, ":utf8"); has been
> done),

This should not be done. The correct line would be

binmode STDOUT,":encoding(utf8)";

This activates error checking etc., while your version treats strings as utf8
without checking them at all, which could lead to bad_things[tm] (some docs
hinted at segmentation faults even, though I do not know if that is true).
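
What "checking" means here can be seen with the Encode module alone. The sketch below is only an illustration: the helper is_valid_utf8 is a made-up name, and it uses Encode's strict "UTF-8" encoding name together with FB_CROAK, which is a choice made for this example rather than something taken from the post.

```perl
use strict;
use warnings;
use Encode;

# Hypothetical helper: returns 1 if $octets is well-formed UTF-8, else 0.
sub is_valid_utf8 {
    my ($octets) = @_;
    my $copy = $octets;   # decode() with a CHECK value may modify its input
    my $ok = eval { Encode::decode('UTF-8', $copy, Encode::FB_CROAK()); 1 };
    return $ok ? 1 : 0;
}

my $good = "M\xC3\xBCnchen";   # UTF-8 encoding of the 7-character name
my $bad  = "M\xFCnchen";       # Latin-1 encoding, not valid UTF-8

print "good: ", is_valid_utf8($good), "\n";
print "bad:  ", is_valid_utf8($bad),  "\n";
```

A layer or decoder that skips this check would accept the second string as if it were UTF-8 and hand malformed data to the rest of the program.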

> in this case as 'München' with the two bytes of the one UTF-8 character
> interpreted as two characters. After all, there is no way to do the
> equivalent of binmode for the post method of CGI.

Sure there is:

binmode STDIN,":encoding(utf8)";
$query=new CGI();

If for some reason you cannot run the binmode before you create the $query
object (this happened to me for some reason I won't go into), then it's no
problem either. Then you can convert the parameters after "new CGI()" has
read them from STDIN:

# Warning: treat this as PSEUDO-CODE, it is from memory only
use Encode ();
$query = CGI->new;
foreach $key ($query->param)
{
    # scalar() keeps param() from returning a list, which decode() would misread
    $query->param($key, Encode::decode("utf8", scalar $query->param($key)));

    # Treating file upload parameters and multi-value parameters is left
    # as an exercise for the reader.
}
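
For the two cases left as an exercise, one possible shape is sketched below. It deliberately works on a plain hash instead of a CGI object so that it runs with core modules only; the helper decode_params and the field names ort, orte and upload are invented for the example, and how the raw values get into the hash is not shown.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: decode form values held in a plain hash.
# Values may be scalars or array references (multi-value fields).
# Fields named in %$skip (for example binary uploads) stay byte strings.
# Parameter names are assumed to be ASCII and are not decoded.
sub decode_params {
    my ($raw, $skip) = @_;
    $skip ||= {};
    my %text;
    for my $name (keys %$raw) {
        my $value = $raw->{$name};
        if ($skip->{$name}) {
            $text{$name} = $value;
        }
        elsif (ref $value eq 'ARRAY') {
            $text{$name} = [ map { Encode::decode('UTF-8', $_) } @$value ];
        }
        else {
            $text{$name} = Encode::decode('UTF-8', $value);
        }
    }
    return \%text;
}

my $params = decode_params(
    {
        ort    => "M\xC3\xBCnchen",                     # single value
        orte   => [ "K\xC3\xB6ln", "Z\xC3\xBCrich" ],   # multi-value field
        upload => "\xFF\xD8\xFF",                       # binary data
    },
    { upload => 1 },
);

printf "ort has %d characters\n", length $params->{ort};
```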

> I have the suspicion that I am thinking too complicated

Aye. ;-)

> and that there is a simple -- and simple to explain -- method for
> dealing with CGI forms when the code used is UTF-8.

binmode ... ":encoding(utf8)" on both STDIN and STDOUT. Plus proper
declaration of the charset for your browser (in the HTTP header and the HTML
header, just to be sure).
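
What such an encoding layer does to a handle can be tried out on in-memory filehandles instead of STDIN and STDOUT. This is only a sketch of the layer mechanism: it uses the strict spelling :encoding(UTF-8), and it makes no claim about how a particular CGI.pm version reads from a layered STDIN.

```perl
use strict;
use warnings;

my $text = "M\x{FC}nchen";   # text string, 7 characters

# Output side: the layer turns characters into UTF-8 octets on print.
my $wire = '';
open(my $out, '>:encoding(UTF-8)', \$wire) or die "open for write: $!";
print {$out} $text;
close($out) or die "close: $!";

# Input side: the layer turns UTF-8 octets back into characters on read.
open(my $in, '<:encoding(UTF-8)', \$wire) or die "open for read: $!";
my $read = <$in>;
close($in);

printf "on the wire: %d octets\n",     length $wire;
printf "read back:   %d characters\n", length $read;
```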

Good luck!

From: sln on 28 Sep 2009 16:13

On Mon, 28 Sep 2009 15:41:49 +0200, Helmut Richter <hhr-m(a)web.de> wrote:

>I have the task of describing for authors how to prepare forms by CGI scripts
>in perl, in particular, how to modify existing scripts to conform to a new
>CMS. Meanwhile the CGI-generated pages are all in code UTF-8.
>

<snip>

>This works but has the drawback that an extra step of decoding all input
>values to text strings is required when the interaction with the user of
>the form is over.
>
>I have the suspicion that I am thinking too complicated and that there is a
>simple -- and simple to explain -- method for dealing with CGI forms when the
>encoding used is UTF-8.

With Perl 5.10, the CGI.pm version is $CGI::VERSION='3.41'.
After some poking around in it, it looks as though it does all its filehandle
work in binary mode (more so for the uploads, I guess).

Without specifying the charset in CGI, my browser will display these
CGI-generated literal strings ('München' and 'München') as:

München München - Western European (guessed)
München M? - UTF-8 (user forced)

The result is the same as the second one if the HTML form is set to
charset utf-8.

If the form is coming back as 'München', which is utf-8, does that mean
you set the html charset to utf-8? I mean, it shouldn't otherwise, should it?

For OUTPUT, it's better to set the charset to utf-8 and then either encode
those strings that are Unicode (ASCII doesn't matter), as in the two lines
below, or set the binmode of STDOUT to :utf8 if you want it applied to
everything.
$Muenchen = encode ('utf8', 'München');
$cgi->textfield(-name =>'ort', -value => $Muenchen, -size => 40)

For form INPUT, CGI.pm will auto-decode UTF-8 in all form parameters for you
when you query them. It's the same decode you did above.
This can be enabled with a pragma in the use CGI statement, like
use CGI qw/:standard -utf8/;
Apparently this pragma will only decode input.
From the docs:
" PRAGMAS
-utf8
This makes CGI.pm treat all parameters as UTF-8 strings.
Use this with care, as it will interfere with the processing of binary uploads.
It is better to manually select which fields are expected to return utf-8 strings
and convert them using code like this:
use Encode;
my $arg = decode utf8=>param('foo');
"

No matter how you look at it, if you need utf8 for input/output, there will be some
encode/decode going on somewhere.

You can avoid the encoding hassle by setting the binmode
of STDOUT to utf8; then this is OK:
$cgi->textfield(-name =>'ort', -value => 'München', -size => 40)
And if you don't expect any binary upload data (input), you can
avoid the decode hassle by setting the -utf8 pragma for the
form input parameters.
Then set the charset to utf-8.
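
Put together, the output side might look roughly like the sketch below, written without CGI.pm so that it runs with core modules only. The helpers escape_html and build_response are invented for the example; in a real script CGI.pm's header() takes a -charset argument for the HTTP header part.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: escape characters that are special in HTML attributes.
sub escape_html {
    my ($s) = @_;
    $s =~ s/&/&amp;/g;
    $s =~ s/</&lt;/g;
    $s =~ s/>/&gt;/g;
    $s =~ s/"/&quot;/g;
    return $s;
}

# Hypothetical helper: build a complete CGI response as one byte string.
# The charset is declared twice, in the HTTP header and in the HTML head.
sub build_response {
    my ($default) = @_;
    my $html = join '',
        qq{<html><head><meta http-equiv="Content-Type" },
        qq{content="text/html; charset=utf-8"></head><body>\n},
        qq{<form method="post"><input type="text" name="ort" },
        qq{value="}, escape_html($default), qq{" size="40"></form>\n},
        qq{</body></html>\n};
    # Work with text strings up to this point, encode once at the end.
    return "Content-Type: text/html; charset=utf-8\r\n\r\n"
         . Encode::encode('UTF-8', $html);
}

my $response = build_response("M\x{FC}nchen");

binmode STDOUT;        # the response is already a byte string
print $response;
```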

Good luck!
-sln
