CGI and UTF-8 [Perl]


From: sln on 2 Oct 2009 14:31

On Fri, 2 Oct 2009 17:04:04 +0200, Helmut Richter <hhr-m(a)web.de> wrote:

>On Mon, 28 Sep 2009, sln(a)netherlands.com wrote:
>
>> From the docs:
>> " PRAGMAS
>> -utf8
>> This makes CGI.pm treat all parameters as UTF-8 strings.
>> Use this with care, as it will interfere with the processing of binary uploads.
>
>This is the same problem for *both* solutions offered in this thread:
>the utf8 pragma, and setting binmode on both STDIN and STDOUT. In fact, I
>have the suspicion that the effect of the pragma is not much more than
>such a setting.

Actually, this does NOT refer to the "use utf8;" pragma, which is not a solution here.
It is an import parameter that sets a private variable inside CGI.pm. When the user
sets it, CGI.pm decodes the parameters automatically (so you don't have to).
I haven't used CGI myself: why does STDIN need binmode set to utf8? For the same
reason STDOUT needs it?

Anyway, here is some code from CGI.pm (the relevant lines are marked with '-->' in
the margin). It is worth looking at the actual module once in a while to glean some insight.
==============
# >>>>> Here are some globals that you might want to adjust <<<<<<
sub initialize_globals {
    ...
    # return everything as utf-8
--> $PARAM_UTF8 = 0;
    ...
}
sub _setup_symbols {
    my $self = shift;
    my $compile = 0;

    # to avoid reexporting unwanted variables
    undef %EXPORT;

    foreach (@_) {
        $HEADERS_ONCE++, next if /^[:-]unique_headers$/;
        $NPH++, next if /^[:-]nph$/;
        $NOSTICKY++, next if /^[:-]nosticky$/;
        $DEBUG=0, next if /^[:-]no_?[Dd]ebug$/;
        $DEBUG=2, next if /^[:-][Dd]ebug$/;
        $USE_PARAM_SEMICOLONS++, next if /^[:-]newstyle_urls$/;
-->     $PARAM_UTF8++, next if /^[:-]utf8$/;
        $XHTML++, next if /^[:-]xhtml$/;
        $XHTML=0, next if /^[:-]no_?xhtml$/;
        $USE_PARAM_SEMICOLONS=0, next if /^[:-]oldstyle_urls$/;
        $PRIVATE_TEMPFILES++, next if /^[:-]private_tempfiles$/;
        $TABINDEX++, next if /^[:-]tabindex$/;
        $CLOSE_UPLOAD_FILES++, next if /^[:-]close_upload_files$/;
        $EXPORT{$_}++, next if /^[:-]any$/;
        $compile++, next if /^[:-]compile$/;
        $NO_UNDEF_PARAMS++, next if /^[:-]no_undef_params$/;

        # This is probably extremely evil code -- to be deleted some day.
        if (/^[-]autoload$/) {
            my($pkg) = caller(1);
            *{"${pkg}::AUTOLOAD"} = sub {
                my($routine) = $AUTOLOAD;
                $routine =~ s/^.*::/CGI::/;
                &$routine;
            };
            next;
        }

        foreach (&expand_tags($_)) {
            tr/a-zA-Z0-9_//cd; # don't allow weird function names
            $EXPORT{$_}++;
        }
    }
    _compile_all(keys %EXPORT) if $compile;
    @SAVED_SYMBOLS = @_;
}
#### Method: param
# Returns the value(s)of a named parameter.
# If invoked in a list context, returns the
# entire list. Otherwise returns the first
# member of the list.
# If name is not provided, return a list of all
# the known parameters names available.
# If more than one argument is provided, the
# second and subsequent arguments are used to
# set the value of the parameter.
####
sub param {
    my($self,@p) = self_or_default(@_);
    return $self->all_parameters unless @p;
    my($name,$value,@other);

    # For compatibility between old calling style and use_named_parameters() style,
    # we have to special case for a single parameter present.
    if (@p > 1) {
        ($name,$value,@other) = rearrange([NAME,[DEFAULT,VALUE,VALUES]],@p);
        my(@values);

        if (substr($p[0],0,1) eq '-') {
            @values = defined($value) ? (ref($value) && ref($value) eq 'ARRAY' ? @{$value} : $value) : ();
        } else {
            foreach ($value,@other) {
                push(@values,$_) if defined($_);
            }
        }
        # If values is provided, then we set it.
        if (@values or defined $value) {
            $self->add_parameter($name);
            $self->{param}{$name}=[@values];
        }
    } else {
        $name = $p[0];
    }

    return unless defined($name) && $self->{param}{$name};

    my @result = @{$self->{param}{$name}};

--> if ($PARAM_UTF8) {
-->     eval "require Encode; 1;" unless Encode->can('decode'); # bring in these functions
-->     @result = map {ref $_ ? $_ : Encode::decode(utf8=>$_) } @result;
    }

    return wantarray ? @result : $result[0];
}
=====================
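
To see what the marked lines amount to, here is a minimal sketch that
imitates the decoding step without loading CGI.pm at all. The hash
%raw_param and the function decoded_param are hypothetical stand-ins
invented for this illustration; only the Encode::decode call mirrors the
excerpt above.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical stand-in for the raw (undecoded) parameter store.
# The value holds the UTF-8 bytes for "J\x{FC}rgen" (u with diaeresis).
my %raw_param = ( name => [ "J\xC3\xBCrgen" ] );

# Imitates the tail end of param(): optionally decode every non-reference
# value from UTF-8 bytes into Perl characters.
sub decoded_param {
    my ($name, $decode_utf8) = @_;
    my @result = @{ $raw_param{$name} || [] };
    if ($decode_utf8) {
        @result = map { ref $_ ? $_ : Encode::decode('utf8', $_) } @result;
    }
    return wantarray ? @result : $result[0];
}

my $bytes = decoded_param('name', 0);   # what you get without -utf8
my $chars = decoded_param('name', 1);   # what you get with -utf8

printf "without decoding: length %d\n", length $bytes;   # counts bytes
printf "with decoding:    length %d\n", length $chars;   # counts characters
```

The references that are skipped by "ref \$_ ? \$_ : ..." are presumably
things like upload filehandles, which is why the documentation warns
about binary uploads.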

>
>> It is better to manually select which fields are expected to return utf-8 strings
>> and convert them using code like this:
>> use Encode;
>> my $arg = decode utf8=>param('foo');
>
>This is much less than half of the story. Getting a single parameter is a
>fairly easy thing to do, with or without the CGI module.

As stated above, you can tell CGI.pm to decode ALL input parameters for you,
so you don't have to do it for each one individually.

>Using the CGI
>module for producing HTML is only a very cumbersome way of writing
>something in a complicated syntax that is much easier written directly in
>HTML. For which task does the CGI module offer significant help, compared
>with simply outputting HTML and analysing the input?
>
>One of the (relatively few) things that are easier with the CGI module
>than without is reusing the input values as defaults for the same form
>when it must be output again because of incompletely or wrongly filled-in
>values. Now, if I have to touch every single value, decode it, and store
>it back into the structure, I could have hand-programmed that reuse with
>not more effort.

Yeah, there is no argument there. The Unicode (flavor utf-8) gatekeeper for
Perl is usually the file i/o layers, which cover both input and output.
For example, the :utf8 layer on handles.

However, binary text can creep into data variables at various times and via
various paths. I guess it's up to you to know where and when this can happen.
Decode in, encode out if that's the case. But don't encode more than once,
and don't encode data that hasn't been decoded.
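
A minimal sketch of "decode in, encode out", using only the core Encode
module. The sample string is made up for illustration, and the strict
'UTF-8' encoding name is a choice made here, not something taken from
CGI.pm. The last step shows what encoding twice does to the data.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Bytes as they might arrive from a form: "caf" plus e-acute in UTF-8.
my $input_bytes = "caf\xC3\xA9";

# Decode in: from here on we work with characters, not bytes.
my $text = decode('UTF-8', $input_bytes);

# Character-level operations now behave as expected.
my $shouted = uc $text;

# Encode out, exactly once, just before the data leaves the program.
my $output_bytes = encode('UTF-8', $shouted);

# Encoding a second time treats each byte as a character and encodes it
# again, which produces the familiar mojibake.
my $double_encoded = encode('UTF-8', $output_bytes);

printf "input bytes:    %d\n", length $input_bytes;
printf "characters:     %d\n", length $text;
printf "output bytes:   %d\n", length $output_bytes;
printf "double encoded: %d\n", length $double_encoded;
```

Setting an encoding layer on STDOUT does the "encode out" step for you,
in which case the explicit encode() call must be dropped, or the output
ends up encoded twice as in the last line.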

-sln

From: Peter J. Holzer on 2 Oct 2009 16:42

On 2009-10-02 17:32, Ben Morrow <ben(a)morrow.me.uk> wrote:
> Quoth Helmut Richter <hhr-m(a)web.de>:
>> This is the same problem for *both* solutions offered in this thread:
>> the utf8 pragma, and setting binmode on both STDIN and STDOUT. In fact, I
>> have the suspicion that the effect of the pragma is not much more than
>> such a setting.
>
> There's no need to suspect: read the docs. The only effect of the utf8
> pragma is to tell perl that your source code is written in UTF-8.

I think he meant the -utf8 pragma of the CGI module, i.e.

use CGI qw/-utf8/;
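
The difference between the two is easy to mix up, so here is a small
sketch of what the source-code pragma "use utf8" does on its own. It
assumes this script file is itself saved as UTF-8; the variable names are
made up for the illustration. Nothing here touches form input, which is
exactly the point.

```perl
use strict;
use warnings;

# "use utf8" is lexically scoped and only affects how perl reads string
# literals (and identifiers) in the source file itself.
my $with_pragma    = do { use utf8; length "ä" };   # literal read as one character
my $without_pragma = do { no utf8;  length "ä" };   # literal read as raw bytes

print "with use utf8:    $with_pragma\n";
print "without use utf8: $without_pragma\n";
```

Data arriving through CGI parameters, STDIN or files is not affected by
this pragma; it still has to be decoded, whether by CGI.pm's -utf8
switch, by an I/O layer, or by hand.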

hp

From: Ben Bullock on 3 Oct 2009 06:51

On Oct 3, 12:04 am, Helmut Richter <hh...(a)web.de> wrote:
> On Mon, 28 Sep 2009, s...(a)netherlands.com wrote:
> > From the docs:
> > " PRAGMAS
> > -utf8
> > This makes CGI.pm treat all parameters as UTF-8 strings.
> > Use this with care, as it will interfere with the processing of binary uploads.
>
> This is the same problem for *both* solutions offered in this thread:
> the utf8 pragma, and setting binmode on both STDIN and STDOUT. In fact, I
> have the suspicion that the effect of the pragma is not much more than
> such a setting.

This web page has a working example of what sln is discussing:

http://www.lemoda.net/perl/strip-diacritics/

The line

use CGI '-utf8';

makes CGI.pm do what you seem to want it to do: convert the input from
the form into Perl's "utf8" or "character semantics" or whatever
they're calling it these days.

(I tried to send this by a free server & that seems to have failed, so
I am resending via Google Groups. Apologies if this message turns up
twice.)

From: cmic on 4 Oct 2009 15:38

Hello

On 29 sep, 00:46, Ben Morrow <b...(a)morrow.me.uk> wrote:
....
> Of course, perl's definition of 'utf8' is different from the Unicode
> Consortium's 'UTF-8': the standard forbids representations of surrogates
> and unassigned codepoints (and possibly other things I've forgotten). If
> you want perl to enforce these restrictions you need to ask for it with
> :encoding(UTF-8) (this appears to only be documented in perldoc Encode).

OK. I can't find any clear explanation of this in perldoc Encode, though;
I did find one in perldoc -f binmode.

Thank you for allowing me to rehearse this point though
Rgds
--
michel maron caka cmic

>
> Ben
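
A short sketch of the 'utf8' versus 'UTF-8' distinction, using only the
core Encode module. The byte string is a hand-built UTF-8-style encoding
of the surrogate U+D800, chosen here as a test case. How the lax
encoding treats it may vary between Encode versions, so the script just
reports what happens rather than assuming a result.

```perl
use strict;
use warnings;
use Encode qw(decode find_encoding);

# The two spellings resolve to two different Encode encodings.
printf "%-6s => %s\n", $_, find_encoding($_)->name for 'utf8', 'UTF-8';

# UTF-8-style bytes for the surrogate U+D800, which the standard forbids.
my $surrogate_bytes = "\xED\xA0\x80";

for my $name ('utf8', 'UTF-8') {
    my $copy = $surrogate_bytes;   # decode() with a CHECK argument may modify its input
    my $ok = eval { decode($name, $copy, Encode::FB_CROAK); 1 };
    printf "%-6s %s the surrogate bytes\n", $name, $ok ? 'accepted' : 'rejected';
}

# The same choice exists for I/O layers: this one asks for the strict variant.
binmode STDOUT, ':encoding(UTF-8)';
```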

From: Helmut Richter on 6 Oct 2009 11:12

On Sat, 3 Oct 2009, Ben Bullock wrote:

> This web page has a working example of what sln is discussing:
>
> http://www.lemoda.net/perl/strip-diacritics/

This is an entirely different issue, albeit also an interesting one.

It occurs when data are expected in a restricted character set, e.g. in
ISO-8859-1, but are input via a medium that allows UTF-8, e.g. a Web form.
Then an encode function is needed that would not blow up with the first
character that is not in the target character set.

It is not easy to provide a standard solution because the solution is
dependent on the target character set, where US-ASCII is not the only
conceivable one. For instance, real quotes (U+201C/D/E) should be mapped
to ASCII quotes (U+0022) if and only if the target character set does not
contain them (e.g. they are not contained in ISO-8859-1 but contained in
its Windows cousin CP1252).
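
One way to get the "if and only if" behaviour is Encode's coderef
fallback, which is called only for characters the target encoding cannot
represent. This is a minimal sketch: the function name
encode_with_fallback and the substitution table are invented for the
illustration, and the table is deliberately tiny.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical substitution table: code point => replacement bytes.
my %substitute = (
    0x201C => '"',   # left double quotation mark
    0x201D => '"',   # right double quotation mark
    0x201E => '"',   # double low-9 quotation mark
);

# Encode $text into $encoding. The callback runs only for characters that
# are missing from the target character set, so characters the target
# does contain are left alone.
sub encode_with_fallback {
    my ($encoding, $text) = @_;
    return encode($encoding, $text, sub {
        my ($ord) = @_;
        return $substitute{$ord} // '?';
    });
}

my $quoted = "\x{201C}na\x{EF}ve\x{201D}";

my $latin1 = encode_with_fallback('iso-8859-1', $quoted);   # quotes replaced
my $cp1252 = encode_with_fallback('cp1252',     $quoted);   # quotes kept

printf "iso-8859-1: %s\n", join ' ', map { sprintf '%02X', ord } split //, $latin1;
printf "cp1252:     %s\n", join ' ', map { sprintf '%02X', ord } split //, $cp1252;
```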

The solution presented at that URL covers only the trivial (but useful
because many characters are affected) case where stripping diacritics does
the job. Many others (quotes, ordinary mathematical symbols, etc.) must be
given reasonable substitutes by hand.

And, of course, different cultures may differ on which substitutes do the job best.
For a German, an "ä" is in any case to be rendered as "ae"; for Finnish or
Swedish people, this is not the right thing to do.
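
A sketch of diacritic stripping with an optional language-specific table
applied first, using the core Unicode::Normalize module. This is not the
code from the page cited above; the function strip_diacritics and the
%german table are made up for the illustration, and the table is not
meant to be complete.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical German-specific substitutions, applied before stripping.
my %german = (
    "\x{E4}" => 'ae', "\x{F6}" => 'oe', "\x{FC}" => 'ue',
    "\x{C4}" => 'Ae', "\x{D6}" => 'Oe', "\x{DC}" => 'Ue',
    "\x{DF}" => 'ss',
);

# Replace characters listed in the optional table, then decompose the
# rest and drop the combining marks.
sub strip_diacritics {
    my ($text, $special) = @_;
    $special ||= {};
    $text =~ s/(.)/exists $special->{$1} ? $special->{$1} : $1/ges;
    my $decomposed = NFD($text);
    $decomposed =~ s/\p{Mn}//g;
    return $decomposed;
}

my $word = "M\x{E4}dchen";
print strip_diacritics($word), "\n";              # generic stripping
print strip_diacritics($word, \%german), "\n";    # German convention
```

Passing a different table, or none at all, is how the same function
would serve Finnish or Swedish text.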

--
Helmut Richter
