From: Helmut Richter on 28 Sep 2009 09:41

I have the task of describing for authors how to prepare forms with CGI scripts in Perl, in particular how to modify existing scripts to conform to a new CMS. The CGI-generated pages are by now all encoded in UTF-8.

If I have understood everything correctly, getting the standard CGI module and the Encode module to cooperate is utterly tedious, as explained below. Perhaps I have not seen the obvious.

Dealing with UTF-8 requires that byte strings and text strings are meticulously kept apart. Now, one of the functions of the CGI module is the reuse of the last input as default for the next time. But the input is a byte string, so the default value must be a byte string as well.

An example: We want to ask for a location and provide the answer "München" (Munich's German name) as the default in the form. The obvious, but wrong, way would be

    $cgi->textfield(-name => 'ort', -value => 'München', -size => 40)

but that would interpret the string 'München' as a text string. This is always wrong: Either STDOUT is binary, then the wide character will hurt. Or else STDOUT is UTF-8 (that is, binmode (STDOUT, ":utf8"); has been done), then the value, if not modified by the user of the form, comes back as something else, in this case as 'München', with the two bytes of the one UTF-8 character interpreted as two characters. After all, there is no way to do the equivalent of binmode for the POST method of CGI.

The only work-around I have found is to use byte strings consistently:

    $Muenchen = encode ('utf8', 'München');
    $cgi->textfield(-name => 'ort', -value => $Muenchen, -size => 40)

This works but has the drawback that an extra step of decoding all input values to text strings is required when the interaction with the user of the form is over.

I have the suspicion that I am thinking too complicated and that there is a simple -- and simple to explain -- method for dealing with CGI forms when the code used is UTF-8.

-- 
Helmut Richter
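The garbling described in this post can be reproduced with the Encode module alone. The following is a minimal sketch, not taken from the thread; it assumes the script file itself is saved as UTF-8 (hence `use utf8`) and the variable names are illustrative only.

```perl
use strict;
use warnings;
use utf8;                               # string literals in this file are text strings
use Encode qw(encode decode);

my $text  = 'München';                  # text string: 7 characters
my $bytes = encode('UTF-8', $text);     # byte string: 8 bytes, the u-umlaut becomes 0xC3 0xBC

# Misreading the UTF-8 bytes as Latin-1 turns the two bytes into two characters.
my $garbled = decode('ISO-8859-1', $bytes);

# Decoding with the matching encoding restores the original text string.
my $round_trip = decode('UTF-8', $bytes);

binmode STDOUT, ':encoding(UTF-8)';
printf "%s: %d characters, %d bytes\n", $text, length $text, length $bytes;
print "misread as Latin-1: $garbled\n";
print "decoded as UTF-8:   $round_trip\n";
```

The garbled form appears whenever a byte string is treated as if it were already a text string, which is what happens to a form value that is returned as bytes and never decoded.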
From: Jürgen Exner on 28 Sep 2009 11:58

Helmut Richter <hhr-m(a)web.de> wrote:
>We want to ask for a location and provide the default answer "München"[...]
>
> $cgi->textfield(-name =>'ort', -value => 'München', -size => 40)
>
>but that would interpret the string 'München' as a text string. This is always
>wrong: Either STDOUT is binary, then the wide character will hurt. Or else,
>STDOUT is UTF-8 (that is, binmode (STDOUT, ":utf8"); has been done), then the
>value, if not modified by the user of the form, comes back as something else,
>in this case as 'München' with the two bytes of the one UTF-8 character
>interpreted as two characters. After all, there is no way to do the equivalent
>of binmode for the post method of CGI.

I assume you did set the META charset of the HTML page to UTF-8? Or did you let the browser guess about the encoding and then it returned the wrong encoding in the form response?

jue
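To make the charset question concrete, here is a small sketch of a response that declares UTF-8 both in the HTTP header and in a META element, and encodes the body to match. It deliberately avoids CGI.pm so that it runs with core modules only; `utf8_response` is a hypothetical helper name, and with CGI.pm the header part would normally come from `header(-charset => 'UTF-8')` instead.

```perl
use strict;
use warnings;
use utf8;                       # assumes this file is saved as UTF-8
use Encode qw(encode);

# Hypothetical helper: build a complete CGI response as a byte string.
# $body_text is a text string; it is NOT HTML-escaped here, so pass
# trusted text only.
sub utf8_response {
    my ($body_text) = @_;
    my $html = <<"HTML";
<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Ort</title>
</head>
<body>
<p>$body_text</p>
</body>
</html>
HTML
    # The header names the charset; the body is encoded to agree with it.
    return "Content-Type: text/html; charset=UTF-8\r\n\r\n"
         . encode('UTF-8', $html);
}

# STDOUT has no encoding layer here because the response is already bytes.
print utf8_response('München');
```

When the page declares its charset this way, browsers generally submit the form data in the same encoding, so there is nothing left for the browser to guess.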
From: Peter J. Holzer on 28 Sep 2009 14:04

On 2009-09-28 13:41, Helmut Richter <hhr-m(a)web.de> wrote:
[the usual problems with CGI and UTF-8]
> I have the suspicion that I am thinking too complicated and that there is a
> simple -- and simple to explain -- method for dealing with CGI forms when the
> code used is UTF-8.
>

AFAICT no. Newer versions of CGI have some UTF-8 support, but it isn't documented at all. In previous threads I've poked around a bit in it and posted what I found:

* news:slrng4ln1q.h0v.hjp-usenet2(a)hrunkner.hjp.at
  http://groups.google.at/groups/search?as_umsgid=slrng4ln1q.h0v.hjp-usenet2%40hrunkner.hjp.at&hl=en
* news:slrnghu894.1qq.hjp-usenet2(a)hrunkner.hjp.at
  http://groups.google.at/groups/search?as_umsgid=slrnghu894.1qq.hjp-usenet2%40hrunkner.hjp.at&hl=en

Hope that gives you a starting point.

hp
From: Jochen Lehmeier on 28 Sep 2009 15:39

On Mon, 28 Sep 2009 15:41:49 +0200, Helmut Richter <hhr-m(a)web.de> wrote:
> If I have understood everything correctly, the cooperation of the
> standard CGI module and the Encode module is utterly tedious, as
> explained below. Perhaps I have not seen the obvious.

Perhaps. I don't exactly know what's going on with your code. I have only had good results when using existing CGI scripts with utf8. That is, scripts that used to run with latin1 were deployed "as is" in a utf8 setting. The biggest issues I ran into were with DBD::Oracle, which has some very ugly problems in the utf8 world indeed (which, to be honest, are documented as "features"), but that is a different story, not related to CGI.

> Dealing with UTF-8 requires that byte strings and texts strings are
> meticulously kept apart.

Uhm. What are byte strings, what are text strings? Perl does not use these words in the context of utf8.

> else, STDOUT is UTF-8 (that is, binmode (STDOUT, ":utf8"); has been
> done),

This should not be done. The correct line would be

    binmode STDOUT, ":encoding(utf8)";

This activates error checking etc., while your version treats strings as utf8 without checking them at all, which could lead to bad_things[tm] (some docs even hinted at segmentation faults, though I do not know if that is true).

> in this case as 'München' with the two bytes of the one UTF-8 character
> interpreted as two characters. After all, there is no way to do the
> equivalent of binmode for the post method of CGI.

Sure there is:

    binmode STDIN, ":encoding(utf8)";
    $query = CGI->new;

If for some reason you cannot run the binmode before you create the $query object (this happened to me for some reason I won't go into), then it's no problem either.
Then you can convert the parameters after CGI->new has read them from STDIN:

    # Warning, treat this as PSEUDO-CODE, it is from memory only
    use CGI;
    use Encode ();

    my $query = CGI->new;
    foreach my $key ($query->param) {
        # scalar() picks a single value; decode() takes one string, not a list
        $query->param($key, Encode::decode("utf8", scalar $query->param($key)));
        # Treating file upload parameters and multi-value parameters is left
        # as an exercise for the reader.
    }

> I have the suspicion that I am thinking too complicated

Aye. ;-)

> and that there is a simple -- and simple to explain -- method for
> dealing with CGI forms when the code used is UTF-8.

binmode ... ":encoding(utf8)" on both STDIN and STDOUT. Plus proper declaration of the charset for your browser (in the HTTP header and the HTML header, just to be sure).

Good luck!
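The multi-value case left as an exercise above could look roughly like the sketch below. It is an assumption-laden illustration, not code from the thread: `decoded_params` is a hypothetical helper, it expects any object with a CGI.pm-style `param` method, it prefers `multi_param` where the installed CGI.pm provides it, and it does not handle file upload fields, which should be skipped rather than decoded.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: return a hash ref mapping each parameter name to an
# array ref of decoded text strings. File upload fields are NOT handled.
sub decoded_params {
    my ($query) = @_;
    my %text;
    for my $name ($query->param) {
        # Newer CGI.pm releases offer multi_param() for list access;
        # fall back to param() in list context on older ones.
        my @values = $query->can('multi_param')
                   ? $query->multi_param($name)
                   : $query->param($name);
        # Every value arrives as a UTF-8 byte string and is decoded once.
        $text{$name} = [ map { decode('UTF-8', $_) } @values ];
    }
    return \%text;
}
```

With CGI.pm this might be called as `my $params = decoded_params(CGI->new);`, after which `$params->{ort}[0]` would hold the text string for the `ort` field. Returning a separate hash instead of writing back into the query object keeps the raw bytes available for fields that must stay binary.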
From: sln on 28 Sep 2009 16:13

On Mon, 28 Sep 2009 15:41:49 +0200, Helmut Richter <hhr-m(a)web.de> wrote:
>I have the task of describing for authors how to prepare forms by CGI scripts
>in perl, in particular, how to modify existing scripts to conform to a new
>CMS. Meanwhile the CGI-generated pages are all in code UTF-8.
>
<snip>
>This works but has the drawback that an extra step of decoding all input
>values to text strings is required when the interaction with the user of
>the form is over.
>
>I have the suspicion that I am thinking too complicated and that there is a
>simple -- and simple to explain -- method for dealing with CGI forms when the
>code used is UTF-8.

With Perl 5.10, the CGI.pm version is $CGI::VERSION='3.41';

After some poking around in it, it looks as though it does all its filehandle work in binary mode (more so for the uploads, I guess).

Without specifying the charset in CGI, my browser displays these CGI-generated literal strings ('München' 'München') as:

    München München - Western European (guessed)
    München M? - UTF-8 (user forced)

with the same result as the second one if the HTML form is set to charset utf-8.

If the form is coming back as 'München', which is utf-8, does that mean you set the HTML charset to utf-8? I mean, it shouldn't otherwise, should it?

For OUTPUT, it's better to set the charset to utf-8 and then encode those strings that are Unicode (ASCII doesn't matter), or to set the binmode of STDOUT to :utf8 if you want it applied to everything.

    $Muenchen = encode ('utf8', 'München');
    $cgi->textfield(-name => 'ort', -value => $Muenchen, -size => 40)

For form INPUT, CGI.pm can decode all form parameters from utf8 for you when you query them. It's the same decode you did above. This can be set with a pragma in the use CGI statement like

    use CGI qw/:standard -utf8/;

Apparently this pragma will only decode input. From the docs:

"
PRAGMAS
-utf8
This makes CGI.pm treat all parameters as UTF-8 strings.
Use this with care, as it will interfere with the processing of binary uploads. It is better to manually select which fields are expected to return utf-8 strings and convert them using code like this:

    use Encode;
    my $arg = decode utf8=>param('foo');
"

No matter how you look at it, if you need utf8 for input/output, there will be some encode/decode going on somewhere.

You can avoid the encoding hassle by setting the binmode of STDOUT to utf8; then this is ok:

    $cgi->textfield(-name => 'ort', -value => 'München', -size => 40)

And if you don't expect any binary upload data (input), you can avoid the decode hassle by setting the -utf8 pragma for the form input parameters. Then set the charset to utf-8.

Good luck!

-sln
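The two output strategies mentioned in this post (encode each string by hand, or put an encoding layer on the handle) should produce the same bytes. The sketch below checks that with in-memory filehandles standing in for STDOUT; the handle and variable names are illustrative, and it assumes the script file is saved as UTF-8.

```perl
use strict;
use warnings;
use utf8;                       # assumes this file is saved as UTF-8
use Encode qw(encode);

my $text = 'München';           # text string

# Option 1: encode by hand and write the bytes to a raw handle.
my $manual = '';
open my $raw, '>:raw', \$manual or die "open failed: $!";
print {$raw} encode('UTF-8', $text);
close $raw;

# Option 2: write the text string and let the I/O layer encode it.
my $layered = '';
open my $out, '>:encoding(UTF-8)', \$layered or die "open failed: $!";
print {$out} $text;
close $out;

print $manual eq $layered ? "same bytes\n" : "different bytes\n";
```

The practical rule that follows is to pick one strategy per handle: encoding by hand and also setting an encoding layer on the same handle would encode the data twice and reproduce the garbled 'München' form.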