length in (utf8) characters ? [Perl]

From: Peter J. Holzer on 2 May 2010 08:24

On 2010-04-29 14:54, Peter Billam <peter(a)www.pjb.com.au> wrote:
> Perl should know if it's in a utf environment and printing to a utf8
> device; python does, and so does vi, less, slrn, alpine, firefox and
> everything else I use (except fmt).

vi, less, slrn, and alpine know that they are dealing with a terminal
and can assume that the environment correctly describes properties of
this terminal.

Perl and Python don't know this: a program written in Perl or Python
does not necessarily read from or write to a terminal. Very often it
deals with files. These files are not necessarily text files, and even
if they are, they are not necessarily in the native encoding.

I don't know how Python deals with this. Perl does have an environment
variable (PERL_UNICODE) which can be used to control the default
encoding. I used it a lot during the early days of Perl 5.8.x and can
only say that it causes more trouble than it's worth. Just set the
appropriate encoding *in* your script. You have to think about your I/O
anyway while writing the script.
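
A minimal sketch of what setting the encoding in the script can look
like, assuming the input bytes are UTF-8 (the sample string is made up
for illustration). It also shows why length() gives different answers
before and after decoding:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Bytes as they might arrive from a file or socket: "Käse" encoded as UTF-8.
my $bytes = "K\xC3\xA4se";

# Decode explicitly instead of relying on PERL_UNICODE or the locale.
my $chars = decode('UTF-8', $bytes);

# length() counts bytes on the undecoded string and characters on the
# decoded one.
my $byte_len = length $bytes;
my $char_len = length $chars;

# Declare the output encoding in the script itself.
binmode(STDOUT, ':encoding(UTF-8)');
print "$chars: $byte_len bytes, $char_len characters\n";
```

For files, the same idea applies at open time, e.g.
open(my $fh, '<:encoding(UTF-8)', $path), where $path is whatever file
the script reads.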

hp

From: Peter J. Holzer on 2 May 2010 11:44

On 2010-05-02 12:24, Peter J. Holzer <hjp-usenet2(a)hjp.at> wrote:
> On 2010-04-29 14:54, Peter Billam <peter(a)www.pjb.com.au> wrote:
>> Perl should know if it's in a utf environment and printing to a utf8
>> device; python does, and so does vi, less, slrn, alpine, firefox and
>> everything else I use (except fmt).
>
> vi, less, slrn, and alpine know that they are dealing with a terminal
> and can assume that the environment correctly describes properties of
> this terminal.

Actually, it is much more complicated:

All of these programs deal not only with the terminal, but with "files"
(I put quotes around that because it doesn't matter whether they are
stored on disk or received via a socket or pipe).

So while slrn may, for example, assume that the terminal sends and
expects UTF-8 (because the user told it so by setting LANG), it cannot
just use UTF-8 for decoding Usenet postings. Instead it has to inspect
the headers of each posting to find the Content-Type header and decode
the posting according to its charset parameter (and that still ignores
the complexities of multi-part MIME messages, which slrn can't handle).
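
As a rough illustration of that per-posting step (not how slrn itself
is implemented), here is a sketch in Perl. The helper names charset_of
and decode_body are hypothetical, the US-ASCII default follows the MIME
convention for text without a charset parameter, and the ISO-8859-1
fallback for unknown charset names is purely an assumption of this
sketch:

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Hypothetical helper: pick the charset parameter out of a Content-Type
# header value, falling back to US-ASCII when none is given.
sub charset_of {
    my ($content_type) = @_;
    if (defined $content_type
        && $content_type =~ /;\s*charset\s*=\s*"?([^\s";]+)"?/i) {
        return $1;
    }
    return 'US-ASCII';
}

# Hypothetical helper: decode a message body according to its header.
sub decode_body {
    my ($content_type, $body_bytes) = @_;
    my $charset = charset_of($content_type);
    # Assumption: charset names Encode does not know are treated as
    # ISO-8859-1, because every byte sequence is valid in that encoding.
    my $enc = find_encoding($charset) || find_encoding('ISO-8859-1');
    return $enc->decode($body_bytes);
}

# Byte 0xA4 is the euro sign in ISO-8859-15.
my $text = decode_body('text/plain; charset=ISO-8859-15', "5 \xA4");
```

The terminal encoding never enters into this: it only matters later,
when the decoded text is written to the screen.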

The situation is worse for vi: Unlike mail messages, text files aren't
supposed to be portable between systems. So the user has to tell the
editor, for every file, which charset it uses, unless it is the local
charset or the editor can guess (UTF-8 is pretty easy to detect, but
most of the 8-bit charsets are hard to distinguish).
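
The "UTF-8 is easy to detect" part can be sketched as a strict trial
decode: bytes that are not well-formed UTF-8 make the decode fail, and
only then is a fallback charset used. The helper name decode_guess is
hypothetical, and the fallback charset has to be supplied by the caller
because, as said above, the 8-bit charsets cannot be told apart this
way:

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: decode as UTF-8 if the bytes are well-formed
# UTF-8, otherwise fall back to a caller-supplied 8-bit charset.
# Returns the decoded text and the name of the encoding that was used.
sub decode_guess {
    my ($bytes, $fallback) = @_;
    # FB_CROAK makes decode() die on malformed input; LEAVE_SRC keeps
    # $bytes unmodified so it can be decoded again below.
    my $text = eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC) };
    return ($text, 'UTF-8') if defined $text;
    return (decode($fallback, $bytes), $fallback);
}

my ($t1, $e1) = decode_guess("K\xC3\xA4se", 'ISO-8859-1');  # valid UTF-8
my ($t2, $e2) = decode_guess("K\xE4se",     'ISO-8859-1');  # not valid UTF-8
```

Pure ASCII input also passes the UTF-8 check, which is harmless since
ASCII decodes identically in either case.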

hp
