length in (utf8) characters ? [Perl]


From: Peter Billam on 29 Apr 2010 07:36

I'm confused... in "perldoc length" it says

if the EXPR is in Unicode, you will get the
number of characters, not the number of bytes.

which is what I would want. But (in a one-line demo
of a problem I have in a much larger module):

$> perl -e '$l=length "ö"; print "length=$l\n";'
length=2

But I want to see length=1 here... (in case your news-client
doesn't do utf8, that string was a o-umlaut) I'm using v5.10.1
on debian squeeze and everything else works fine in utf8.

Regards, Peter

--
Peter Billam www.pjb.com.au www.pjb.com.au/comp/contact.html

From: Helmut Richter on 29 Apr 2010 08:59

On Thu, 29 Apr 2010, Peter Billam wrote:

> $> perl -e '$l=length "ö"; print "length=$l\n";'
> length=2
>
> But I want to see length=1 here... (in case your news-client
> doesn't do utf8, that string was a o-umlaut) I'm using v5.10.1
> on debian squeeze and everything else works fine in utf8.

What happens:

What you input is two bytes long, and perl does not know that the two
bytes are meant as one character. perl sees the two characters "Ã¶". If
you output them as Unicode, you will even see them:

perl -e '$l=length "ö"; binmode (STDOUT, ":utf8"); print "length=$l === ö\n";'

yields

length=2 === Ã¶

That is, the binary output of the binary string "ö" was two errors that
compensated each other.

What you mean:

The input file is already to be interpreted as UTF-8. You should tell perl so:

perl -e 'use utf8; $l=length "ö"; print "length=$l\n";'
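
A minimal sketch of the same point that does not depend on how the source file or the terminal is encoded. The literal is written as byte escapes, and decode() from the core Encode module plays the role that "use utf8" plays for a literal in the source. The variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The two bytes that UTF-8 uses for o-umlaut (U+00F6), written as escapes
# so that the example does not depend on the encoding of this file.
my $bytes = "\xC3\xB6";

# Without decoding, perl counts the two bytes as two characters.
my $byte_len = length $bytes;

# After decoding, the string holds the single character U+00F6.
my $chars    = decode('UTF-8', $bytes);
my $char_len = length $chars;

print "bytes=$byte_len chars=$char_len\n";   # prints "bytes=2 chars=1"
```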

--
Helmut Richter

From: Helmut Richter on 29 Apr 2010 09:00

On Thu, 29 Apr 2010, Helmut Richter wrote:

> The input file is already to be interpreted as UTF-8. You should tell perl so:

Better: The source file ...

--
Helmut Richter

From: Peter Billam on 29 Apr 2010 10:54

On 2010-04-29, Helmut Richter <hhr-m(a)web.de> wrote:
> On Thu, 29 Apr 2010, Helmut Richter wrote:
>> The input file is already to be interpreted as UTF-8.
>> You should tell perl so:
>
> Better: The source file ...

But if I tell perl that the source file is in utf8, then though
it gets the length right :-) it can't print the string out :-(

$> perl -e 'use utf8; $s="ö"; $l=length $s; print "length $s =$l\n";'
length =1

( likewise if I use the code-point: '$s="\x{00f6}"; )

OTOH if I don't use "use utf8" then perl prints correctly :-)
but gets the length wrong :-(

$> perl -e '$s="ö"; $l=length $s; print "length $s =$l\n";'
length ö =2

I can't really afford to set the binmode explicitly; the "length"
code and some "print"s are actually in a module, and the strings
are passed to it from some calling program. So when I code the
module I don't know in advance from what program is going to
be calling it, and whether it's printing into a utf environment.
Does the module really have to test every string and inspect
$ENV{LANG} and $ENV{LC_CTYPE} and change binmode accordingly ?
I had been reading perldoc perluniintro:

Starting from Perl 5.8.0, the use of "use utf8" is needed only in
much more restricted circumstances. In earlier releases the "utf8"
pragma was used to declare that operations in the current block or
file would be Unicode-aware. This model was found to be wrong,
or at least clumsy: the "Unicodeness" is now carried with the data,
instead of being attached to the operations.

so why is the "print" wrong, if the "Unicodeness" is carried with
the data ? Perl should know if it's in a utf environment and
printing to a utf8 device; python does, and so does vi, less,
slrn, alpine, firefox and everything else I use (except fmt).

Sorry for being so confused, I realise this must be old stuff :-(
Peter

--
Peter Billam www.pjb.com.au www.pjb.com.au/comp/contact.html

From: Helmut Richter on 29 Apr 2010 12:02

On Thu, 29 Apr 2010, Peter Billam wrote:

> I can't really afford to set the binmode explicitly; the "length"
> code and some "print"s are actually in a module, and the strings
> are passed to it from some calling program. So when I code the
> module I don't know in advance from what program is going to
> be calling it, and whether it's printing into a utf environment.
> Does the module really have to test every string and inspect
> $ENV{LANG} and $ENV{LC_CTYPE} and change binmode accordingly ?
> I had been reading perldoc perluniintro:
>
> Starting from Perl 5.8.0, the use of "use utf8" is needed only in
> much more restricted circumstances. In earlier releases the "utf8"
> pragma was used to declare that operations in the current block or
> file would be Unicode-aware. This model was found to be wrong,
> or at least clumsy: the "Unicodeness" is now carried with the data,
> instead of being attached to the operations.
>
> so why is the "print" wrong, if the "Unicodeness" is carried with
> the data ?

I find the term "Unicodeness" confusing, much more than the distinction of
"character strings" vs. "byte strings" (as in
http://perldoc.perl.org/perlunitut.html). It is *you*, the programmer, who has
to know whether strings are meant as strings of characters or as strings of
bytes. Obviously, your strings are strings of characters. Whether perl stores
them as Unicode or as anything else is not your problem, you cannot know and
you need not know.

Now, when you read from a file or write to a file, it is suddenly important
that you know what encoding is to be used in that file, because the character
strings whose internal encoding you do not know must be constructed from the
bytes in the file (or, on writing, they must be stored as bytes in the file).
As the encoding used in the file cannot be determined reliably from the name or
the contents of the file, it is you who has to tell perl, either by explicitly
decoding/encoding the strings from/to that encoding, or by specifying the
encoding as a layer on open/binmode.
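
Both routes can be sketched as follows, assuming UTF-8 is the encoding wanted on the outside. An in-memory file stands in for a real one so that the example is self-contained; the variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Character string: one character, U+00F6 (o-umlaut).
my $text = "\x{00f6}";

# Explicit encoding: turn characters into UTF-8 bytes before writing.
my $bytes = encode('UTF-8', $text);          # two bytes: C3 B6

# Explicit decoding: turn bytes read from outside back into characters.
my $again = decode('UTF-8', $bytes);         # one character again

# The same conversions done by an I/O layer. $buffer receives the bytes
# that would otherwise end up in a file on disk.
my $buffer = '';
open(my $out, '>:encoding(UTF-8)', \$buffer) or die "open: $!";
print {$out} $text;
close($out) or die "close: $!";

open(my $in, '<:encoding(UTF-8)', \$buffer) or die "open: $!";
my $read_back = <$in>;
close($in);

printf "chars=%d bytes=%d buffer=%d read_back=%d\n",
    length $text, length $bytes, length $buffer, length $read_back;
```

With this split, a module can work purely on character strings and leave the choice of layer to whichever program owns the filehandle.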

This is *also* true for STDIN/STDOUT/STDERR. The open pragma
<http://perldoc.perl.org/open.html> might assist you in selecting the right
layers depending on the locale -- if the locale correctly specifies the encoding,
which is by no means guaranteed (e.g. the encoding may change from one window to
another without being reflected in the locale environment variables). I have
no experience with the open pragma, though, so you have to find your way
through it.
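
For what it is worth, here is a hedged sketch of doing the locale lookup by hand in the calling program rather than in the module. It assumes a POSIX-like system where I18N::Langinfo is available; the helper name locale_encoding_name and the fallback to UTF-8 are inventions of this sketch, not perl defaults. The open pragma documents a ":locale" setting that aims at the same thing.

```perl
use strict;
use warnings;
use POSIX qw(setlocale LC_CTYPE);
use I18N::Langinfo qw(langinfo CODESET);
use Encode qw(find_encoding);

# Ask the C library which encoding the current locale declares and
# return the name of a matching encoding that perl knows about.
# Falls back to UTF-8 (an assumption of this sketch) if there is none.
sub locale_encoding_name {
    setlocale(LC_CTYPE, '');              # adopt LANG / LC_ALL / LC_CTYPE
    my $codeset = langinfo(CODESET);      # e.g. "UTF-8" or "ISO-8859-1"
    my $enc = $codeset ? find_encoding($codeset) : undef;
    return $enc ? $enc->name : 'UTF-8';
}

# The main program, not the module, picks the layer for STDOUT.
my $name = locale_encoding_name();
binmode(STDOUT, ":encoding($name)") or die "binmode: $!";
print "stdout layer: $name; o-umlaut: \x{00f6}\n";
```

As noted above, this is only as good as the locale settings themselves.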

The utf8 pragma has no effect whatsoever on what the program does. It affects
only the interpretation of the bytes in the source code. If your source code
is in UTF-8 and contains "ö", you should use the utf8 pragma if this "ö" means
one character, and you should not use it if it means two bytes (which in turn
will be interpreted as two characters when you (ab)use this byte string in a
context where a character string is needed).
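
A small sketch of that (ab)use, with the bytes written as escapes so that the result does not depend on the encoding of the source file. The smiley U+263A is only there to provide a genuine character string; all names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode);

# A byte string: the UTF-8 bytes of o-umlaut, never decoded.
my $bytes = "\xC3\xB6";

# A genuine character string containing one character above U+00FF.
my $smiley = "\x{263A}";

# Joining them makes perl treat each byte as a character of its own:
# U+00C3 (A-tilde) and U+00B6 (pilcrow), followed by the smiley.
my $mixed = $bytes . $smiley;

# Encoding the result for output shows the damage: the two "characters"
# become two bytes each, so the o-umlaut is now four bytes of mojibake.
my $out = encode('UTF-8', $mixed);

# prints "mixed: 3 characters, encoded: 7 bytes"
printf "mixed: %d characters, encoded: %d bytes\n", length $mixed, length $out;
```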

> Perl should know if it's in a utf environment and
> printing to a utf8 device; python does, and so does vi, less,
> slrn, alpine, firefox and everything else I use (except fmt).

Whether it is a good choice for perl not to guess the encoding unless told is a
matter of opinion. It can be tedious in environments where the same encoding is
used everywhere, including all files and all databases, but it can save your
application if this requirement is not met.

I hope that was of some help.

--
Helmut Richter
