length in (utf8) characters ? [Perl]


From: Peter Billam on 29 Apr 2010 12:54

On 2010-04-29, Helmut Richter <hhr-m(a)web.de> wrote:
> Now, when you read from a file or write to a file, it is suddenly
> important that you know what encoding is to be used in that file, ...
> ... it is you who has to tell perl, ...
> This is *also* true for STDIN/STDOUT/STDERR. The open pragma
> <http://perldoc.perl.org/open.html> might assist you in selecting
> the right layers depending on the locale -- if the locale correctly
> specifies the code which is by no means guaranteed ...
>
> I hope that was of some help.
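
A minimal sketch of the point about input layers, assuming a file that holds the two UTF-8 bytes for o-umlaut followed by a newline. An in-memory filehandle stands in for the real file, and the variable names are illustrative only, not taken from the thread.

```perl
use strict;
use warnings;

# Assumption: these three bytes are what a UTF-8 file holding "o-umlaut\n" contains.
my $file_bytes = "\xC3\xB6\n";

# No decoding layer: perl hands back the bytes exactly as stored.
open my $raw_fh, '<:raw', \$file_bytes or die "cannot open: $!";
my $as_bytes = <$raw_fh>;
chomp $as_bytes;
close $raw_fh;

# With a decoding layer: perl turns the two bytes into one character.
open my $utf8_fh, '<:encoding(UTF-8)', \$file_bytes or die "cannot open: $!";
my $as_chars = <$utf8_fh>;
chomp $as_chars;
close $utf8_fh;

printf "without layer: length=%d\n", length $as_bytes;   # 2
printf "with layer:    length=%d\n", length $as_chars;   # 1
```

The same file therefore yields a different length() depending only on the layer named when it was opened.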

Thank you Helmut, for explaining so clearly. It also confirms what
I was beginning to work out for myself. So now back to the code...

Thanks for your help,
Peter

--
Peter Billam www.pjb.com.au www.pjb.com.au/comp/contact.html

From: Steve C on 29 Apr 2010 13:10

Peter Billam wrote:
> I'm confused... in "perldoc length" it says
>
> if the EXPR is in Unicode, you will get the
> number of characters, not the number of bytes.
>
> which is what I would want. But (in a one-line demo
> of a problem I have in a much larger module):
>
> $> perl -e '$l=length "ö"; print "length=$l\n";'
> length=2
>
> But I want to see length=1 here... (in case your news-client
> doesn't do utf8, that string was a o-umlaut) I'm using v5.10.1
> on debian squeeze and everything else works fine in utf8.
>

When I paste the character in my newsreader, I am using ISO-8859-1, not UTF-8.
This works fine:

5.8.8 gives:
perl -e '$s="ö"; $l=length $s; print "length $s =$l\n";'
length ö =1

5.10.0 gives:
perl -e '$s="ö"; $l=length $s; print "length $s =$l\n";'
length ö =1
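
A sketch of why the two posters can see different results for the same character: in ISO-8859-1 the o-umlaut is one byte, in UTF-8 it is two, and without decoding perl counts bytes. This uses the core Encode module; the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# U+00F6 (o-umlaut), written as an escape so this source stays pure ASCII.
my $char = "\x{00f6}";

my $latin1_bytes = encode('ISO-8859-1', $char);   # one byte:  F6
my $utf8_bytes   = encode('UTF-8',      $char);   # two bytes: C3 B6

printf "ISO-8859-1 bytes: length=%d\n", length $latin1_bytes;   # 1
printf "UTF-8 bytes:      length=%d\n", length $utf8_bytes;     # 2

# Decoding the UTF-8 bytes turns them back into a single character.
my $decoded = decode('UTF-8', $utf8_bytes);
printf "decoded UTF-8:    length=%d\n", length $decoded;        # 1
```

For a literal typed into a source file or one-liner that is itself UTF-8, the "use utf8;" pragma (or -Mutf8 on the command line) asks perl to decode string literals, which is the usual way to get length=1 there.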

From: sln on 29 Apr 2010 13:13

On 29 Apr 2010 11:36:03 GMT, Peter Billam <peter(a)www.pjb.com.au> wrote:

>I'm confused... in "perldoc length" it says
>
> if the EXPR is in Unicode, you will get the
> number of characters, not the number of bytes.
>
>which is what I would want. But (in a one-line demo
>of a problem I have in a much larger module):
>
>$> perl -e '$l=length "ö"; print "length=$l\n";'
>length=2
>
>But I want to see length=1 here... (in case your news-client
>doesn't do utf8, that string was a o-umlaut) I'm using v5.10.1
>on debian squeeze and everything else works fine in utf8.
>

I don't see that result, but that could be because I'm on Windows.
Check your default PerlIO layers.

c:\temp>perl -e "print qq(length ö = ),length('ö'),qq(\n);"
length ÷ = 1

c:\temp>perl -e "print qq(length \x{00f6} = ),length(qq(\x{00f6})),qq(\n);"
length ÷ = 1

c:\temp>

-sln

From: Dilbert on 29 Apr 2010 16:15

On 29 Apr, 19:13, s...(a)netherlands.com wrote:
> c:\temp>perl -e "print qq(length ö = ),length('ö'),qq(\n);"
> length ÷ = 1
>
> c:\temp>perl -e "print qq(length \x{00f6} = ),length(qq(\x{00f6})),qq(\n);"
> length ÷ = 1

Your "ö" character is displayed as "÷" (same with me under Windows).
Your codepage (chcp) seems to be set to 850 (same with me under
Windows).

C:\>chcp
Page de codes active : 850

C:\>perl -e"print qq<ö\x{00f6}>"
÷÷

Let's set the code page to 1252:

C:\>chcp 1252
Page de codes active : 1252

C:\>perl -e"print qq<ö\x{00f6}>"
öö

That's better: the "ö" characters are now printed as "ö".
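
A sketch of what the code page switch changes, using the core Encode module: perl sends the same byte 0xF6 in both cases, and only the console's interpretation of that byte differs. The variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The single byte perl emits for "\x{00f6}" when no encoding layer is set.
my $byte = "\xF6";

my %shown;
for my $codepage (qw(cp850 cp1252)) {
    # Interpret the byte the way a console using that code page would.
    $shown{$codepage} = ord decode($codepage, $byte);
    printf "%-6s shows byte 0xF6 as U+%04X\n", $codepage, $shown{$codepage};
}
# cp850  shows byte 0xF6 as U+00F7  (division sign)
# cp1252 shows byte 0xF6 as U+00F6  (o-umlaut)
```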

From: Peter Billam on 29 Apr 2010 21:16

On 2010-04-29, Helmut Richter <hhr-m(a)web.de> wrote:
>
> On Thu, 29 Apr 2010, Peter Billam wrote:
>> the "length"
>> code and some "print"s are actually in a module, and the strings
>> are passed to it from some calling program. So when I code the
>> module I don't know in advance what program is going to
>> be calling it, and whether it's printing into a utf environment.
>
> The open pragma <http://perldoc.perl.org/open.html> might assist you
> in selecting the right layers depending on the locale -- if the locale
> correctly specifies the code which is by no means guaranteed (e.g. the
> code may change from one window to another without being reflected
> in the locale environment variables). I have no experience with
> the open pragma, though, so you have to find your way through it.

By experiment, it seems that
    use open ':locale';
(unlike most (all?) other pragmas) propagates out-of-scope from a
calling script into the use'd module. So I think the intent is that
the module code should ignore the whole problem, and the script that
use's it should invoke the open pragma or suffer the consequences.
All this needs 5.8.6.
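
A sketch of one likely reason for that observation, as far as the open pragma documentation describes it: the default layers for newly opened files are lexically scoped, but ":std" (which ":locale" is documented to turn on implicitly) changes the global STDIN/STDOUT/STDERR handles, and every package shares those. The package My::Report is hypothetical, and the encoding is named explicitly here instead of using ":locale" so the result does not depend on the environment.

```perl
use strict;
use warnings;

{
    # Hypothetical module, inlined so the sketch fits in one file.
    package My::Report;

    # Reports the output layers this module sees on the global STDOUT.
    sub stdout_layers {
        return PerlIO::get_layers(*STDOUT, output => 1);
    }
}

# Calling script: choose the encoding once, standard handles included.
# (':locale' would take it from the environment instead of naming it.)
use open qw(:std :encoding(UTF-8));

my @layers = My::Report::stdout_layers();
print "layers seen inside the module: @layers\n";
```

The module never mentions an encoding, yet the layer list it sees includes the one the script asked for.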

I can't see an easy way for the use'd module to find out what
its default encoding currently is, except maybe
    my @layers = PerlIO::get_layers(STDOUT);
and then intelligently inspect the last layer or two. So AFAICS,
if the use'd module has to change a binmode, then to restore it
it will be best to close the file and re-open. I could be wrong...
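
A sketch of the layer inspection mentioned above, assuming the module only needs to know whether a handle expects characters or bytes. PerlIO::get_layers is the real built-in; the helper handle_wants_characters and the in-memory handles are hypothetical stand-ins for whatever the caller provides.

```perl
use strict;
use warnings;

# Hypothetical helper: true when the handle's output side carries the
# utf8 flag, i.e. it expects character strings rather than raw bytes.
sub handle_wants_characters {
    my ($fh) = @_;
    my @layers = PerlIO::get_layers($fh, output => 1);
    return scalar(grep { $_ eq 'utf8' } @layers);
}

# Two in-memory handles stand in for handles passed by a calling program.
open my $byte_fh, '>:raw', \my $byte_buffer or die "cannot open: $!";
open my $char_fh, '>:encoding(UTF-8)', \my $char_buffer or die "cannot open: $!";

printf "raw handle expects:   %s\n",
    handle_wants_characters($byte_fh) ? 'characters' : 'bytes';
printf "UTF-8 handle expects: %s\n",
    handle_wants_characters($char_fh) ? 'characters' : 'bytes';
```

Reading the layers this way leaves the handle untouched, so nothing has to be restored afterwards.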

Thanks again for your help, I'm on my way now,
Peter

--
Peter Billam www.pjb.com.au www.pjb.com.au/comp/contact.html
