From: A. Farber on
Hello,

I have a Russian card game at
http://apps.facebook.com/video-preferans/
which I've recently moved from using urlencoded data
to XML data in UTF-8. Since then it often hangs
for the users, and I suspect that my subroutine:

sub enqueue {
    my $child    = shift;
    my $data     = shift;
    my $fh       = $child->{FH};
    my $response = $child->{RESPONSE};

    # flash.net.Socket.readUTF() expects 16-bit prefix in network order
    my $prefix = pack 'n', length $data;

    # append to the end of the outgoing queue
    push @{$response}, $prefix . $data;
}

packs the wrong number of bytes for Cyrillic messages.

I'm using perl v5.10.0 on OpenBSD 4.5 and
"perldoc -tf length" suggests using
length(Encoding::encode_utf8(EXPR))

But when I put the line:

use Encode::Encoding;
....
my $prefix = pack 'n', length(Encoding::encode_utf8($data));

then it borks with

Undefined subroutine &Encoding::encode_utf8 called at Child.pm line 229.

Any help please?

I should also mention that when users chat
in Russian, my server just passes their Cyrillic
messages around (with sysread - poll - syswrite).

But for the Cyrillic words in my own program (I "use utf8;")
I have to call utf8::encode($cyrillic_word) before I can
write them out with syswrite, or it dies ("wide char").

I've tried moving utf8::encode($data) into the
enqueue subroutine above, but it doesn't allow me to
(maybe because parts of $data are not utf8??)

Regards
Alex



From: sln on
On Wed, 17 Feb 2010 10:28:59 -0800 (PST), "A. Farber" <alexander.farber(a)gmail.com> wrote:

>Hello,
>
>I have a Russian card game at
>http://apps.facebook.com/video-preferans/
>which I've recently moved from using urlencoded data
>to XML data in UTF-8. Since then it often hangs
>for the users, and I suspect that my subroutine:
>
>sub enqueue {
> my $child = shift;
> my $data = shift;
> my $fh = $child->{FH};
> my $response = $child->{RESPONSE};
>
> # flash.net.Socket.readUTF() expects 16-bit prefix in network order
> my $prefix = pack 'n', length $data;
>
> # append to the end of the outgoing queue
> push @{$response}, $prefix . $data;
>}
>
>packs the wrong number of bytes for Cyrillic messages.
>
If '$data' is still a Perl character string,
I would encode() it to UTF-8 octets first, then
 push @outarray, pack ('n a*', length($octets), $octets);
You could do it a couple of different ways, but basically
you want the length of the encoded data, not the length
of the Perl string (which is counted in characters).

You really don't want to push '$prefix . $data' if $data is
not yet encoded as UTF-8. If it is already encoded as UTF-8, then
the length would be correct because it's already bytes (octets),
not characters.
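
Applied to the enqueue sub quoted above, that advice might look like the sketch below. It assumes $data always arrives as a Perl character string that has not been encoded yet. The check against 0xFFFF is an addition of mine, not something from the original code: 'n' is a 16-bit field, so it cannot describe a longer message.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Sketch of enqueue() with the length taken from the encoded octets.
# Assumption: $data is a Perl character string, not yet encoded.
sub enqueue {
    my ($child, $data) = @_;
    my $response = $child->{RESPONSE};

    # Encode first, so that length() below counts bytes, not characters.
    my $octets = encode('UTF-8', $data);

    # A 16-bit prefix cannot describe more than 65535 bytes (added guard).
    die "message too long for 16-bit prefix\n" if length($octets) > 0xFFFF;

    # 'n' = unsigned 16-bit big-endian length, 'a*' = the raw octets.
    push @{$response}, pack('n a*', length($octets), $octets);
}

# Usage: four Cyrillic letters are 4 characters but 8 UTF-8 bytes.
my $child = { RESPONSE => [] };
enqueue($child, "\x{43f}\x{440}\x{435}\x{444}");
my ($len, $body) = unpack('n a*', $child->{RESPONSE}[0]);
print "prefix = $len, body bytes = ", length($body), "\n";
```

With the original 'length $data' the prefix here would have been 4 while 8 bytes went out on the wire, which is the kind of mismatch that could leave a reader waiting or out of step.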

You should read the Unicode docs: perluniintro, perlunicode, unicode, etc.
Each has links that take you to the others.

Below are some examples of a couple of ways to do it. See what works
for you.

-sln

----------------------
use strict;
use warnings;
use Encode;

binmode (STDOUT, ':encoding(UTF-8)');

##
my $perlstring = "This is a string <\x{2100}>...";
my $utf8octets = encode('UTF-8', $perlstring);
my $packd_string = pack('n', length($utf8octets));
my $unpackd_string = unpack('n', $packd_string);
print "** Perl string : '$perlstring', length = ", length($perlstring),"\n\n";
print "UTF-8 octets: '$utf8octets', length = ", length($utf8octets),"\n\n";
print "Packed length of encoded string is $unpackd_string\n\n";

##
my $len_plus_octets = $packd_string . $utf8octets;
print "Length.UTF-8 octets: '$len_plus_octets'\n\n";

##
my $packd_all = pack ('n a*', length($utf8octets), $utf8octets);
print "Packed all : '$packd_all', length = ",length($packd_all),"\n\n";

##
my ($len,$octets) = unpack ('n a*', $packd_all);
print "Unpacked all : '$octets', length = ",length($octets),"\n";
print " : read packed length = $len\n\n";
my $decoded_string = decode('UTF-8', $octets);
print "** Perl string : '$decoded_string', length = ", length($decoded_string), "\n\n";
if ($decoded_string eq $perlstring) {
    print "** Perl strings are equal.\n";
}
else {
    print "** Perl strings are not equal.\n";
}
__END__
** Perl string : 'This is a string <℀>...', length = 23

UTF-8 octets: 'This is a string <+�-�-�>...', length = 25

Packed length of encoded string is 25

Length.UTF-8 octets: ' ?This is a string <+�-�-�>...'

Packed all : ' ?This is a string <+�-�-�>...', length = 27

Unpacked all : 'This is a string <+�-�-�>...', length = 25
: read packed length = 25

** Perl string : 'This is a string <℀>...', length = 23

** Perl strings are equal.
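
A note on the "UTF-8 octets" lines in the output above: they look odd because already-encoded octets were printed through the :encoding(UTF-8) layer, which encodes them a second time. The same thing happens when encoded bytes (such as chat messages read with sysread) are concatenated with character strings and then encoded. The sketch below is only an illustration of that effect with made-up variable names, not code from the server in question.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $chars  = "\x{43f}\x{440}";          # character string, 2 characters
my $octets = encode('UTF-8', $chars);   # byte string, 4 bytes

# Mixing the two: each octet is treated as one Latin-1 character,
# so encoding the result encodes those bytes a second time.
my $mixed = $chars . $octets;           # 2 + 4 = 6 "characters"
my $wrong = encode('UTF-8', $mixed);    # 4 + 8 = 12 bytes

# Decoding at input keeps everything in characters until output.
my $clean = $chars . decode('UTF-8', $octets);   # 4 characters
my $right = encode('UTF-8', $clean);             # 8 bytes

print length($wrong), " vs ", length($right), "\n";
```

One way to avoid the mix, if it fits the program, is to decode everything on the way in and encode exactly once on the way out, right before the length prefix is computed.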


From: A. Farber on
Thank you! I've ended up with encode($data) and after that the
length() gives me the number of bytes for the syswrite (I hope)
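
For anyone finding this thread later, here is a small sketch of what went wrong with the first attempt. The function is in the Encode package, so it is Encode::encode_utf8 rather than Encoding::encode_utf8, and "use Encode::Encoding;" loads the base class for encoding implementations instead of the module that defines it. The sample string is my own; the rest uses only Encode and the built-in utf8::encode.

```perl
use strict;
use warnings;
use Encode ();    # load the module, import nothing

# "privet" in Cyrillic, written with escapes: 6 characters.
my $data = "\x{43f}\x{440}\x{438}\x{432}\x{435}\x{442}";

# Fully qualified call: the package is Encode, not Encoding.
my $octets = Encode::encode_utf8($data);

# utf8::encode() performs the same conversion in place, so work on a copy.
my $copy = $data;
utf8::encode($copy);

printf "%d characters, %d bytes, same octets: %s\n",
    length($data), length($octets), ($octets eq $copy ? "yes" : "no");
```

Note that the general encode() from Encode takes the encoding name as its first argument, as in encode('UTF-8', $data), while encode_utf8($data) takes only the string.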