From: rebelde on

Hello,

We are porting a UTF-8-ready application from Linux to SunOS 5.9 and
have run into the following puzzling problem. After a lot of digging I was
able to reduce it to the following snippet of C code:

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <locale.h>

main()
{

char *asc = "a";
char *utf = "\303\204"; /* UTF-8 encoding of the German letter Ä (A with dots) */

char buf[80];

setlocale(LC_ALL, "");

sprintf(buf, "%-*.*s", 16, 16, asc);
printf("strlen of buf with ascii char %d\n", (int)strlen(buf)); /* strlen returns size_t */
printf("[%s]\n", buf);


sprintf(buf, "%-*.*s", 16, 16, utf);
printf("strlen of buf utf char %d\n", (int)strlen(buf)); /* strlen returns size_t */
printf("[%s]\n", buf);

exit(0);
}

If you compile and run it, you will see that in some environments the
resulting string is not 16 bytes long, as expected, but 17:

$ ./a.out
strlen of buf with ascii char 16
[a               ]
strlen of buf utf char 16
[Ä              ]
$ LC_ALL="" ./a.out
strlen of buf with ascii char 16
[a               ]
strlen of buf utf char 16
[Ä              ]
$ LC_ALL=de_DE.UTF-8 ./a.out
strlen of buf with ascii char 16
[a               ]
strlen of buf utf char 17 <*******************************************
[Ä               ]
$ LC_ALL=de_DE.UTF-8 ./a.out | od -t x1
0000000 73 74 72 6c 65 6e 20 6f 66 20 62 75 66 20 77 69
0000020 74 68 20 61 73 63 69 69 20 63 68 61 72 20 31 36
0000040 0a 5b 61 20 20 20 20 20 20 20 20 20 20 20 20 20
0000060 20 20 5d 0a 73 74 72 6c 65 6e 20 6f 66 20 62 75
0000100 66 20 75 74 66 20 63 68 61 72 20 31 37 0a 5b c3
0000120 84 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
0000140 5d 0a
0000142

i.e. the problem shows up when the source buffer contains a 2-byte UTF-8
character and you
1) have LC_ALL=de_DE.UTF-8 in the environment *and*
2) call setlocale(LC_ALL, "") inside the program, so that this locale is
   actually picked up.

You can also see that the output is the plain 2-byte UTF-8 encoding of the
German letter Ä, followed by 15 blanks, which gives 17 bytes in the case
of "Ä" (and 16 in the case of "a").
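
The arithmetic behind the 17 can be reproduced without any locale support. The following is only a sketch under the assumption that the library pads the field to 16 *characters* instead of 16 bytes; the helper names `utf8_chars` and `bytes_if_padded_by_chars` are made up for illustration, and strings longer than the field width (truncation) are not modelled:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count UTF-8 code points by skipping continuation bytes (10xxxxxx).
 * Assumes s is valid UTF-8; illustrative helper, not part of libc. */
static size_t utf8_chars(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}

/* Number of bytes that result when a string is padded with blanks
 * up to `width` characters rather than up to `width` bytes. */
static size_t bytes_if_padded_by_chars(const char *s, size_t width)
{
    size_t chars = utf8_chars(s);
    size_t pad = chars < width ? width - chars : 0;

    return strlen(s) + pad;
}
```

For "\303\204" and a width of 16 this is 2 bytes plus 15 blanks, i.e. 17; for "a" it is 1 byte plus 15 blanks, i.e. 16, which matches the od dump.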

The behaviour is the same on SunOS 5.9 and SunOS 5.10, but it does not
occur on FreeBSD 8.x or on Linux SLES10.

The man page of setlocale(3C) does not mention any influence of the
locale settings on sprintf(3C), only on functions where one would expect
it, like strftime, ctype, ...

What does this mean? Is this a bug? IMHO sprintf(3C) should
just add bytes to a buffer as described in its format string and should
count a 2-byte string (the UTF-8 Ä) as two bytes, regardless of what the
two bytes mean, and should fill the rest of the buffer with (in our case)
14 blanks.

Any idea or any pointer to an explanation?

Thanks in advance

Matthias
--
http://www.unixarea.de/
From: rebelde on
Drazen Kacar wrote:

> With your example program on my system (Solaris 10):
>
> {morrigan}~/trash> cc loc.c
> "loc.c", line 7: warning: old-style declaration or incorrect type for:
> main
> {morrigan}~/trash> LC_ALL=en_US.UTF-8 ./a.out
> strlen of buf with ascii char 16
> [a               ]
> strlen of buf utf char 17
> [Ä               ]
> {morrigan}~/trash> c99 loc.c
> "loc.c", line 7: warning: old-style declaration or incorrect type for:
> main
> {morrigan}~/trash> LC_ALL=en_US.UTF-8 ./a.out
> strlen of buf with ascii char 16
> [a               ]
> strlen of buf utf char 16
> [Ä              ]
>
>> what does this mean? is this a bug?
>
> Take a look at standards(5) and define your compilation environment to
> better suit your needs. (Invoking c99 is just the simplest way to get
> a standards-conforming environment. It's not necessarily the best for
> your needs.)
>

Hello Drazen,

Thanks for your reply and hints. I had looked at standards(5) before, but it
was not really clear to me what was meant by 'columns of screen display';
now I understand the idea. In our case, the result of sprintf(3C) is stored
in database columns and needs to be exactly that number of bytes, not
something longer.
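
When the field has to be an exact number of bytes, one option that does not depend on the locale or the compilation environment is to do the padding by hand. A minimal sketch, assuming left-justified, blank-padded fields as in "%-16.16s"; `pad_bytes` is a hypothetical helper name, not something from libc:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy src into dst as a left-justified field of exactly `width` bytes,
 * blank-padded or truncated, then NUL-terminate it.
 * dst must have room for width + 1 bytes.
 * Only memcpy/memset/strlen are used, so the locale plays no role. */
static void pad_bytes(char *dst, const char *src, size_t width)
{
    size_t len;

    assert(dst != NULL && src != NULL);
    len = strlen(src);
    if (len > width)
        len = width;    /* caution: may cut a multibyte character in half */
    memcpy(dst, src, len);
    memset(dst + len, ' ', width - len);
    dst[width] = '\0';
}
```

Called as `pad_bytes(buf, utf, 16)`, this always leaves 16 bytes in buf. The price is that truncation is also done on bytes, so a string longer than the field could end in an incomplete UTF-8 sequence.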

We're using gcc:

$ gcc --version
gcc (GCC) 3.4.6
....

which does not know the -xc99 flag:

$ gcc -xc99 str.c
gcc: language c99 not recognized

Will check what would be the best way to solve this...
Thanks again

Matthias
--
Matthias Apitz
t +49-89-61308 351 - f +49-89-61308 399 - m +49-170-4527211
e <guru(a)unixarea.de> - w http://www.unixarea.de/
Solidarity with the zionistic pirates of Israel? Not in my name!
¿Solidaridad con los piratas sionistas de Israel? ¡No en mi nombre!

From: Paul Floyd on
On Tue, 01 Jun 2010 15:53:58 +0200, rebelde <guru(a)unixarea.de> wrote:
> number of bytes, rather than something longer.
>
> We're using a gcc

If you want standards conformance, Sun Studio is better.

> $ gcc --version
> gcc (GCC) 3.4.6
> ...

gcc -std=c99 -pedantic is the equivalent.

A bientot
Paul
--
Paul Floyd http://paulf.free.fr
From: rebelde on
Paul Floyd wrote:

>> $ gcc --version
>> gcc (GCC) 3.4.6
>> ...
>
> gcc -std=c99 -pedantic is the equivalent.
>

But that also gives 17 bytes for %-16.16s in the case of a UTF-8 character:

$ gcc -std=c99 -pedantic str.c
str.c:8: warning: return type defaults to `int'
$ LC_ALL="" ./a.out
strlen of buf with ascii char 16
[a               ]
strlen of buf utf char 16
[Ä              ]
$ LC_ALL=de_DE.UTF-8 ./a.out
strlen of buf with ascii char 16
[a               ]
strlen of buf utf char 17
[Ä               ]
$
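
Since the result evidently depends on how the binary is built and on the active locale, a small runtime probe can tell which behaviour a given build has. This is only a sketch; `printf_pads_by_bytes` is a made-up name, and it tests nothing beyond the one "%-16.16s" case from this thread:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Returns 1 if snprintf pads "%-16.16s" to exactly 16 bytes for a
 * 2-byte UTF-8 argument in the currently active locale, 0 otherwise.
 * The caller is expected to have called setlocale() beforehand if the
 * environment's locale should be in effect. */
static int printf_pads_by_bytes(void)
{
    char buf[80];
    int n = snprintf(buf, sizeof buf, "%-16.16s", "\303\204");

    assert(n >= 0 && (size_t)n < sizeof buf);
    return strlen(buf) == 16;
}
```

An application that stores fixed-width byte fields could call this once at startup, after setlocale(LC_ALL, ""), and refuse to run (or switch to manual padding) if it returns 0.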

matthias


From: Paul Floyd on
On Wed, 02 Jun 2010 10:17:30 +0200, rebelde <guru(a)unixarea.de> wrote:
> Paul Floyd wrote:
>
>>> $ gcc --version
>>> gcc (GCC) 3.4.6
>>> ...
>>
>> gcc -std=c99 -pedantic is the equivalent.
>>
>
> But that also gives 17 bytes for %-16.16s in the case of a UTF-8 character:

OK, so GCC isn't conforming to the standards.

A bientot
Paul
--
Paul Floyd http://paulf.free.fr