From: Marcel Bruinsma on
On Friday, 25 September 2009 01:06, Bill Marcum wrote:

> On 2009-09-24, Marcel Bruinsma <mb(a)nomail.afraid.org> wrote:
>
>> No, the default CTYPE for de is ISO-8859-1.
>
> CP1252 is a superset of ISO-8859-1. The accented letters are the
> same. CP1252 has additional punctuation marks and copyright and
> trademark symbols, among other things (code values 128-159 which
> are undefined in the ISO-8859-* character sets.)

Exactly. Amongst those 'other things' are the frequently used
quotation marks (U+2018..U+201A and U+201C..U+201E):

→ printf '“„”\n' | iconv -tlatin1 | iconv -flatin1
iconv: illegal input sequence at position 0
→ printf '“„”\n' | iconv -tcp1252 | iconv -fcp1252
“„”

--
printf -v email $(echo \ 155 141 162 143 145 154 142 162 165 151 \
156 163 155 141 100 171 141 150 157 157 056 143 157 155|tr \ \\\\)
# Live every life as if it were your last! #
From: syd_p on
On 25 Sep, 03:31, Marcel Bruinsma <m...(a)nomail.afraid.org> wrote:
> Exactly. Amongst those ‘other things’ are the frequently used
> quotation marks (U+2018..U+201A and U+201C..U+201E):
>
> → printf '“„”\n' | iconv -tlatin1 | iconv -flatin1
> iconv: illegal input sequence at position 0
> → printf '“„”\n' | iconv -tcp1252 | iconv -fcp1252
> “„”
Aha! I get the output below.
I'm not quite sure how you did the printf above, though.
I'm also not sure what to set: should I set LANG and LC_ALL to en_US
first and check that out, and then set them to en_US.CP1252?
I did not originally set up the box (actually there are 6 or 8 of
them), but I think LANG=C was chosen because there was a problem with
LANG=en_US.
I have to be careful here, because I guess I have to reboot to test.
Thanks a lot for the help - I am getting there!


$ locale -m | grep '^CP'
CP10007
CP1125
CP1250
CP1251
CP1252
CP1253
CP1254
CP1255
CP1256
CP1257
CP1258
CP737
CP775
CP949
$ locale
LANG=C
LC_CTYPE="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_COLLATE="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_PAPER="C"
LC_NAME="C"
LC_ADDRESS="C"
LC_TELEPHONE="C"
LC_MEASUREMENT="C"
LC_IDENTIFICATION="C"
LC_ALL=
From: Marcel Bruinsma on
On Sunday, 27 September 2009 23:40, syd_p wrote:

>> → printf '“„”\n' | iconv -tlatin1 | iconv -flatin1
>> iconv: illegal input sequence at position 0
>> → printf '“„”\n' | iconv -tcp1252 | iconv -fcp1252
>> “„”
>
> Not quite sure how you did the printf above tho.

The three quotes above are actually encoded in UTF-8,
because that is what my terminal understands.

The first iconv on the second printf line converts from
UTF-8 (my default in LANG) to CP1252 and doesn't
report an error, meaning that those characters are
valid in CP1252 encoding. The second iconv does the
inverse : translate from CP1252 to UTF-8, and the
result is the original string.

The first printf passes the same UTF-8 encoded quotes
to iconv, but asks to convert to latin1 (ISO-8859-1), and
this time iconv says "illegal input sequence", because
these quotes do not exist in latin1.
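
The explanation above can be checked without relying on the terminal's encoding. What follows is only a sketch, assuming glibc's iconv and a POSIX shell; the variable names are made up for illustration, and the three quotes are written as the octal values of their UTF-8 bytes.

```shell
#!/bin/sh
# Sketch only: assumes glibc iconv and a POSIX printf.
# UTF-8 bytes of “ (U+201C), „ (U+201E) and ” (U+201D) as octal escapes,
# so the script does not depend on the terminal's own encoding.
quotes='\342\200\234\342\200\236\342\200\235'

original=$(printf "$quotes")

# UTF-8 -> CP1252 -> UTF-8: should give back the original string.
roundtrip=$(printf "$quotes" | iconv -f UTF-8 -t CP1252 | iconv -f CP1252 -t UTF-8)

# The intermediate CP1252 form as octal byte values (one byte per quote).
cp1252_bytes=$(printf "$quotes" | iconv -f UTF-8 -t CP1252 | od -An -t o1 | tr -d ' \n')

# UTF-8 -> ISO-8859-1: iconv should refuse, since latin1 has no such quotes.
if printf "$quotes" | iconv -f UTF-8 -t ISO-8859-1 >/dev/null 2>&1; then
    latin1_status=converted
else
    latin1_status=rejected
fi

[ "$roundtrip" = "$original" ] && echo "CP1252 round trip: ok"
echo "CP1252 bytes (octal): $cp1252_bytes"
echo "ISO-8859-1: $latin1_status"
```

The CP1252 bytes all fall in the 128-159 range (octal 200-237), which is exactly the range ISO-8859-1 leaves without printable characters.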

> And not quite sure what to set: say, LANG
> and LC_ALL to en_US first and check that out?

Try:

LANG=en_US.CP1252 locale
LANG=en_US.ISO-8859-15 locale
LANG=en_US.ISO-8859-1 locale
LANG=en_US.UTF-8 locale

and see if any of these does *not* produce an error
like this:

$ LANG=en_US.FOO locale
locale: Cannot set LC_CTYPE to default locale: No such file or directory
locale: Cannot set LC_MESSAGES to default locale: No such file or directory
locale: Cannot set LC_ALL to default locale: No such file or directory

Obviously, character encoding FOO doesn't exist.
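
The same test can be run in a loop. This is a rough sketch, assuming glibc's locale(1), which prints "Cannot set ..." on stderr for a locale that is not installed; check_locale is a hypothetical helper name, not an existing command.

```shell
#!/bin/sh
# Sketch only: relies on glibc's locale(1) printing "Cannot set ..." on
# stderr when a locale is not installed. check_locale is a made-up helper.
check_locale() {
    # Pipe stderr to grep, drop stdout; clear LC_ALL so that LANG counts.
    if LC_ALL= LANG="$1" locale 2>&1 >/dev/null | grep -q 'Cannot set'; then
        return 1
    fi
    return 0
}

for enc in CP1252 ISO-8859-15 ISO-8859-1 UTF-8 FOO; do
    if check_locale "en_US.$enc"; then
        echo "en_US.$enc -> Good"
    else
        echo "en_US.$enc -> Bad"
    fi
done
```

Which of the real encodings come out as Good depends on which locales are installed on the box; only the FOO line is expected to be Bad everywhere.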

> I did not originally set up the box (actually there are 6 or 8
> of them) but I think that LANG=C was done cos there was
> a problem with LANG=en_US.

Anything is possible, but CentOS 3.8 isn't that old.

In your OP you write :
« However there are some special characters (u with 2 dots
» overhead, for example) in the data which appear as ? in
» the linux file created. »

Is that a normal question mark, or is it inverse (white in
a black hexagon or square), like this : �

In the latter case, all you would have to do is convert the
output from the db application with 'iconv -fcp1252 -tutf8'.
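
For that latter case, the conversion might look like the sketch below. The file names and the sample text are invented for illustration; the printf line only stands in for whatever the db application actually writes, using the CP1252 byte for ü (octal 374).

```shell
#!/bin/sh
# Sketch only: file names and sample text are invented for illustration.
workdir=$(mktemp -d)

# Stand-in for the db application's output: "München" with ü stored as the
# single CP1252 byte, octal 374.
printf 'M\374nchen\n' > "$workdir/export.cp1252.txt"

# The conversion itself: read CP1252, write UTF-8.
iconv -f CP1252 -t UTF-8 "$workdir/export.cp1252.txt" > "$workdir/export.utf8.txt"

converted=$(cat "$workdir/export.utf8.txt")

# Dump the result byte by byte; ü should now be the two bytes 303 274.
od -An -c "$workdir/export.utf8.txt"
```

If the file already contains literal question marks (byte 077), the original character was lost before the file was written and no iconv call can bring it back.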

From: syd_p on
On 28 Sep, 01:04, Marcel Bruinsma <m...(a)nomail.afraid.org> wrote:
> Is that a normal question mark, or is it inverse (white in
> a black hexagon or square), like this : �
>
> In the latter case, all you would have to do is convert the
> output from the db application with 'iconv -fcp1252 -tutf8'.
Thanks!!!!!

It is a normal question mark.

I entered the commands as suggested
> LANG=en_US.CP1252 locale -> Bad
> LANG=en_US.ISO-8859-15 locale -> Good
> LANG=en_US.ISO-8859-1 locale -> Good
> LANG=en_US.UTF-8 locale -> Good

++++
$ LANG=en_US.CP1252 locale
locale: Cannot set LC_CTYPE to default locale: No such file or directory
locale: Cannot set LC_MESSAGES to default locale: No such file or directory
locale: Cannot set LC_ALL to default locale: No such file or directory
LANG=en_US.CP1252
LC_CTYPE="en_US.CP1252"
LC_NUMERIC="en_US.CP1252"
LC_TIME="en_US.CP1252"
LC_COLLATE="en_US.CP1252"
LC_MONETARY="en_US.CP1252"
LC_MESSAGES="en_US.CP1252"
LC_PAPER="en_US.CP1252"
LC_NAME="en_US.CP1252"
LC_ADDRESS="en_US.CP1252"
LC_TELEPHONE="en_US.CP1252"
LC_MEASUREMENT="en_US.CP1252"
LC_IDENTIFICATION="en_US.CP1252"
LC_ALL=

$ LANG=en_US.ISO-8859-15 locale
LANG=en_US.ISO-8859-15
LC_CTYPE="en_US.ISO-8859-15"
LC_NUMERIC="en_US.ISO-8859-15"
LC_TIME="en_US.ISO-8859-15"
LC_COLLATE="en_US.ISO-8859-15"
LC_MONETARY="en_US.ISO-8859-15"
LC_MESSAGES="en_US.ISO-8859-15"
LC_PAPER="en_US.ISO-8859-15"
LC_NAME="en_US.ISO-8859-15"
LC_ADDRESS="en_US.ISO-8859-15"
LC_TELEPHONE="en_US.ISO-8859-15"
LC_MEASUREMENT="en_US.ISO-8859-15"
LC_IDENTIFICATION="en_US.ISO-8859-15"
LC_ALL=
[netcool@impact01 netcool]$ LANG=en_US.ISO-8859-1 locale
LANG=en_US.ISO-8859-1
LC_CTYPE="en_US.ISO-8859-1"
LC_NUMERIC="en_US.ISO-8859-1"
LC_TIME="en_US.ISO-8859-1"
LC_COLLATE="en_US.ISO-8859-1"
LC_MONETARY="en_US.ISO-8859-1"
LC_MESSAGES="en_US.ISO-8859-1"
LC_PAPER="en_US.ISO-8859-1"
LC_NAME="en_US.ISO-8859-1"
LC_ADDRESS="en_US.ISO-8859-1"
LC_TELEPHONE="en_US.ISO-8859-1"
LC_MEASUREMENT="en_US.ISO-8859-1"
LC_IDENTIFICATION="en_US.ISO-8859-1"
LC_ALL=

$ LANG=en_US.UTF-8 locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
From: syd_p on
OK I got this:
> LANG=en_US.CP1252 locale -> Bad
> LANG=en_US.ISO-8859-15 locale -> Good
> LANG=en_US.ISO-8859-1 locale -> Good
> LANG=en_US.UTF-8 locale -> Good

But I am not sure how to proceed. I have this entry from the CP1252
table, "ë 00EB 235" (the character, its code in hex and in decimal),
which I want to handle on CentOS 3.8.
And glibc does list CP1252 as a charmap:
$ locale -m | grep '^CP'
....
CP1252

But "LANG=en_US.CP1252 locale" does not work.

And yet with LANG=C, which I thought was only 7 bits, the following
printfs work just fine.

$ printf "(octal 353) is the character \0353\n"
(octal 353) is the character ë
$ printf "(octal 361) is the character \0361\n"
(octal 361) is the character ñ
These are two of the characters in the MSSQL db which the application
(not open source) handles as "?".

Puzzled now!
Please help!