From: PRMARJORAM on
Joe, in my journey to uncover the mystery of Unicode I have come across quite
a few of your examples and they have helped a lot. But what I stated earlier
was written when I understood less about what I am trying to do than I do now.
These are simple single-byte extended-ASCII codes. I assume that when I convert
them to Unicode using the code page parameter, they will become the correct
codes, as you have suggested, for displaying in my CListCtrl.

What I originally assumed about a webpage with this charset was that two
bytes were combined into one Unicode value (2:1), but it is still 1:1: each
byte maps to one Unicode character, selected by the code page parameter.

"Joseph M. Newcomer" wrote:

> CC B3 is not a recognizable encoding. The Russian symbol that displays as "M" is code
> U+041C, and it does not encode into CC B3. CCB3 does not decode into anything recognizably
> Unicode, nor does B3CC. For more details and the ability to experiment, I suggest
> downloading my Locale Explorer from my MVP Tips site.
>
> You need to know the encoding. (Note that I tried using Windows-1251 as well).
> joe
>
> On Wed, 9 Sep 2009 07:42:01 -0700, PRMARJORAM <PRMARJORAM(a)discussions.microsoft.com>
> wrote:
>
> >Giovanni, I must have explained the problem pretty well, as you have pretty much
> >understood it. Yes, the webpage I'm downloading in this particular instance
> >is as you specified.
> >
> ><meta http-equiv="Content-Type" content="text/html; charset=windows-1251">
> >
> >OK, using a binary viewer, the first Cyrillic code in the <title> tag is
> >
> >CC B3
> >
> >which 'should' be a Cyrillic capital M?
> >
> >I hope this helps. Thanks again.
> >
> >"Giovanni Dicanio" wrote:
> >
> >> PRMARJORAM wrote:
> >> > My application is compiled in UNICODE. I am downloading webpages using
> >> > cyrillic characters for their content. Although these files themselves are
> >> > ASCII.
> >> [...]
> >> > My problem is my CString containing this content is WCHAR and so I need to
> >> > convert 2 consecutive WCHAR to a single WCHAR to then get the correct
> >> > cyrillic code to display.
> >>
> >> I think that what I previously wrote may not be the right answer to your
> >> question.
> >>
> >> Could it be possible for you to clarify a little better the format of
> >> the input string?
> >>
> >> For example, in the Cyrillic code page 1251 I read here:
> >>
> >> http://www.fingertipsoft.com/ref/cyrillic/cp1251.html
> >>
> >> there is a character like an upper-case "K" (code: 202 dec, 0xCA hex).
> >>
> >> How is this character stored in your input string?
> >> What are the values of the two WCHAR's that you want to convert to one
> >> single WCHAR, in this particular case?
> >>
> >> Thanks,
> >> Giovanni
> >>
> Joseph M. Newcomer [MVP]
> email: newcomer(a)flounder.com
> Web: http://www.flounder.com
> MVP Tips: http://www.flounder.com/mvp_tips.htm
>
From: Giovanni Dicanio on
PRMARJORAM wrote:

> Plus when you compile your app to UNICODE all your CStrings change to WCHAR
> and you call Wide versions of everything.

I don't know which version of VC++ you are using.
If you are using VC++ 7.1 or later (e.g. VC++ 7.1 in VS.NET 2003, VC8 in
VS2005, VC9 in VS2008...), then you can have both CStringA (CHAR-based)
and CStringW (WCHAR-based) in the same project.

Moreover, if you use VC6 (where a Unicode build only gives you a CString
based on WCHAR) and need to store an ANSI/MBCS string in a robust C++
class, you could use the STL class std::string.
In fact, std::string stores chars in both ANSI/MBCS and Unicode builds.

In particular, considering your problem, if the web pages that you get
use an ANSI/MBCS encoding (not Unicode), then I would suggest you use
std::string or CStringA (instead of a WCHAR-based CString) to store them.

Then you can call MultiByteToWideChar (or use the CA2WEX class) to convert
from the specific code page to Unicode, store the resulting Unicode
string in a CString (or an explicit CStringW) in your Unicode app, and
show the Unicode strings in list views or wherever you want.
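
A minimal sketch of that conversion, assuming the downloaded bytes are already
in a std::string. The helper name Cp1251ToWide is hypothetical. On Windows it
calls MultiByteToWideChar with code page 1251; on other platforms it falls
back to a deliberately partial hand-written table (ASCII, the 0xC0..0xFF
Cyrillic block and four extra letters), so treat the fallback as an
illustration only, not a full Windows-1251 decoder.

```cpp
#include <cassert>
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: converts Windows-1251 bytes to a wide string.
std::wstring Cp1251ToWide(const std::string& bytes)
{
    if (bytes.empty())
        return std::wstring();
#ifdef _WIN32
    const int srcLen = static_cast<int>(bytes.size());
    // First call: ask how many WCHARs the result needs.
    const int needed = ::MultiByteToWideChar(1251, 0, bytes.data(), srcLen,
                                             nullptr, 0);
    if (needed <= 0)
        return std::wstring();
    std::wstring wide(static_cast<size_t>(needed), L'\0');
    // Second call: perform the conversion into the buffer.
    const int written = ::MultiByteToWideChar(1251, 0, bytes.data(), srcLen,
                                              &wide[0], needed);
    assert(written == needed);
    (void)written;
    return wide;
#else
    // Partial fallback table (assumption: enough for this illustration).
    std::wstring wide;
    wide.reserve(bytes.size());
    for (unsigned char b : bytes) {
        wchar_t ch;
        if (b < 0x80)       ch = static_cast<wchar_t>(b);                   // ASCII
        else if (b >= 0xC0) ch = static_cast<wchar_t>(0x0410 + (b - 0xC0)); // U+0410..U+044F
        else if (b == 0xA8) ch = static_cast<wchar_t>(0x0401);              // capital IO
        else if (b == 0xB8) ch = static_cast<wchar_t>(0x0451);              // small io
        else if (b == 0xB2) ch = static_cast<wchar_t>(0x0406);              // capital Ukrainian I
        else if (b == 0xB3) ch = static_cast<wchar_t>(0x0456);              // small Ukrainian i
        else                ch = static_cast<wchar_t>(0xFFFD);              // not in this sketch
        wide.push_back(ch);
    }
    return wide;
#endif
}
```

In an MFC Unicode build the result could then be handed to a CString, for
example CString text(Cp1251ToWide(pageBytes).c_str()), where pageBytes is a
placeholder for wherever the downloaded bytes are kept.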

Giovanni
From: Giovanni Dicanio on
PRMARJORAM wrote:

> Giovanni, I must have explained the problem pretty well, as you have pretty much
> understood it. Yes, the webpage I'm downloading in this particular instance
> is as you specified.
>
> <meta http-equiv="Content-Type" content="text/html; charset=windows-1251">

This text explicitly states that the code page is Windows-1251, so
it is an ANSI/MBCS string. I think that you should store this string in
a CStringA, or in a std::string (i.e. in a string class based on chars,
not on WCHARs).

Then you can use MultiByteToWideChar or CA2WEX to convert this
ANSI/MBCS string to a Unicode string, and store the result in a
CStringW or std::wstring (or just in a CString if you use a Unicode
build, where CStrings are based on WCHARs).

That is, the original memory layout of your string should be something like
this (bytes expressed in hex):

<meta ...

3C 6D 65 74 61 ...
'<' 'm' 'e' 't' 'a' ...

It makes sense to store this in a std::string or CStringA, but *not* in
a CStringW.

Instead, if the memory layout of your text is something like this:

3C 00 6D 00 65 00 74 00 61 00 ...
L'<' L'm' L'e' L't' L'a' ...

then it might make sense to store this in a CStringW.
However, this is kind of a "lie", a false statement, because you are
using a Unicode string, but the 'charset' attribute is set to
'windows-1251'.
In this "strange" case, I would strip the 00 bytes from the input
string to convert it to the first form, i.e.

3C 6D 65 74 61 ...

store it in a std::string or CStringA, and then call MultiByteToWideChar
or CA2WEX using the Windows-1251 code page identifier (1251) to get the
proper UTF-16 Unicode string.
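
A minimal sketch of the "strip the 00 bytes" step, assuming the text arrived
as a wide string in which every WCHAR holds just one original byte (CC 00 B3 00
in memory). The helper name NarrowZeroExtended is hypothetical; it refuses
input that contains anything above 0xFF, since such text was not produced by
this kind of byte-by-byte widening.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper: undoes a byte-to-WCHAR widening by keeping only the
// low byte of each wide character.
std::string NarrowZeroExtended(const std::wstring& widened)
{
    std::string bytes;
    bytes.reserve(widened.size());
    for (wchar_t ch : widened) {
        const unsigned long value = static_cast<unsigned long>(ch);
        if (value > 0xFF)
            throw std::invalid_argument("not a zero-extended 8-bit string");
        bytes.push_back(static_cast<char>(static_cast<unsigned char>(value)));
    }
    return bytes;
}
```

The resulting std::string holds the original Windows-1251 bytes and can be
passed on to the code page conversion.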

HTH,
Giovanni


From: Joseph M. Newcomer on
I thought of that, but the problem is that there are three ways to look at the sequence
CCB3 (or B3CC):

As two 8-bit characters: Ì³ (that's capital I with grave accent followed by a superscript
3)
As a UTF-8 encoding: it decodes to U+0333, a combining double low line, which is nothing sensible here
As a Unicode character: neither U+CCB3 nor U+B3CC is Cyrillic (both are Hangul syllables).

But I agree: it has to be stored as a CStringA or other 8-bit representation.

So the question is, what could this encoding mean? I tried all kinds of encodings in the
Locale Explorer, and nothing worked out.
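
The UTF-8 and 16-bit readings can be checked mechanically. A minimal sketch
(all function names are hypothetical): read as a two-byte UTF-8 sequence,
CC B3 works out to U+0333, a combining double low line, and joined into one
16-bit value either way round (0xCCB3 or 0xB3CC) the result lands in the
Hangul Syllables block, so neither reading yields a Cyrillic letter.

```cpp
#include <cstdint>

// Decodes a two-byte UTF-8 sequence; returns -1 if the bytes are not a
// well-formed two-byte sequence.
long DecodeUtf8Pair(std::uint8_t lead, std::uint8_t trail)
{
    const bool leadOk  = (lead  & 0xE0) == 0xC0;  // 110xxxxx
    const bool trailOk = (trail & 0xC0) == 0x80;  // 10xxxxxx
    if (!leadOk || !trailOk)
        return -1;
    const long cp = ((lead & 0x1F) << 6) | (trail & 0x3F);
    return cp >= 0x80 ? cp : -1;  // reject overlong forms
}

// Joins two bytes into one 16-bit value, first argument as the high byte.
std::uint16_t JoinBytes(std::uint8_t high, std::uint8_t low)
{
    return static_cast<std::uint16_t>((high << 8) | low);
}

// True for the Hangul Syllables block, U+AC00..U+D7A3.
bool IsHangulSyllable(std::uint16_t unit)
{
    return unit >= 0xAC00 && unit <= 0xD7A3;
}
```

For example, DecodeUtf8Pair(0xCC, 0xB3) gives 0x0333, while
DecodeUtf8Pair(0xB3, 0xCC) is rejected because B3 is a continuation byte, not
a lead byte.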
joe

On Wed, 09 Sep 2009 23:43:54 +0200, Giovanni Dicanio
<giovanniDOTdicanio(a)REMOVEMEgmail.com> wrote:

>
>PRMARJORAM wrote:
>
>> Giovanni, I must have explained the problem pretty well, as you have pretty much
>> understood it. Yes, the webpage I'm downloading in this particular instance
>> is as you specified.
>>
>> <meta http-equiv="Content-Type" content="text/html; charset=windows-1251">
>
>This text explicitly states that the code page is Windows-1251, so
>it is an ANSI/MBCS string. I think that you should store this string in
>a CStringA, or in a std::string (i.e. in a string class based on chars,
>not on WCHARs).
>
Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
From: Alexander Grigoriev on
Well, CC is indeed Cyrillic capital M (U+041C) in CP1251, though B3 maps to the Ukrainian 'і' (U+0456).

"Joseph M. Newcomer" <newcomer(a)flounder.com> wrote in message
news:uvjga5p7jm31771h0o7n4v7rvbomrh27mr(a)4ax.com...
>I thought of that, but the problem is that there are three ways to look at
>the sequence
> CCB3 (or B3CC):
>
> As two 8-bit characters: Ì³ (that's capital I with grave accent followed
> by a superscript 3)
> As a UTF-8 encoding: it decodes to U+0333, a combining double low line,
> which is nothing sensible here
> As a Unicode character: neither U+CCB3 nor U+B3CC is Cyrillic (both are
> Hangul syllables).
>