From: Mihai N. on

> You would like to have a CString with Unicode UTF-16 representation of
> your Cyrillic characters.

No. Most likely he has junk, because the characters are in some
Cyrillic code page (cp1251 or KOI8-R) and were converted to UTF-16
as if they were cp1252.
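
To make the failure concrete, here is a minimal sketch, assuming the page
bytes are CC B3 (the pair discussed elsewhere in this thread). The function
names are hypothetical, and the cp1251 table is deliberately partial; a real
program would more likely call MultiByteToWideChar with code page 1251.

```cpp
#include <string>

// Wrong for a cp1251 page: widen each byte as if it were Windows-1252.
// For ASCII and bytes 0xA0..0xFF, cp1252 maps a byte to the code point with
// the same value, which is all this sketch handles.
std::u16string WidenAsCp1252(const std::string& bytes) {
    std::u16string out;
    for (unsigned char b : bytes) {
        out.push_back(static_cast<char16_t>(b));
    }
    return out;
}

// Right for a cp1251 page, but only a partial table (an assumption made to
// keep the example short): ASCII, the 0xC0..0xFF letter block, and the
// Ukrainian/Belarusian I at 0xB2/0xB3. Anything else becomes U+FFFD.
std::u16string DecodeCp1251Subset(const std::string& bytes) {
    std::u16string out;
    for (unsigned char b : bytes) {
        char16_t ch = 0xFFFD;
        if (b < 0x80) {
            ch = b;                                          // ASCII
        } else if (b >= 0xC0) {
            ch = static_cast<char16_t>(0x0410 + (b - 0xC0)); // U+0410..U+044F
        } else if (b == 0xB2) {
            ch = 0x0406;                                     // capital I
        } else if (b == 0xB3) {
            ch = 0x0456;                                     // small i
        }
        out.push_back(ch);
    }
    return out;
}
```

Widening CC B3 as 1252 gives U+00CC U+00B3, while decoding the same bytes
as 1251 gives U+041C U+0456.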



--
Mihai Nita [Microsoft MVP, Visual C++]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email

From: Mihai N. on
> My application is compiled in UNICODE. I am downloading webpages using
> cyrillic characters for their content. Although these files themselves are
> ASCII.

Then the content does not belong in a CString.
- download the content into a char buffer
- detect the encoding (from the HTTP header or the meta tag in the buffer)
- convert to Unicode using MultiByteToWideChar (and store in a CString)
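
The detection and conversion steps above could be sketched roughly as
follows. FindCharset, CodePageFor and ToWide are hypothetical helper names,
the charset search is a naive substring scan rather than a real HTML parser,
and the code page table lists only a few entries as an illustration. The
MultiByteToWideChar part is compiled only on Windows.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Return the lowercased value following "charset=" in an HTTP header or a
// meta tag, or an empty string if none is found.
std::string FindCharset(const std::string& text) {
    std::string lower(text);
    std::transform(lower.begin(), lower.end(), lower.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    const std::string key = "charset=";
    std::size_t pos = lower.find(key);
    if (pos == std::string::npos) return std::string();
    pos += key.size();
    if (pos < lower.size() && (lower[pos] == '"' || lower[pos] == '\'')) ++pos;
    std::size_t end = pos;
    while (end < lower.size() &&
           (std::isalnum(static_cast<unsigned char>(lower[end])) ||
            lower[end] == '-' || lower[end] == '_')) {
        ++end;
    }
    return lower.substr(pos, end - pos);
}

// Map a charset name to a Windows code page identifier; 0 means unknown.
unsigned CodePageFor(const std::string& charset) {
    if (charset == "windows-1251") return 1251u;
    if (charset == "koi8-r") return 20866u;
    if (charset == "koi8-u") return 21866u;
    if (charset == "windows-1256") return 1256u;  // Arabic
    if (charset == "utf-8") return 65001u;
    return 0u;
}

#ifdef _WIN32
// Convert the raw bytes to UTF-16 with the detected code page.
// Returns an empty string on failure.
std::wstring ToWide(const std::string& bytes, unsigned codePage) {
    if (bytes.empty()) return std::wstring();
    const int len = static_cast<int>(bytes.size());
    // First call asks for the required length, second call converts.
    int n = MultiByteToWideChar(codePage, 0, bytes.data(), len, nullptr, 0);
    if (n <= 0) return std::wstring();
    std::wstring out(static_cast<std::size_t>(n), L'\0');
    n = MultiByteToWideChar(codePage, 0, bytes.data(), len, &out[0], n);
    if (n <= 0) return std::wstring();
    return out;
}
#endif
```

In a Unicode MFC build, the result of ToWide could then be copied into a
CString, for example with CString s(wide.c_str()).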



From: Mihai N. on
> CC B3
>
> Which 'should' be a cyrillic capital M?

CC is Cyrillic capital M in cp1251
B3 is Cyrillic lowercase i in cp1251

You have junk in your CString.
See my previous post.
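
If the junk is already sitting in a wide string, one possible repair is to
take the bytes back out and convert them again with the right code page.
This is only a sketch: NarrowToBytes is a hypothetical helper, and it works
only when every byte was widened to the code point with the same value.

```cpp
#include <string>

// Undo a byte-for-byte widening by taking the low byte of each UTF-16 unit.
// Returns false (and leaves `bytes` empty) if any unit is above 0xFF, which
// means the text was not produced by simple widening and cannot be
// recovered this way.
bool NarrowToBytes(const std::u16string& wide, std::string& bytes) {
    bytes.clear();
    for (char16_t unit : wide) {
        if (unit > 0xFF) {
            bytes.clear();
            return false;
        }
        bytes.push_back(static_cast<char>(static_cast<unsigned char>(unit)));
    }
    return true;
}
```

The recovered bytes (CC B3 in this case) can then go through
MultiByteToWideChar with code page 1251. Keeping the download in a char
buffer from the start avoids the round trip.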


From: PRMARJORAM on
It is a Ukrainian webpage.
Thanks everyone for your input, I've got a lot to work on now. I will try all
this out.
I'm hoping to have it working sometime today.

Again, in a nutshell: I'm downloading webpages from foreign websites that do
not necessarily use our charset, and I need to display a subset of the textual
content within a CListCtrl. I understand I also need to use specific fonts
to achieve this once I have the correct string representation.

After Cyrillic it will also need to work for other charsets such as
Arabic etc.

Thanks again. I shall post my results.



"Alexander Grigoriev" wrote:

> Well, CC is indeed Cyrillic M in CP1251, though B3 maps to Ukrainian 'i'
>
> "Joseph M. Newcomer" <newcomer(a)flounder.com> wrote in message
> news:uvjga5p7jm31771h0o7n4v7rvbomrh27mr(a)4ax.com...
> >I thought of that, but the problem is that there are three ways to look at
> >the sequence
> > CCB3 (or B3CC)
> >
> > As two 8-bit characters: Ì³ (that's capital I with grave accent followed
> > by a superscript 3)
> > As a UTF-8 encoding: It doesn't decode into anything sensible
> > As a Unicode character: Neither U+CCB3 nor U+B3CC are valid characters.
> >
> > But I agree: it has to be stored as a CStringA or other 8-bit
> > representation.
> >
> > So the question is, what could this encoding mean? I tried all kinds of
> > encodings in the Locale Explorer, and nothing worked out.
> > joe
> >
> > On Wed, 09 Sep 2009 23:43:54 +0200, Giovanni Dicanio
> > <giovanniDOTdicanio(a)REMOVEMEgmail.com> wrote:
> >
> >>
> >>PRMARJORAM ha scritto:
> >>
> >>> Giovanni, I must have explained the problem pretty well, as you have
> >>> pretty much understood it. Yes, the webpage I'm downloading in this
> >>> particular instance is as you specified.
> >>>
> >>> <meta http-equiv="Content-Type" content="text/html;
> >>> charset=windows-1251">
> >>
> >>This text is explicitly stating that the code page is a Windows-1251, so
> >>it is an ANSI/MBCS string. I think that you should store this string in
> >>a CStringA, or in a std::string (i.e. in a string class based on char's,
> >>not on WCHAR's).
> >>
> >>Then you can use MultiByteToWideChar or CA2WEX to convert from this
> >>ANSI/MBCS string to Unicode string, and store the resulting Unicode
> >>string in a CStringW or std::wstring class (or just in a CString class
> >>if you use Unicode build, where CString's are based on WCHAR's).
> >>
> >>i.e. the original memory layout of your string should be something like
> >>this (bytes expressed in hex):
> >>
> >> <meta ...
> >>
> >> 3C 6D 65 74 61 ...
> >> '<' 'm' 'e' 't' 'a' ...
> >>
> >>It makes sense to store this in a std::string or CStringA, but *not* in
> >>a CStringW.
> >>
> >>Instead, if the memory layout of your text is something like this:
> >>
> >> 3C 00 6D 00 65 00 74 00 61 00 ...
> >> L'<' L'm' L'e' L't' L'a' ...
> >>
> >>then it might make sense to store this in a CStringW.
> >>However, this is kind of a "lie", a false statement, because you are
> >>using a Unicode string, but the 'charset' attribute is set to
> >>'windows-1251'.
> >>In this "strange" case, I would strip the 00 bytes from the input
> >>string, and convert it to the first form, i.e.
> >>
> >> 3C 6D 65 74 61 ...
> >>
> >>store it in a std::string or CStringA, and then call MultiByteToWideChar
> >>or CA2WEX using Windows-1251 code page identifier to get the proper
> >>Unicode UTF-16 string.
> >>
> >>HTH,
> >>Giovanni
> >>
> > Joseph M. Newcomer [MVP]
> > email: newcomer(a)flounder.com
> > Web: http://www.flounder.com
> > MVP Tips: http://www.flounder.com/mvp_tips.htm
>
>
>
From: PRMARJORAM on
It's not junk. It's exactly as you say.




"Mihai N." wrote:

> > CC B3
> >
> > Which 'should' be a cyrillic capital M?
>
> CC is Cyrillic capital M in cp1251
> B3 is Cyrillic lowercase i in cp1251
>
> You have junk in your CString.
> See my previous post.