From: PRMARJORAM on
My application is compiled as UNICODE. I am downloading webpages whose
content uses Cyrillic characters, although the files themselves are
ASCII.

---
The encoding setting within the webpage or the user's browser determines
how the content is interpreted and then displayed when used within a
browser.
---

My problem is that my CString containing this content is WCHAR, so I need to
convert 2 consecutive WCHARs to a single WCHAR to get the correct
Cyrillic code to display.

I'm not clear how to walk through the string doing this, assuming it is not
simply adding the two WCHAR values together?

Can anyone clarify this issue?

Thanks.

From: David Wilkinson on
PRMARJORAM wrote:
> My application is compiled in UNICODE. I am downloading webpages using
> cyrillic characters for their content. Although these files themselves are
> ASCII.
>
> ---
> Based on the encoding setting within the webpage or the users browser
> determines how the content is to be interpreted and then displayed when used
> within a browser.
> ---
>
> My problem is my CString containing this content is WCHAR and so I need to
> convert 2 consecutive WCHAR to a single WCHAR to then get the correct
> cyrillic code to display.
>
> Im not clear how to walk through the string doing this, assuming it not
> simply adding the two WCHAR values together?
>
> Can anyone clarify this issue?

If you know the code page of the web site, then you can convert the
downloaded text to UTF-16 (Windows Unicode) using the
MultiByteToWideChar() function.
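
A minimal sketch of that conversion, assuming the raw downloaded bytes are
still available and the page is Windows-1251 (code page 1251).
DecodeCp1251 is a hypothetical helper name. The non-Windows branch is only
an illustrative fallback so the sketch compiles elsewhere; it handles ASCII
and the contiguous letter range 0xC0..0xFF and nothing more.

```cpp
#include <cassert>
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Decode raw Windows-1251 bytes (as downloaded) into a wide string.
// DecodeCp1251 is a hypothetical helper, not part of any library.
std::wstring DecodeCp1251(const std::string& bytes) {
    if (bytes.empty()) return std::wstring();
#ifdef _WIN32
    const int srcLen = static_cast<int>(bytes.size());
    // First call: ask how many UTF-16 code units the result needs.
    const int dstLen =
        ::MultiByteToWideChar(1251, 0, bytes.data(), srcLen, nullptr, 0);
    if (dstLen <= 0) return std::wstring();
    std::wstring out(static_cast<std::size_t>(dstLen), L'\0');
    // Second call: convert into the buffer that was just sized.
    const int written =
        ::MultiByteToWideChar(1251, 0, bytes.data(), srcLen, &out[0], dstLen);
    assert(written == dstLen);
    (void)written;
    return out;
#else
    // Illustrative fallback only: ASCII passes through, 0xC0..0xFF map to
    // U+0410..U+044F, and every other byte becomes U+FFFD.
    std::wstring out;
    out.reserve(bytes.size());
    for (unsigned char b : bytes) {
        if (b < 0x80)       out.push_back(static_cast<wchar_t>(b));
        else if (b >= 0xC0) out.push_back(static_cast<wchar_t>(0x0410 + (b - 0xC0)));
        else                out.push_back(static_cast<wchar_t>(0xFFFD));
    }
    return out;
#endif
}
```

The important point is that the input is the byte string as received, not a
CString that has already been widened to WCHAR.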

--
David Wilkinson
Visual C++ MVP
From: Giovanni Dicanio on
PRMARJORAM ha scritto:
> My application is compiled in UNICODE. I am downloading webpages using
> cyrillic characters for their content. Although these files themselves are
> ASCII.
[...]
> My problem is my CString containing this content is WCHAR and so I need to
> convert 2 consecutive WCHAR to a single WCHAR to then get the correct
> cyrillic code to display.


My understanding of your problem is as follows:

You have some text (coming from an ANSI webpage, using Cyrillic
codepage, i.e. something like Windows-1251).
This text is stored in an instance of CString, in a Unicode app (meaning
that CString is actually a CStringW, or if you are using Visual C++ 6,
CString is using WCHAR as TCHAR expansion).

You would like to have a CString with Unicode UTF-16 representation of
your Cyrillic characters.

Is this correct?

If so, I would use a two-pass conversion:

1. Convert your CString content from Unicode to ANSI, using your code
page (e.g. 1251).
You could use WideCharToMultiByte (the counterpart of the
MultiByteToWideChar function David already suggested), or you may
use the easier CW2AEX helper class, specifying the proper code-page
identifier (e.g. 1251 for Windows-1251 Cyrillic) in the constructor.

For a list of code page identifiers, see:

"Code Page Identifiers"
http://msdn.microsoft.com/en-us/library/dd317756(VS.85).aspx

(Note that the C<X>2<Y> helper classes have been available since VC++7.1;
they are not available in VC6. If you are using that older compiler, you
must use the WideCharToMultiByte Win32 API.)

2. Given the ANSI string (in Cyrillic code-page) returned in point #1,
you can convert it to Unicode, using MultiByteToWideChar or CA2WEX
helper class.

As a result of that, you will have a simple Unicode UTF-16 string
storing your Cyrillic characters.
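
One possible reading of pass 1, sketched under a loud assumption: the
download code widened every raw byte to one WCHAR unchanged (so byte 0xCC
became 0x00CC). In that case the first pass is plain narrowing rather than
a WideCharToMultiByte call with code page 1251, which is a swapped-in
technique and not what the steps above literally say. NarrowWidenedBytes is
a hypothetical helper name.

```cpp
#include <string>

// Pass 1 under the stated assumption: each wide character holds one raw
// byte. Returns false (and leaves `bytes` empty) if any element is above
// 0xFF, which means the string was not produced by plain byte widening
// and this approach does not apply.
bool NarrowWidenedBytes(const std::wstring& widened, std::string& bytes) {
    bytes.clear();
    bytes.reserve(widened.size());
    for (wchar_t ch : widened) {
        const unsigned long value = static_cast<unsigned long>(ch);
        if (value > 0xFF) {
            bytes.clear();
            return false;
        }
        bytes.push_back(static_cast<char>(static_cast<unsigned char>(value)));
    }
    return true;
}
```

The recovered bytes would then go through pass 2 (MultiByteToWideChar or
CA2WEX with code page 1251). If the function returns false, the string was
built some other way and the input format needs a closer look first.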

HTH,
Giovanni
From: Giovanni Dicanio on
PRMARJORAM ha scritto:
> My application is compiled in UNICODE. I am downloading webpages using
> cyrillic characters for their content. Although these files themselves are
> ASCII.
[...]
> My problem is my CString containing this content is WCHAR and so I need to
> convert 2 consecutive WCHAR to a single WCHAR to then get the correct
> cyrillic code to display.

I think that what I previously wrote may not be the right answer to your
question.

Could you clarify the format of the input string a little better?

For example, in the Cyrillic code page 1251 I read here:

http://www.fingertipsoft.com/ref/cyrillic/cp1251.html

there is a character like an upper-case "K" (code: 202 dec, 0xCA hex).

How is this character stored in your input string?
What are the values of the two WCHARs that you want to convert to one
single WCHAR, in this particular case?
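
A small diagnostic sketch for answering that question: it prints each wide
character of a string as a hexadecimal value, so the actual WCHAR contents
can be read off. DumpCodeUnits is a hypothetical helper name, and
std::wstring stands in for the CString here.

```cpp
#include <cstdio>
#include <string>

// Format every element of a wide string as at least four uppercase hex
// digits, separated by single spaces.
std::string DumpCodeUnits(const std::wstring& text) {
    std::string out;
    char buf[24];
    for (std::size_t i = 0; i < text.size(); ++i) {
        std::snprintf(buf, sizeof buf, "%04lX",
                      static_cast<unsigned long>(text[i]));
        if (i != 0) out += ' ';
        out += buf;
    }
    return out;
}
```

In a Unicode MFC build, the CString can probably be passed as
std::wstring(static_cast<LPCWSTR>(str)). Values such as 00CC would suggest
bytes that were widened one to one, while values such as 041C would mean the
text is already proper UTF-16.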

Thanks,
Giovanni
From: PRMARJORAM on
Giovanni, I must have explained the problem pretty well, as you have pretty
much understood it. Yes, the webpage I'm downloading in this particular
instance is as you specified.

<meta http-equiv="Content-Type" content="text/html; charset=windows-1251">

OK, using a binary viewer, the first Cyrillic code in the <title> tag is

CC B3

which 'should' be a Cyrillic capital M?

I hope this helps. Thanks again.
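
For reference, a sketch of how those two bytes read under the declared
charset. Windows-1251 is a single-byte encoding, so CC and B3 would be two
separate characters rather than one two-byte code: 0xCC is U+041C (Cyrillic
capital EM) and 0xB3 is U+0456 (Cyrillic small Byelorussian-Ukrainian I).
Cp1251ToCodePoint is a hypothetical helper covering only a handful of
entries of the code page, not a full table.

```cpp
#include <cstdint>

// Map one Windows-1251 byte to a Unicode code point.
// Covers ASCII, the contiguous letters 0xC0..0xFF, and four extra
// letters; every other byte yields U+FFFD in this partial sketch.
constexpr std::uint32_t Cp1251ToCodePoint(unsigned char b) {
    return b < 0x80  ? static_cast<std::uint32_t>(b)
         : b >= 0xC0 ? 0x0410u + (b - 0xC0u)
         : b == 0xA8 ? 0x0401u   // capital IO
         : b == 0xB8 ? 0x0451u   // small io
         : b == 0xB2 ? 0x0406u   // capital Byelorussian-Ukrainian I
         : b == 0xB3 ? 0x0456u   // small Byelorussian-Ukrainian i
         : 0xFFFDu;
}
```

With this mapping the same rule gives U+041A for byte 0xCA, the upper-case
"K" look-alike mentioned earlier in the thread.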






"Giovanni Dicanio" wrote:

> PRMARJORAM ha scritto:
> > My application is compiled in UNICODE. I am downloading webpages using
> > cyrillic characters for their content. Although these files themselves are
> > ASCII.
> [...]
> > My problem is my CString containing this content is WCHAR and so I need to
> > convert 2 consecutive WCHAR to a single WCHAR to then get the correct
> > cyrillic code to display.
>
> I think that what I previously wrote may not be the right answer to your
> question.
>
> Could it be possible for you to clarify a little better the format of
> the input string?
>
> For example, in the Cyrillic code page 1251 I read here:
>
> http://www.fingertipsoft.com/ref/cyrillic/cp1251.html
>
> there is a character like an upper-case "K" (code: 202 dec, 0xCA hex).
>
> How is this character stored in your input string?
> What are the values of the two WCHAR's that you want to convert to one
> single WCHAR, in this particular case?
>
> Thanks,
> Giovanni
>