From: PRMARJORAM on
At last its working.

CStringA alpha = W2A(item.mLine.c_str());
CStringW wide = CA2WEX<>(alpha,1251);

Seems after all this I did not need to re-compile my app to UNICODE as alls
i really needed to do was include the function call in the second line of
code above.

Assuming here processing and comparing CStringA or std::string is more
efficient than processing CStringW or std:wstring?

Thanks

Easy when you know how. :-)



"Giovanni Dicanio" wrote:

> PRMARJORAM ha scritto:
> > My application is compiled in UNICODE. I am downloading webpages using
> > cyrillic characters for their content. Although these files themselves are
> > ASCII.
> [...]
> > My problem is my CString containing this content is WCHAR and so I need to
> > convert 2 consecutive WCHAR to a single WCHAR to then get the correct
> > cyrillic code to display.
>
> I think that what I previously wrote may not be the right answer to your
> question.
>
> Could it be possible for you to clarify a little better the format of
> the input string?
>
> For example, in the Cyrillic code page 1251 I read here:
>
> http://www.fingertipsoft.com/ref/cyrillic/cp1251.html
>
> there is a character like an upper-case "K" (code: 202 dec, 0xCA hex).
>
> How is this character stored in your input string?
> What are the values of the two WCHAR's that you want to convert to one
> single WCHAR, in this particular case?
>
> Thanks,
> Giovanni
>
From: Giovanni Dicanio on
PRMARJORAM ha scritto:

> Again in a nutshell, im downloading webpages from foreign websites not
> necessarily using our charset and needing to display a subset of the textual
> content within a CListCtrl. I understand I also need to use specific fonts
> to acheive this once I have the correct string representation.
>
> After the cyrillic it will also need to work for other charsets such as
> Arabic etc.

I developed a small MFC test program to try to implement the idea for
converting an HTML web page to Unicode UTF-16, so the text can be
displayed and used inside Windows Unicode apps:

http://www.geocities.com/giovanni.dicanio/vc/HtmlTextDecoder.zip

Basically, there are 3 steps: the text is read as a "raw" char array;
the text is parsed to find the 'charset=' substring; the text is
converted to Unicode based on the value of charset.

This is the code as implemented in button-click handler:

<code>

//
// Load content of file in a raw char array.
//
std::vector<BYTE> fileContent;
if ( !
HtmlDecodeHelpers::ReadFileInCharArray(dlgOpenFile.GetPathName(),
fileContent) )
{
AfxMessageBox(IDS_ERROR_IN_OPENING_FILE, MB_OK|MB_ICONERROR);
return;
}


//
// Extract 'charset' field from the loaded HTML file.
//
std::string charset =
HtmlDecodeHelpers::ParseCharsetFromHTMLFile(fileContent);
if (charset.empty() )
{
// Charset not found
AfxMessageBox(IDS_ERROR_CHARSET_NOT_FOUND, MB_OK|MB_ICONERROR);
return;
}


//
// Convert loaded file to Unicode, basing on charset specification.
//
std::wstring unicodeContent;
if (!
HtmlDecodeHelpers::ConvertToUnicodeBasedOnCharset(fileContent, charset,
unicodeContent))
{
AfxMessageBox(IDS_ERROR_IN_UNICODE_CONVERSION, MB_OK|MB_ICONERROR);
return;
}


//
// Show converted text.
//
m_txtConverted.SetWindowText(unicodeContent.c_str());


</code>

The code is not perfect and needs more testing (and the parsing
algorithm should be improved), but in simple tests I performed it seems
to me to work fine (I tried it on a Latin 1 code page, and a Cyrillic
code page).

The core functions are those in namespace HtmlDecoderHelpers (in files
HtmlDecoderHelpers.h/.cpp).

For usage example, see method CTextDecoderDlg::OnBnClickedButtonLoadText().

There is also a subfolder called "Test" with a couple of test HTML files
I used.

The "core" function implementations follow:

<code>

//////////////////////////////////////////////////////////////////////////


#include "stdafx.h" // Pre-compiled headers
#include "HtmlDecodeHelpers.h" // Module header



//=======================================================================
// Reads the content of the specified file in an array of BYTEs.
// Returns 'true' on success, 'false' on error.
//=======================================================================
bool HtmlDecodeHelpers::ReadFileInCharArray(
IN const wchar_t * filename,
OUT std::vector<BYTE> & fileContent
)
{
ASSERT( filename != NULL );

// Empty destination array
fileContent.clear();

// Open file for reading
CFile file;
if ( ! file.Open( filename, CFile::modeRead ) )
{
// Error in opening file
return false;
}

// Get file length, in bytes
ULONGLONG fileLen = file.GetLength();

// Assume that file length is not big enough (< 2GB)
ASSERT( fileLen < 0x7FFFFFFF );

// Store file size
size_t sizeInBytes = static_cast<size_t>( fileLen );

// Resize vector to store file content
fileContent.resize( sizeInBytes );

// Read file content in vector
size_t readCount = file.Read( &fileContent[0], sizeInBytes );
ASSERT( readCount == sizeInBytes );

// Close file
file.Close();

// All right
return true;
}


//=======================================================================
// Given an HTML file content, returns the 'charset' value.
// On error, returns an empty string.
//=======================================================================
std::string HtmlDecodeHelpers::ParseCharsetFromHTMLFile(
IN const std::vector<BYTE> & fileContent
)
{
//
// Find the 'charset' attribute.
//
// To do so, build a string based on file content,
// and call std::string::find method on it.
//
//
// Typical HTML charset specification is as follows:
//
// <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
//

// Build an MBCS string from file content
const char * source = reinterpret_cast<const char *>(
&fileContent[0] );
std::string fileContentString( source, fileContent.size() );

// Find 'charset=' substring
size_t charsetIndex = fileContentString.find( "charset=" );
if (charsetIndex == std::string::npos)
{
// Error: no charset specification found
return "";
}

//
// charset=
// ||||||||
// 01234567 --> len = 8
//
const size_t charsetLen = 8;
size_t charsetValueIndex = charsetIndex + charsetLen;

// Now find the " symbol, that should close the charset specification.
const char endQuote = '\"';
size_t endQuoteIndex = fileContentString.find( endQuote,
charsetValueIndex );
if ( endQuoteIndex == std::string::npos )
{
// Error: no charset specification found
return "";
}

// Extract the charset value
std::string charsetValue = fileContentString.substr(
charsetValueIndex, endQuoteIndex - charsetValueIndex );

// Return it to the caller
return charsetValue;
}


//=======================================================================
// Given a file content and a charset specification, returns a Unicode
// string obtained from the input file content string, using proper
// encoding (as specified by charset).
// Returns 'true' on success, 'false' on error.
//=======================================================================
bool HtmlDecodeHelpers::ConvertToUnicodeBasedOnCharset(
IN const std::vector<BYTE> & fileContent,
IN const std::string & charset,
OUT std::wstring & unicodeContent
)
{
// Clear output parameter
unicodeContent.clear();

// There must be something in file content
ASSERT( ! fileContent.empty() );
if ( fileContent.empty() )
return false;

// Charset must be specified
ASSERT( ! charset.empty() );
if ( charset.empty() )
return false;


//
// A list of codepage identifiers for MultiByteToWideChar is
available here:
//
// http://msdn.microsoft.com/en-us/library/dd317756(VS.85).aspx
//
//
// This is a map from charset specification to code page value,
// to be used in MultiByteToWideChar()
//
typedef std::map< std::string, UINT > CharsetCodePageMap;
CharsetCodePageMap charsetToCodePage;

charsetToCodePage["iso-8859-1"] = 28591; // ISO 8859-1 Latin 1;
Western European (ISO)
charsetToCodePage["iso-8859-2"] = 28592; // ISO 8859-2 Central
European; Central European (ISO)
charsetToCodePage["iso-8859-7"] = 28597; // ISO 8859-7 Greek

charsetToCodePage["windows-1251"] = 1251; // ANSI Cyrillic;
Cyrillic (Windows)
charsetToCodePage["koi8-u"] = 21866; // Ukrainian (KOI8-U);
Cyrillic (KOI8-U)

charsetToCodePage["utf-8"] = 65001; // Unicode (UTF-8)
charsetToCodePage["utf-7"] = 65000; // Unicode (UTF-7)

// TODO: Add more entries here...

// TODO: This map could be built statically and not each time the
function is called.


// Given codepage string identifier (in 'charset'),
// extracts the integer ID for MultiByteToWideChar
CharsetCodePageMap::const_iterator it;
it = charsetToCodePage.find( charset );
if ( it == charsetToCodePage.end() )
{
// Code page not found in table
return false;
}

// Get code page ID value
UINT codePage = it->second;



//
// Convert the original text to a Unicode string, with specified
codepage
//

// Request size of destination buffer for Unicode string
int destBufferChars = ::MultiByteToWideChar(
codePage, // code page for conversion
0, // default flags
reinterpret_cast<LPCSTR>( &fileContent[0] ), // string to convert
fileContent.size(), // size in bytes of input string
NULL, // destination Unicode buffer
0 // request size of destination buffer, in
WCHAR's
);
if (destBufferChars == 0)
{
// Failure
return false;
}

// Add +1 to destination buffer size, because we are going to
terminate it with a L'\0'
++destBufferChars;

// Allocate buffer for destination string
std::vector< WCHAR > destBuffer(destBufferChars);

// Convert string to Unicode
int conversionResult = ::MultiByteToWideChar(
codePage, // code page for conversion
0, // default flags
reinterpret_cast<LPCSTR>( &fileContent[0] ), // string to convert
fileContent.size(), // size in bytes of input string
&destBuffer[0], // destination Unicode buffer
destBufferChars // size of destination buffer, in WCHAR's
);
if (conversionResult == 0)
{
// Failure
return false;
}

// Terminate Unicode string with \0
destBuffer[destBufferChars - 1] = L'\0';

// Return the Unicode string in output parameter
unicodeContent = std::wstring(&destBuffer[0]);

// All right
return true;
}


//////////////////////////////////////////////////////////////////////////

</code>


Giovanni

From: Giovanni Dicanio on
PRMARJORAM ha scritto:
> At last its working.
>
> CStringA alpha = W2A(item.mLine.c_str());

The above should be CW2A (the W2A macro is an obsolete macro from ATL
3.0, with some problems; the new C<X>2<Y> macros from ATL 7+ should be
used).


> Assuming here processing and comparing CStringA or std::string is more
> efficient than processing CStringW or std:wstring?

I don't think that this question makes much programming sense; I mean:
if you have to represent text in Unicode UTF-16, you have to use
CStringW or std::wstring instead of CStringA/std::string.

CStringA/std::string are fine if you want to use e.g. Unicode UTF-8.

(Or maybe I misunderstood your question?)

Giovanni

From: Joseph M. Newcomer on
See below...
On Thu, 10 Sep 2009 02:07:01 -0700, PRMARJORAM <PRMARJORAM(a)discussions.microsoft.com>
wrote:

>At last its working.
>
>CStringA alpha = W2A(item.mLine.c_str());
>CStringW wide = CA2WEX<>(alpha,1251);
>
>Seems after all this I did not need to re-compile my app to UNICODE as alls
>i really needed to do was include the function call in the second line of
>code above.
****
Unicode would have been good; the error is in the W2A macro. The item_mLine.c_str() is
clearly an 8-bit string of <char> types. Note that it is generally a Really Bad Idea to
mix std::string and CString data types, because std::string gives no discernable advantage
and ends up causing confusion.

If you had simply applied the CA2WEX call to the item.mLine.c_str(), it should have
worked, because presumably the item_mLine is a std::string. The intermediate step should
not have been necessary.
>
>Assuming here processing and comparing CStringA or std::string is more
>efficient than processing CStringW or std:wstring?
****
Trivially. Seriously, the differences hardly matter. In the absence of any proof, you
can assume there is no important cost difference to using wide strings. Note that the
entire cost of the extra few bytes will be lost in the orders-of-magnitude larger delays
in individual packets from the network.
joe
****
>
>Thanks
>
>Easy when you know how. :-)
>
>
>
>"Giovanni Dicanio" wrote:
>
>> PRMARJORAM ha scritto:
>> > My application is compiled in UNICODE. I am downloading webpages using
>> > cyrillic characters for their content. Although these files themselves are
>> > ASCII.
>> [...]
>> > My problem is my CString containing this content is WCHAR and so I need to
>> > convert 2 consecutive WCHAR to a single WCHAR to then get the correct
>> > cyrillic code to display.
>>
>> I think that what I previously wrote may not be the right answer to your
>> question.
>>
>> Could it be possible for you to clarify a little better the format of
>> the input string?
>>
>> For example, in the Cyrillic code page 1251 I read here:
>>
>> http://www.fingertipsoft.com/ref/cyrillic/cp1251.html
>>
>> there is a character like an upper-case "K" (code: 202 dec, 0xCA hex).
>>
>> How is this character stored in your input string?
>> What are the values of the two WCHAR's that you want to convert to one
>> single WCHAR, in this particular case?
>>
>> Thanks,
>> Giovanni
>>
Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
From: Joseph M. Newcomer on
You don't need a char buff; you can use CStringA.
joe

On Wed, 09 Sep 2009 21:34:00 -0700, "Mihai N." <nmihai_year_2000(a)yahoo.com> wrote:

>> My application is compiled in UNICODE. I am downloading webpages using
>> cyrillic characters for their content. Although these files themselves are
>> ASCII.
>
>Then the content does not belong in a CString.
>- download the stuff in a char buffer
>- detect the encoding (from the http header or the meta tag in the buffer)
>- convert to Unicode using MultiByteToWideChar (and store in CString)
Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm