From: Jack on
Hi,

I'm a little confused about how to convert text encodings in a file
downloaded from the Internet to memory (via InternetReadFile()).

I download into a char buffer.

The text is UTF-8 encoded (I think).

I then parse the file into appropriate text fields.

Now, for example, when I display text from the file in an edit control I get
"&#38;" where "&" is required (I have added the quotes).

Now, is this an artifact of the encoding or is it a "hardcoded" html string
which has nothing to do with the encoding?

ie can I translate "&#38;" to "&" by using some form of MultiByteToWideChar()
(or similar) or must I use some sort of HTML parser?

All I need to do is remove these text "encodings" from the display fields so
that text displays correctly (my program is in UNICODE)

How should I go about this in the most efficient manner possible?

eg I want to convert "Hello &#38; Goodbye" to "Hello & Goodbye"

TIA

Lastly, I hope this is an appropriate group - apologies if not.



From: David Wilkinson on
Jack wrote:
> Hi,
>
> I'm a little confused about how to convert text encodings in a file
> downloaded from the Internet to memory (via InternetReadFile()).
>
> I download into a char buffer.
>
> The text is UTF-8 encoded (I think).
>
> I then parse the file into appropriate text fields
>
> Now, for example, when I display text from the file in an edit control I get
> "&#38;" where "&" is required (I have added the quotes).
>
> Now, is this an artifact of the encoding or is it a "hardcoded" html string
> which has nothing to do with the encoding?
>
> ie can I translate "&#38;" to "&" by using some form of MultiByteToWideChar()
> (or similar) or must I use some sort of HTML parser?
>
> All I need to do is remove these text "encodings" from the display fields so
> that text displays correctly (my program is in UNICODE)
>
> How should I go about this in the most efficient manner possible?
>
> eg I want to convert "Hello &#38; Goodbye" to "Hello & Goodbye"
>
> TIA
>
> Lastly, I hope this is an appropriate group - apologies if not.

Jack:

I'm not a big expert on this kind of thing, but I think you need to

(a) Get rid of these character entities; for example replace &#38; by the byte
value 38.

(b) Use MultiByteToWideChar with the CP_UTF8 code page to convert to wide
character Unicode (UTF-16).
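
A note on step (a): "&#38;" is an HTML numeric character reference, so it is independent of the byte encoding and MultiByteToWideChar() will not touch it. Below is a minimal sketch of step (a) done on the raw char buffer, assuming the buffer is UTF-8 and only decimal references of the form &#NNN; need handling (named entities such as &amp; and hex references are left alone). The names AppendUtf8 and DecodeDecimalEntities are made up for illustration. Writing the plain byte value works for codes below 128; for larger codes the sketch emits the UTF-8 byte sequence so that step (b) still sees valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: appends the UTF-8 encoding of one code point to 'out'.
static void AppendUtf8(std::string& out, unsigned long cp)
{
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: replaces decimal references such as "&#38;" with the
// character they name. Everything else is copied through unchanged.
std::string DecodeDecimalEntities(const std::string& in)
{
    std::string out;
    out.reserve(in.size());

    std::size_t i = 0;
    while (i < in.size()) {
        if (in[i] == '&' && i + 2 < in.size() && in[i + 1] == '#') {
            std::size_t j = i + 2;
            unsigned long cp = 0;
            // Accumulate digits; stop once the value exceeds U+10FFFF.
            while (j < in.size() && in[j] >= '0' && in[j] <= '9' && cp <= 0x10FFFF) {
                cp = cp * 10 + static_cast<unsigned long>(in[j] - '0');
                ++j;
            }
            const bool hasDigits = j > i + 2;
            const bool isSurrogate = cp >= 0xD800 && cp <= 0xDFFF;
            if (hasDigits && j < in.size() && in[j] == ';' &&
                cp != 0 && cp <= 0x10FFFF && !isSurrogate) {
                AppendUtf8(out, cp);
                i = j + 1;      // skip past the ';'
                continue;
            }
        }
        out += in[i];           // not a well-formed reference: copy as is
        ++i;
    }
    return out;
}
```

With this, DecodeDecimalEntities("Hello &#38; Goodbye") gives "Hello & Goodbye", and the result can then be passed to MultiByteToWideChar() as in step (b).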

--
David Wilkinson
Visual C++ MVP


From: Jack on

"David Wilkinson" <no-reply(a)effisols.com> wrote in message
news:%23CpXmfDtIHA.1236(a)TK2MSFTNGP02.phx.gbl...
> Jack wrote:
>> Hi,
>>

> Jack:
>
> I'm not a big expert on this kind of thing, but I think you need to
>
> (a) Get rid of these character entities; for example replace &#38; by the
> byte value 38.
>
> (b) Use MultiByteToWideChar with the CP_UTF8 code page to convert to wide
> character unicode (UTF16).
>

Hello David,

So you think I need to step through the data twice, once with an HTML parser
and once to convert the data format?


From: Giovanni Dicanio on

"Jack" <notaround(a)dontmail.com> wrote in message
news:L42dnV6ZOr9yo7XVRVnyvQA(a)pipex.net...


> I'm, a little confused about how to convert text encodings in a file
> downloaded from the Internet to memory (via InternetReadFile()).
>
> I download into a char buffer.
>
> The text is UTF-8 encoded (I think).

If you are sure that your text is UTF-8, I think the first thing to do is to
convert it from UTF-8 to UTF-16 as soon as you receive it.
This is because Windows APIs work natively with UTF-16. So UTF-8 is fine
for transmitting data, e.g. over the Internet, but UTF-16 is the better choice
for processing *inside* Windows applications.

To convert from UTF-8 to UTF-16, you can use the MultiByteToWideChar API, and
you can read and use some code of mine that I shared on an MSDN forum:

MSDN Forums -> Visual C++ -> Visual C++ Language -> "Proeblem with some
Unicode chars"

http://forums.microsoft.com/MSDN/ShowPost.aspx?PostID=3200146&SiteID=1
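
For readers who cannot reach that post, here is a rough sketch of the usual two-call pattern with MultiByteToWideChar. It is not the code from the forum thread. The helper name Utf8ToUtf16 is made up, the sketch assumes the buffer is smaller than INT_MAX bytes, and it only compiles on Windows (hence the _WIN32 guard). The MB_ERR_INVALID_CHARS flag makes the call fail on malformed input; it is assumed here that the target is Windows Vista or later, since older systems may require 0 as the flags value with CP_UTF8.

```cpp
#include <stdexcept>
#include <string>

#ifdef _WIN32
#include <windows.h>

// Hypothetical helper: converts a UTF-8 byte string to a UTF-16 std::wstring.
std::wstring Utf8ToUtf16(const std::string& utf8)
{
    if (utf8.empty())
        return std::wstring();

    const int sourceLength = static_cast<int>(utf8.size());

    // First call: no output buffer, just ask how many wchar_t units are needed.
    const int length = ::MultiByteToWideChar(
        CP_UTF8, MB_ERR_INVALID_CHARS,
        utf8.data(), sourceLength,
        nullptr, 0);
    if (length == 0)
        throw std::runtime_error("Input is not valid UTF-8");

    // Second call: convert into a buffer of exactly that size.
    std::wstring utf16(static_cast<std::size_t>(length), L'\0');
    const int written = ::MultiByteToWideChar(
        CP_UTF8, MB_ERR_INVALID_CHARS,
        utf8.data(), sourceLength,
        &utf16[0], length);
    if (written == 0)
        throw std::runtime_error("UTF-8 to UTF-16 conversion failed");

    return utf16;
}
#endif
```

Passing the explicit byte count (rather than -1) means the buffer does not need to be NUL-terminated, which suits data read with InternetReadFile().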


> Now, for example, when I display text from the file in an edit control I
> get
> "&#38;" where "&" is required (I have added the quotes).

Are you sure that your text is UTF-8, and not, for example ISO 8859-1
(Latin-1) ?

ISO 8859-1 (Latin-1)

http://www.utoronto.ca/webdocs/HTMLdocs/NewHTML/iso_table.html

I found sometimes that this encoding tends to use:

&#<decimal code>;

to represent some characters, for example: the "&" symbol has decimal code
38, and so can be represented as

&#38;

So, the first thing that you must be sure about is the kind of encoding of
your text (UTF-8? ISO 8859-1 Latin-1?)

Assuming that you still have these &#<...>; substrings after conversion
(e.g. from UTF-8 to UTF-16), I would parse this text, searching for
occurrences of these &#...; substrings, and convert them to corresponding
characters.
It is not hard. You may also use a regular-expression parser, like
CAtlRegExp:

http://msdn.microsoft.com/en-us/library/k3zs4axe(VS.80).aspx
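
As an illustration of the regular-expression route, here is a sketch that uses the standard std::wregex instead of CAtlRegExp (so it assumes a C++11 compiler). It works on text that has already been converted to a wide string and handles decimal references only. The function name DecodeDecimalEntitiesWide is made up for the example. Code points above U+FFFF are written as a surrogate pair when wchar_t is 16 bits wide, as it is on Windows.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: replaces decimal references such as L"&#38;" in a
// wide string. Invalid or out-of-range references are left untouched.
std::wstring DecodeDecimalEntitiesWide(const std::wstring& in)
{
    // At most 7 digits, so the number always fits in an unsigned long.
    static const std::wregex pattern(L"&#([0-9]{1,7});");

    std::wstring out;
    std::wstring::const_iterator last = in.begin();

    for (std::wsregex_iterator it(in.begin(), in.end(), pattern), end;
         it != end; ++it) {
        const std::wsmatch& m = *it;
        out.append(last, m[0].first);               // text before the match

        const unsigned long cp = std::stoul(m[1].str());
        const bool valid = cp != 0 && cp <= 0x10FFFF &&
                           !(cp >= 0xD800 && cp <= 0xDFFF);
        if (!valid) {
            out.append(m[0].first, m[0].second);    // keep the reference as is
        } else if (cp >= 0x10000 && sizeof(wchar_t) == 2) {
            // Outside the BMP: encode as a UTF-16 surrogate pair.
            const unsigned long v = cp - 0x10000;
            out += static_cast<wchar_t>(0xD800 + (v >> 10));
            out += static_cast<wchar_t>(0xDC00 + (v & 0x3FF));
        } else {
            out += static_cast<wchar_t>(cp);
        }
        last = m[0].second;
    }
    out.append(last, in.end());                     // text after the last match
    return out;
}
```

For example, DecodeDecimalEntitiesWide(L"Hello &#38; Goodbye") returns L"Hello & Goodbye".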

HTH,
Giovanni


From: David Wilkinson on
Jack wrote:
> Hello David,
>
> So you think I need to step through the data twice, once with an HTML parser
> and once to convert the data format?

Jack:

Well, I am not sure how you are extracting your "text fields". But I would
think that, each time you extract a text field, you get rid of the HTML
entities in it and then call MultiByteToWideChar() on the result. That is
still only one pass through the file.

--
David Wilkinson
Visual C++ MVP