From: Jack on 12 May 2008 08:58

Hi,

I'm a little confused about how to convert text encodings in a file downloaded from the Internet to memory (via InternetReadFile()).

I download into a char buffer.

The text is UTF-8 encoded (I think).

I then parse the file into appropriate text fields.

Now, for example, when I display text from the file in an edit control I get "&#38;" where "&" is required (I have added the quotes).

Is this an artifact of the encoding, or is it a "hardcoded" HTML string which has nothing to do with the encoding?

That is, can I translate "&#38;" to "&" by using some form of MultiByteToWideChar() (or similar), or must I use some sort of HTML parser?

All I need to do is remove these text "encodings" from the display fields so that the text displays correctly (my program is in UNICODE).

How should I go about this in the most efficient manner possible?

For example, I want to convert "Hello &#38; Goodbye" to "Hello & Goodbye".

TIA

Lastly, I hope this is an appropriate group - apologies if not.

From: David Wilkinson on 12 May 2008 09:55

Jack wrote:
> Hi,
>
> I'm a little confused about how to convert text encodings in a file
> downloaded from the Internet to memory (via InternetReadFile()).
>
> I download into a char buffer.
>
> The text is UTF-8 encoded (I think).
>
> I then parse the file into appropriate text fields.
>
> Now, for example, when I display text from the file in an edit control I get
> "&#38;" where "&" is required (I have added the quotes).
>
> Is this an artifact of the encoding, or is it a "hardcoded" HTML string
> which has nothing to do with the encoding?
>
> That is, can I translate "&#38;" to "&" by using some form of MultiByteToWideChar()
> (or similar), or must I use some sort of HTML parser?
>
> All I need to do is remove these text "encodings" from the display fields so
> that the text displays correctly (my program is in UNICODE).
>
> How should I go about this in the most efficient manner possible?
>
> For example, I want to convert "Hello &#38; Goodbye" to "Hello & Goodbye".
>
> TIA
>
> Lastly, I hope this is an appropriate group - apologies if not.

Jack:

I'm not a big expert on this kind of thing, but I think you need to

(a) Get rid of these character entities; for example, replace &#38; by the byte value 38.

(b) Use MultiByteToWideChar with the CP_UTF8 code page to convert to wide-character Unicode (UTF-16).

--
David Wilkinson
Visual C++ MVP

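
Step (a) can be sketched roughly as follows. This is only an illustration, not code from the thread: the helper name DecodeBasicEntities is hypothetical, it handles just four named entities plus decimal references in the ASCII range, and it works on the raw char buffer before any conversion to UTF-16. Anything it does not recognise is copied through unchanged.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper (not from the thread): decodes "&amp;", "&lt;", "&gt;",
// "&quot;" and decimal references below 128 such as "&#38;" in a byte string.
// Unrecognised sequences are copied to the output unchanged.
std::string DecodeBasicEntities(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    size_t i = 0;
    while (i < in.size()) {
        if (in[i] == '&') {
            const size_t semi = in.find(';', i + 1);
            // Entities are short; ignore a ';' that is too far away.
            if (semi != std::string::npos && semi - i <= 8) {
                const std::string name = in.substr(i + 1, semi - i - 1);
                char decoded = 0;
                if (name == "amp") decoded = '&';
                else if (name == "lt") decoded = '<';
                else if (name == "gt") decoded = '>';
                else if (name == "quot") decoded = '"';
                else if (name.size() > 1 && name[0] == '#' &&
                         name.find_first_not_of("0123456789", 1) == std::string::npos) {
                    // Decimal reference: only ASCII values map to a single byte.
                    const long code = std::strtol(name.c_str() + 1, nullptr, 10);
                    if (code > 0 && code < 128) decoded = static_cast<char>(code);
                }
                if (decoded != 0) {
                    out.push_back(decoded);
                    i = semi + 1;  // resume after the ';'
                    continue;
                }
            }
        }
        out.push_back(in[i]);
        ++i;
    }
    return out;
}
```

Because each match resumes after the ';', text such as "&amp;amp;" is decoded only once. References above 127 are left alone here, since they would need to be written out as multi-byte UTF-8 rather than a single byte.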
From: Jack on 12 May 2008 10:31

"David Wilkinson" <no-reply(a)effisols.com> wrote in message news:%23CpXmfDtIHA.1236(a)TK2MSFTNGP02.phx.gbl...
> Jack wrote:
>> Hi,
>>
> Jack:
>
> I'm not a big expert on this kind of thing, but I think you need to
>
> (a) Get rid of these character entities; for example, replace &#38; by the
> byte value 38.
>
> (b) Use MultiByteToWideChar with the CP_UTF8 code page to convert to
> wide-character Unicode (UTF-16).
>

Hello David,

So you think I need to step through the data twice, once with an HTML parser and once to convert the data format?

From: Giovanni Dicanio on 12 May 2008 10:48

"Jack" <notaround(a)dontmail.com> wrote in message news:L42dnV6ZOr9yo7XVRVnyvQA(a)pipex.net...

> I'm a little confused about how to convert text encodings in a file
> downloaded from the Internet to memory (via InternetReadFile()).
>
> I download into a char buffer.
>
> The text is UTF-8 encoded (I think).

If you are sure that your text is UTF-8, I think that the first thing to do is to convert from UTF-8 to UTF-16 when you receive that text. This is because Windows APIs understand Unicode UTF-16. So UTF-8 is fine for transmitting data, e.g. over the Internet, but UTF-16 is the better choice for processing *inside* Windows applications.

To convert from UTF-8 to UTF-16, you can use the MultiByteToWideChar API, and you can read and use some code of mine that I shared on an MSDN forum:

MSDN Forums -> Visual C++ -> Visual C++ Language -> "Proeblem with some Unicode chars"
http://forums.microsoft.com/MSDN/ShowPost.aspx?PostID=3200146&SiteID=1

> Now, for example, when I display text from the file in an edit control I
> get
> "&#38;" where "&" is required (I have added the quotes).

Are you sure that your text is UTF-8 and not, for example, ISO 8859-1 (Latin-1)?

ISO 8859-1 (Latin-1)
http://www.utoronto.ca/webdocs/HTMLdocs/NewHTML/iso_table.html

I have sometimes found that this encoding tends to use

&#<decimal code>;

to represent some characters. For example, the "&" symbol has decimal code 38, and so can be represented as &#38;

So the first thing that you must be sure about is the kind of encoding of your text (UTF-8? ISO 8859-1 Latin-1?).

Assuming that you still have these &#<...>; substrings after conversion (e.g. from UTF-8 to UTF-16), I would parse this text, searching for occurrences of these &#...; substrings, and convert them to the corresponding characters. It is not hard. You may also use a regular-expression parser, like CAtlRegExp:

http://msdn.microsoft.com/en-us/library/k3zs4axe(VS.80).aspx

HTH,
Giovanni

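
The "search for &#...; and convert" pass on already-converted text might look something like the sketch below. It is an assumption-laden illustration rather than Giovanni's code: the name DecodeNumericRefs is hypothetical, std::u16string stands in for the UTF-16 std::wstring a Windows program would normally hold, and only decimal references are handled. Values above 0xFFFF are written as a surrogate pair, which is the one thing that differs from doing the same job on bytes.

```cpp
#include <string>

// Hypothetical helper (not from the thread): replaces decimal "&#NNN;"
// references in UTF-16 text with the character they name. Malformed or
// out-of-range references are copied through unchanged.
std::u16string DecodeNumericRefs(const std::u16string& in) {
    std::u16string out;
    out.reserve(in.size());
    size_t i = 0;
    while (i < in.size()) {
        if (in[i] == u'&' && i + 2 < in.size() && in[i + 1] == u'#') {
            size_t j = i + 2;
            unsigned long code = 0;
            size_t digits = 0;
            // Read at most 7 decimal digits (enough for 1114111 == 0x10FFFF).
            while (j < in.size() && in[j] >= u'0' && in[j] <= u'9' && digits < 7) {
                code = code * 10 + static_cast<unsigned long>(in[j] - u'0');
                ++j;
                ++digits;
            }
            const bool terminated = digits > 0 && j < in.size() && in[j] == u';';
            const bool isSurrogate = code >= 0xD800 && code <= 0xDFFF;
            if (terminated && code > 0 && code <= 0x10FFFF && !isSurrogate) {
                if (code < 0x10000) {
                    out.push_back(static_cast<char16_t>(code));
                } else {
                    // Outside the BMP: encode as a high/low surrogate pair.
                    const unsigned long v = code - 0x10000;
                    out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));
                    out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));
                }
                i = j + 1;  // resume after the ';'
                continue;
            }
        }
        out.push_back(in[i]);
        ++i;
    }
    return out;
}
```

A hand-written scan like this avoids a regular-expression dependency; whether that is preferable to CAtlRegExp is a matter of taste and of how many entity forms need to be supported.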
From: David Wilkinson on 12 May 2008 10:57

Jack wrote:
> Hello David,
>
> So you think I need to step through the data twice, once with an HTML parser
> and once to convert the data format.

Jack:

Well, I am not sure how you are extracting your "text fields". But I would think, each time you extract a text field, get rid of the HTML entities in it, and then use MultiByteToWideChar(). This is only one pass through the file.

--
David Wilkinson
Visual C++ MVP

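
For the MultiByteToWideChar() step applied to one extracted field, a minimal sketch is shown here. The function name Utf8ToWide is hypothetical. On Windows it uses the usual two-call pattern: the first call with a zero-length output buffer returns the required size, the second performs the conversion. The non-Windows branch is only a stand-in so the sketch compiles elsewhere; it stores one code point per wchar_t and does not reject overlong sequences.

```cpp
#include <stdexcept>
#include <string>

#ifdef _WIN32
#include <windows.h>

// Windows path: size the buffer with a first call, then convert.
std::wstring Utf8ToWide(const std::string& utf8) {
    if (utf8.empty()) return std::wstring();
    const int srcLen = static_cast<int>(utf8.size());
    // With cchWideChar == 0 the function returns the number of wchar_t needed.
    const int wideLen = MultiByteToWideChar(CP_UTF8, 0, utf8.data(), srcLen,
                                            nullptr, 0);
    if (wideLen <= 0) throw std::runtime_error("UTF-8 conversion failed");
    std::wstring wide(static_cast<size_t>(wideLen), L'\0');
    if (MultiByteToWideChar(CP_UTF8, 0, utf8.data(), srcLen,
                            &wide[0], wideLen) != wideLen)
        throw std::runtime_error("UTF-8 conversion failed");
    return wide;
}
#else
// Stand-in for other platforms (an assumption, not part of the thread):
// decodes UTF-8 by hand and stores each code point in one wchar_t.
std::wstring Utf8ToWide(const std::string& utf8) {
    std::wstring wide;
    size_t i = 0;
    while (i < utf8.size()) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        unsigned long cp;
        size_t extra;  // number of continuation bytes that follow the lead byte
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::runtime_error("invalid UTF-8 lead byte");
        if (i + extra >= utf8.size())
            throw std::runtime_error("truncated UTF-8 sequence");
        for (size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(utf8[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        wide.push_back(static_cast<wchar_t>(cp));
        i += extra + 1;
    }
    return wide;
}
#endif
```

Passing the explicit byte count rather than -1 means the field does not have to be null-terminated, which suits fields sliced out of a larger download buffer.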