From: RB on
Hello, Joe was nice enough to educate me about a void in my awareness of
Unicode ramifications. I am in the process of trying to install the safer
strsafe libs and includes. But more pertinent to me is that I am still
struggling to get underneath all aspects of this.
I remember reading years ago (when computers usually had less than
20 MB of RAM) that a machine word was the width of the computer registers,
usually matching the width of memory at a single address. I.e. a given address
would be, say, 2 bytes long in real mode or 4 bytes long in protected mode. And
that ints could be defined differently on different machine widths depending on how
the compiler translated the int declaration down into the machine language of the
word. And compilers had to be aware of which platform they were compiling for.
As for old byte-sized chars and the newer 2- and 4-byte Unicode chars, the scenario
deepens for me. It would appear that the char/Unicode distinction is not machine-specific
but rather OS- and/or language-dependent. But I am mainly concerned at this point
with when and how this would affect my code running on a Windows OS.
If I do not code for Unicode and I copy a string into a font structure on Windows,
I have been made aware that this is dangerous. But I still have trouble understanding
exactly what is going on. Is it that in newer Windows OS structures (like fonts)
Windows has coded them in wide format so they can accept non-Unicode and
Unicode as needed? And this affects my string copy to said struct.... but I still
cannot see exactly what is going on (or why). In other words, if I am not coding
for a Unicode character language, why must I still be concerned about Unicode?
so I have the following questions:
1. For the following code (said to be unsafe)
strcpy(NewFontLogStruct.lfFaceName, "Courier New");

Is the following (a groping, self-created hack) any safer?
char holder [ (sizeof(NewFontLogStruct.lfFaceName)) ] = "Courier New";
for(int i = 0; i < (sizeof(NewFontLogStruct.lfFaceName)); i++)
NewFontLogStruct.lfFaceName[i] = holder[i];

2. Could someone direct me to a bibliography that would illuminate a dummy like
me on the ramifications more clearly so I can fully understand the ins and outs of
this ? (or feel free to try and explain it in brief if possible)
I.e. Joe has told me to
>Never use 8-bit characters or assume they exist, except in
>exceedingly rare and exotic circumstances, of which this is most
>definitely not an example.
But if I look at a character in a hex editor on my machine (from a text file)
it is only one byte in size, so obviously Joe (fantastic helpful guy that he is)
is talking over my grasp of the situation. Hopefully I can learn this eventually
to bring myself out of unicode darkness.





From: Giovanni Dicanio on
"RB" <NoMail(a)NoSpam> ha scritto nel messaggio
news:OWzRtKKtKHA.4636(a)TK2MSFTNGP06.phx.gbl...

> Is it that in newer Windows OS structures (like fonts)
> Windows has coded them in wide format so they can accept non-Unicode and
> Unicode as needed? And this affects my string copy to said struct.... but I
> still cannot see exactly what is going on (or why). In other words, if I am
> not coding for a Unicode character language, why must I still be concerned
> about Unicode?

There are two "LOGFONT" definitions: LOGFONTA (using CHAR lfFaceName[...])
and LOGFONTW (using WCHAR lfFaceName[...]).
LOGFONTA uses the old style ANSI/MBCS char's, LOGFONTW uses Unicode (UTF-16)
wchar_t's.
If you are building in Unicode mode and UNICODE preprocessor macro is
defined, then LOGFONT is typedef'ed as LOGFONTW.
Instead, if you are building in ANSI/MBCS (UNICODE preprocessor macro is not
defined), then LOGFONT is typedef'ed as LOGFONTA.
You can read all of that in <wingdi.h> Win32 header file.
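Conceptually, the pattern is just this (a simplified sketch, not the literal header text):

typedef struct tagLOGFONTA { /* ...other fields... */ CHAR  lfFaceName[LF_FACESIZE]; } LOGFONTA;
typedef struct tagLOGFONTW { /* ...other fields... */ WCHAR lfFaceName[LF_FACESIZE]; } LOGFONTW;

#ifdef UNICODE
typedef LOGFONTW LOGFONT;
#else
typedef LOGFONTA LOGFONT;
#endif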

> so I have the following questions:
> 1. For the following code (said to be unsafe)
> strcpy(NewFontLogStruct.lfFaceName, "Courier New");

If you use VS2005 or later, the line above can be made secure in C++ source code,
because strcpy can be expanded to a proper form of strcpy_s thanks to the C++
template overloads in the secure CRT.
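If you want to be explicit about it, strcpy_s itself has a template overload that deduces
the destination array size (a minimal sketch; 'lf' is just a local LOGFONTA here, and the
usual <windows.h>/<string.h>/<stdlib.h> headers are assumed):

LOGFONTA lf = { 0 };
strcpy_s(lf.lfFaceName, "Courier New");                          // size deduced from the array
strcpy_s(lf.lfFaceName, _countof(lf.lfFaceName), "Courier New"); // or with an explicit count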


> 2. Could someone direct me to a bibliography that would illuminate a dummy
> like me on the ramifications more clearly so I can fully understand the ins
> and outs of this ? (or feel free to try and explain it in brief if possible)

About Unicode, there is an interesting article here:

"The Absolute Minimum Every Software Developer Absolutely, Positively Must
Know About Unicode and Character Sets (No Excuses!)"
http://www.joelonsoftware.com/articles/Unicode.html

Mihai Nita's blog is a must read as well:
http://www.mihai-nita.net/

Couple of posts on "The Old New Thing" blog by Raymond Chen:

http://blogs.msdn.com/oldnewthing/archive/2004/02/12/71851.aspx
http://blogs.msdn.com/oldnewthing/archive/2004/07/15/184076.aspx

And you can't miss these articles on CodeProject written by Mike Dunn:

The Complete Guide to C++ Strings, Part I - Win32 Character Encodings
http://www.codeproject.com/KB/string/cppstringguide1.aspx

The Complete Guide to C++ Strings, Part II - String Wrapper Classes
http://www.codeproject.com/KB/string/cppstringguide2.aspx

HTH,
Giovanni



From: Joseph M. Newcomer on
On Tue, 23 Feb 2010 11:09:32 -0500, "RB" <NoMail(a)NoSpam> wrote:

> Hello, Joe was nice enough to educate me about a void in my awareness of
>Unicode ramifications. I am in the process of trying to install the safer
>strsafe libs and includes. But more pertinent to me is that I am still
>struggling to get underneath all aspects of this.
> I remember reading years ago (when computers usually had less than
>20 MB of RAM) that a machine word was the width of the computer registers,
****
Generally, this has been true. Linguistically, it was always a colossal failure that C
tied its concept of "int" to the machine registers, which resulted in several
disasters when we moved from 16-bit to 32-bit; when Microsoft moved to 64-bit, they
retained int as 32-bit instead of making it 64-bit, which was a great idea.
****
>usually matching the width of memory at a single address. I.e. a given address
>would be, say, 2 bytes long in real mode or 4 bytes long in protected mode.
****
Well, it wasn't that simple, but it will do for now.
****
>And
>that ints could be defined differently on different machine widths depending on how
>the compiler translated the int declaration down into the machine language of the
>word. And compilers had to be aware of which platform they were compiling for.
> As for old byte-sized chars and the newer 2- and 4-byte Unicode chars, the scenario
>deepens for me. It would appear that the char/Unicode distinction is not machine-specific
>but rather OS- and/or language-dependent.
****
No. The definition of Unicode is independent of all platforms, languages, and operating
systems.

What matters is the "encoding". For example, Windows, as an operating system, only
accepts the UTF-16LE encoding of Unicode, which means that Unicode characters that require
more than 16 bits to identify them require two UTF-16 code units (this is the
"surrogate pair" encoding). And the Microsoft C compiler defines the ANSI/ISO "wchar_t" type
as a 16-bit value. So you are constrained, in using these environments, to using a
specific encoding of Unicode.
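For example (a small self-contained sketch; the particular character chosen is arbitrary):

#include <stdio.h>
#include <wchar.h>

int main()
{
    // U+1F600 lies outside the Basic Multilingual Plane, so in UTF-16LE it
    // occupies TWO wchar_t code units: the surrogate pair 0xD83D, 0xDE00.
    wchar_t smiley[] = { 0xD83D, 0xDE00, 0 };
    printf("code units: %u\n", (unsigned)wcslen(smiley));   // prints 2 for ONE character
    return 0;
}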
*****
>But I am mainly concerned at this point
>with when and how this would affect my code running on a Windows OS.
> If I do not code for Unicode and I copy a string into a font structure on Windows,
>I have been made aware that this is dangerous. But I still have trouble understanding
>exactly what is going on.
****
There are two symbols which are either both defined or both undefined. One of them is
_UNICODE, which controls the C runtime headers (such as <tchar.h>), and the other is
UNICODE, which controls the Windows headers. In practice it doesn't matter which is which,
because if you define only one of them and not the other, your program probably won't compile.
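In practice you set both in the project settings; spelled out by hand (only as a sketch to
show the pairing, and only valid before any headers are included), it would look like:

// Normally set via the project settings or /DUNICODE /D_UNICODE on the compiler
// command line; if defined in source, they must precede the headers.
#define UNICODE       // selects the -W forms of the Windows APIs and structures
#define _UNICODE      // selects the wide mappings in <tchar.h> and the C runtime
#include <windows.h>
#include <tchar.h>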

Any API whose name involves a string does not actually exist under that name. For example,
there is no CreateFile entry point. Instead, there are *two* distinct API entry points:
CreateFileA: takes an 8-bit character string name
CreateFileW: takes a 16-bit UTF-16LE-encoded character string name

In the case of CreateFont, there are not only two entry points, CreateFontA and
CreateFontW, but two data structures: LOGFONTA and LOGFONTW. When you compile with
Unicode disabled (neither UNICODE nor _UNICODE defined), then your sequence
LOGFONT lf;
...
font.CreateFontIndirect(&lf);
where the method is implemented as
BOOL CFont::CreateFontIndirect(LOGFONT * lf) { return Attach(::CreateFontIndirect(lf)); }
then your code looks like
LOGFONTA lf;
font.CreateFontIndirect(&lf);
with the method actually defined (as far as the compiler sees) as:
BOOL CFont::CreateFontIndirect(LOGFONTA * lf) { return Attach(::CreateFontIndirectA(lf)); }
but if you compile with UNICODE/_UNICODE defined, you get
LOGFONTW lf;
...
font.CreateFontIndirect(&lf);
where it is defined as
BOOL CFont::CreateFontIndirect(LOGFONTW * lf) { return Attach(::CreateFontIndirectW(lf)); }

Note that all Windows does when you call a -A entry point is convert the strings to
Unicode and effectively call the -W entry point.
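You can see the two real entry points by calling them explicitly; both of these compile in
either build mode (a sketch; the file name is just an example):

HANDLE hA = ::CreateFileA("example.txt",  GENERIC_READ, 0, NULL,
                          OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
HANDLE hW = ::CreateFileW(L"example.txt", GENERIC_READ, 0, NULL,
                          OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);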
*****
>Is it that in newer Windows OS structures (like fonts)
>Windows has coded them in wide format so they can accept non-Unicode and
>Unicode as needed?
****
The fonts may or may not actually have Unicode characters in them. Many do, but it is up
to the font designer to have decided which characters to include. The "Arial Unicode MS"
font really does have most of the Unicode characters in it, but you can't get at them
unless your app is Unicode, or you explicitly call the -W functions and pass in a wide
character string.
****
>And this affects my string copy to said struct.... but I still
>cannot see exactly what is going on (or why). In other words, if I am not coding
>for a Unicode character language, why must I still be concerned about Unicode?
****
Go back to my original question: when your manager walks in and says "We need the app in
Unicode" what are you going to answer? I've been coding Unicode-aware since about 1996,
and several of my apps were converted to Unicode by simple recompilation with
UNICODE/_UNICODE defined, and worked perfectly the first time. In a few cases, I had not
bothered with making everything Unicode-compliant (hard to do in VS6) but the three or six
lines that required work failed to compile, and the fixes were essentially trivial. In
VS > 6 they truly are trivial, because VS > 6 MFC supports two string types, CStringA
(which is always a CString of 8-bit characters) and CStringW (which is always a CString of
Unicode UTF-16LE characters), making it truly trivial to support mixed modes (necessary
when dealing with embedded systems, some network protocols, etc.). If you always code
Unicode-aware, then when you have to create a Unicode app, you already have all the good
programming habits, styles, etc. to make it work. And since all of your coding at that
point is Unicode-aware, you can convert it INSTANTLY and have a high confidence that it
will work perfectly correctly!
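For example, mixing the two explicit types is straightforward (a minimal sketch; the data is
made up):

CStringA narrow("raw 8-bit bytes from some protocol");   // always 8-bit characters
CStringW wide(narrow);      // always UTF-16; converted using the ANSI code page
CString  generic(narrow);   // follows the build mode: CStringA or CStringW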

The "T"-types (no, I have NO IDEA why "T" figures so prominently) have definitions based
on the settings of these symbols, we find that the representations are

Declaration no [_]UNICODE [_]UNICODE
TCHAR char wchar_t
WCHAR wchar_t wchar_t
CHAR char char
LPTSTR char * wchar_t *
LPWSTR whcar_t * wchar_t *
LPSTR char * char *
CString CStringA CStringW [VS > 6 only]
CStringA CStringA CStringA
CStringW CStringW CStringW
_ftprintf fprintf wprintf
_tcscmp strcmp wcscmp

The use of _T creates literals of the right type. If you don't use it, then the literal
is as you declare it, and is 8-bit or Unicode regardless of your compilation mode:

_T("x")         "x"                        L"x"
_T('x')         'x'                        L'x'
"x"             "x"                        "x"
'x'             'x'                        'x'
L"x"            L"x"                       L"x"
L'x'            L'x'                       L'x'


given
    TCHAR buffer[2];

                        [_]UNICODE not defined     [_]UNICODE defined
_countof(buffer)        2                          2
sizeof(buffer)          2                          4
sizeof(TCHAR)           1                          2

LPSTR s = "x"           works                      works
LPWSTR s = L"x"         works                      works
LPTSTR s = _T("x")      works                      works

LPTSTR s = "x"          works                      compilation error
LPTSTR s = L"x"         compilation error          works
LPSTR s = _T("x")       works                      compilation error
LPWSTR s = "x"          compilation error          compilation error
LPSTR s = L"x"          compilation error          compilation error

For any API:
AnyApi(LPTSTR) AnyApiA(LPSTR) AnyApiW(LPWSTR)

if the API takes a pointer to a struct that has a string, we have in the header file (look
at winuser.h, winbase.h, wingdi.h, or pretty much any Windows header file)

typedef struct { LPSTR p; int x; } SomeStructA;
typedef struct { LPWSTR p; int x; } SomeStructW;
void SomeAPIW(SomeStructW * p);
void SomeAPIA(SomeStructA * p);

#ifdef UNICODE
#define SomeStruct SomeStructW
#define SomeAPI SomeAPIW
#else
#define SomeStruct SomeStructA
#define SomeAPI SomeAPIA
#endif

Learning the correct programming style is *not* done when your manager asks you to convert
200K lines of source to Unicode. It is done when you first start programming.
*****
> so I have the following questions:

>1. For the following code (said to be unsafe)
>strcpy(NewFontLogStruct.lfFaceName, "Courier New");
****
Assume strcpy is ALWAYS unsafe. ALWAYS. NEVER use it ANYWHERE, for ANY REASON
WHATSOEVER. This has nothing to do with Unicode, and everything to do with safe
programming methodologies. strcpy, strcat and sprintf are always-and-forever deadly, and
should never, ever be used in modern programming. They are archaic leftovers from an era
when software safety was considered fairly unimportant. Years of virus infestation have
made us conscious of the fact that these are no longer acceptable.

So while there are Unicode versions wcscpy, wcscat, and swprintf, and Unicode-aware
versions (look in tchar.h) _tcscpy, _tcscat, and _stprintf, these are equally unsafe and
must never be used.

Do you remember "Code Red"? It got in by a buffer overrun caused by a strcpy that didn't
check bounds. Hundreds of thousands of machines were infested in a small number of hours.
[For the person who always says, "Joe, you keep saying things have been broken. How did
we survive all those years if everything was as broken as you claim?" the answer was that
we didn't, and the number of virus infestations that occur because of failure to check
buffer bounds is testimony to the fact that things really WERE broken. Mostly, we had
apps that crashed. That's no longer the case. We now have mission-critical servers and
critical corporate data placed at-risk due to these bad practices. Denial-of-service,
data corruption, and industrial espionage are among the risks.]
****
>
> Is the following (a groping, self-created hack) any safer?
>char holder [ (sizeof(NewFontLogStruct.lfFaceName)) ] = "Courier New";
> for(int i = 0; i < (sizeof(NewFontLogStruct.lfFaceName)); i++)
> NewFontLogStruct.lfFaceName[i] = holder[i];
****
No. These are unnecessary, and the code is in fact incorrect. You don't need a holder at
all, for example, and it would be inappropriate to introduce a gratuitous variable for
this purpose. Your copy is overkill, because you only need to copy up to the NUL. It is
also erroneous, in that it fails to NUL-terminate a string that is exactly as long as
_countof(lfFaceName), or longer, resulting in incorrect behavior when the data is used in
the future.

Furthermore, you have still assumed that you are using 8-bit characters and that sizeof()
is the correct approach. This code will NOT work with Unicode.

I showed you the correct code. Use either _tcscpy_s or StringCchCopy. If you want to
write the above code correctly (although it is all completely unnecessary) it would be

TCHAR holder[sizeof(NewFontLogStruct.lfFaceName)/sizeof(TCHAR)] = _T("Courier New");
for(int i = 0; i < sizeof(NewFontLogStruct.lfFaceName)/sizeof(TCHAR); i++)
   {
    NewFontLogStruct.lfFaceName[i] = holder[i];
    if(holder[i] == _T('\0'))
        break;
   }
NewFontLogStruct.lfFaceName[ (sizeof(NewFontLogStruct.lfFaceName)/sizeof(TCHAR)) - 1] =
    _T('\0');

Notice how much easier it is to use _tcscpy_s or StringCchCopy!
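Concretely, the two recommended calls would look like this (a sketch; _tcscpy_s reports an
error if the string does not fit, while StringCchCopy, from <strsafe.h>, truncates if
necessary and always NUL-terminates):

_tcscpy_s(NewFontLogStruct.lfFaceName, _countof(NewFontLogStruct.lfFaceName),
          _T("Courier New"));
StringCchCopy(NewFontLogStruct.lfFaceName, _countof(NewFontLogStruct.lfFaceName),
              _T("Courier New"));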

It is useful to do the following

#ifndef _countof
#define _countof(x) (sizeof(x) / sizeof((x)[0]))
#endif

This works for all versions before VS2008 and doesn't do anything in VS2008, where _countof
is already defined.

Then you could write

TCHAR holder[ _countof(NewFontLogStruct.lfFaceName) ] = _T("Courier New");
for(int i = 0; i < _countof(NewFontLogStruct.lfFaceName); i++)
   {
    NewFontLogStruct.lfFaceName[i] = holder[i];
    if(holder[i] == _T('\0'))
        break;
   }
NewFontLogStruct.lfFaceName[ _countof(NewFontLogStruct.lfFaceName) - 1] = _T('\0');

Now why the "break" statement? Because consider the case where you have two pages:

| Courier New\0|###################|

where ##### is a page that actually does not exist. If you try to copy more characters
than the string "Courier New" (including the terminal NUL) contains, then you will take an
access fault. So you MUST terminate the copy at the NUL character. Maybe you can get away with it
in the case of a local variable, but it is not at all good policy, and because of the
potential error, should not be written that way.

In particular, you don't need the local variable, you could have written
LPTSTR holder = _T("Courier New");
and then the above error would be potentially fatal.
****
>
>2. Could someone direct me to a bibliography that would illuminate a dummy like
>me on the ramifications more clearly so I can fully understand the ins and outs of
>this ? (or feel free to try and explain it in brief if possible)
>I.e. Joe has told me to
>>Never use 8-bit characters or assume they exist, except in
>>exceedingly rare and exotic circumstances, of which this is most
>>definitely not an example.
*****
This deals with the notion of always creating programs that represent absolutely best
practice. Sure, it's "safe" to use 8-bit characters as a way of life, until the day you
land the Chinese, Korean, or Japanese software contract. Then you find your future as an
employee of that company severely at risk. And it also means that if you have carefully
written code assuming that sizeof(buffer) == number-of-characters-in-buffer, your code is
riddled with fatal errors.

Consider the following
TCHAR buffer[SOMESIZE];
SomeAPI(buffer, sizeof(buffer));

this works ONLY in 8-bit characters. Suppose SOMESIZE is 20. You get
SomeAPI(buffer, 20);

meaning there is space in the buffer for 20 characters. But when you convert to Unicode,
the call becomes
TCHAR buffer[20];
SomeAPI(buffer, 40);

so you tell the OS it has 40 character positions, when in fact you only have 20. The
correct code is
SomeAPI(buffer, sizeof(buffer)/sizeof(TCHAR));
or
SomeAPI(buffer, _countof(buffer));

which, independent of compilation mode, ALWAYS compiles correctly, and will compile as
either
SomeAPIA(buffer, 20);
or
SomeAPIW(buffer, 20);
because all APIs that take strings are fictional; only the -A and -W forms actually exist.
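Here's the same idea with a real API (a sketch; it assumes hWnd is a valid window handle):

TCHAR title[128];
::GetWindowText(hWnd, title, _countof(title));   // pass a count of TCHARs, not bytes
// expands to GetWindowTextA or GetWindowTextW depending on the build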

Similarly, if you do WriteFile, which writes BYTES, then you have to write, for example
LPTSTR data;
....
::WriteFile(hFile, data, _tcslen(data) * sizeof(TCHAR), &bytesWritten, NULL);
because you have to convert character counts (_tcslen) to byte counts (required by
WriteFile). If you always code this way, the actual conversion to Unicode is often a
recompilation. If you just use _tcslen, then it works correctly for 8-bit apps but only
writes HALF the text for Unicode.

The issues are not "immediate" safety of an 8-bit app, but the ultimate safety if it is
converted to Unicode. Doing a Unicode conversion of 200K lines which were never written
Unicode-aware is a tedious, perilous operation which may result in unexpected fatal errors
including application crashes, security problems which may arise from buffer overruns,
subtle data corruption failures (e.g., if WriteFile only wrote half the text and nobody
noticed for a year...)

VS2005 and later by default generate Unicode apps; if you have the correct programming
habits, your code will naturally flow even when you upgrade from 8-bit to Unicode. Pieces
of code you write can be used in more modern environments. It's just Good Programming
Style.
*****
>But if I look at a character in a hex editor on my machine (from a text file)
>it is only one byte in size, so obviously Joe (fantastic helpful guy that he is)
>is talking over my grasp of the situation. Hopefully I can learn this eventually
>to bring myself out of unicode darkness.
****
If you compiled with UNICODE/_UNICODE undefined, then by default you get 8-bit characters.
If you compile with UNICODE/_UNICODE defined, you will see that your characters are
16-bit. But you will have a ton of errors until you recode as Unicode-aware. Then you
can compile in either mode. But you are developing the right programming habits for the
"real world" of commercial application design, and modern MFC programming.
joe
****
Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
From: RB on
Thanks Giovanni for the information; I will go over these links. It is going to
take me a while to decipher Joe's reply, but I will reply to him later.
I feel like I am starting to get a bit more comprehension of how this affects
things, but I am still assembling all the pieces.

> If you use VS2005 or later, strcpy can be expanded to a proper form of strcpy_s thanks to the C++ template overloads in the secure CRT.



From: RB on
> tchar.h has these automatic, that is, if you want to check a character for alphabetic,
> you would call _istalpha(...)
> which will work if the build is either Unicode or 8-bit, whereas
> isalpha(...)
> works correctly only if the character is 8-bit, and
> iswalpha(...)
> works correctly only if the character is Unicode (but if you call setlocale correctly,
> it will handle alphabetic characters in other languages).

Ok, this sounds good; some of the work can be done for me if I learn enough.
Before, I was individually writing separate code sections called from alternating
areas in my code, like:
#ifdef UNICODE
    iswalpha(...)
#else
    isalpha(...)
#endif
and all the code I wrote in each section, depending upon the returns, was getting
to be too much for me. But from what you are saying, it would appear that if I
educate myself some more on the Text Routine Mappings of the TCHAR type,
I could just call _istalpha(...) [ which on my system maps to _ismbcalpha(...) ]
and then only write the "one and only" return-handling routine. This sounds very good.
Heck yes it does.
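For example (just my own rough sketch of what I think that single routine could look like):

#include <tchar.h>

// _istalpha() maps to isalpha()/_ismbcalpha() in an ANSI/MBCS build
// and to iswalpha() in a Unicode build, so one routine covers both.
bool IsAlphaChar(TCHAR ch)
{
    return _istalpha(ch) != 0;
}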

So if I declare all my character variables as TCHAR, a lot of the mapping will be
done for me, depending on whether UNICODE is defined or not.

And for string literals, does it matter if I use TEXT, _TEXT, or _T ?
Some references say that for C++ I should be using TEXT,
while others say I can use either _TEXT or _T; each one seems to expand to the
same result.

And does it matter where I define (or not define) UNICODE in my source files ?

> both 8-bit and Unicode as determined on-the-fly during runtime. It is trivial
> in VS > 6 because you can read the data in as 8-bit, immediately convert it to
> Unicode, and continue on, not having to do anything special except use CStringA
> for the 8-bit input)

Yea, I am going to have to talk my wife into the cost of a new VS, it would appear.
I don't think I qualify for the upgrade pricing since I bought my first one under
academic discount pricing as a student in college (taking courses at night).

> If you expect to get through your entire career never writing code for anyone other
> than yourself, it won't matter. But if you write anything that goes out the door,
> you should probably expect that Unicode support will be required. Even simple
> things like people's surnames in another language can be an issue. For example,
> suppose you want to get the correct spelling of the composer Antonin Dvorak.
> The "r" has a little accent mark over it, and you can only represent that in Unicode.

Yea, that is a good premise example. Sounds like he might be Czech.

> When VS2005 came along and by default created Unicode apps, I never noticed.
> I kept programming in exactly the same way I had been programming for a decade.

I have an option to buy VS Pro 2005 at a good price, but I heard that 2005 did
not have a Class Wizard, etc. What is your input on that?

> It's 2010. As far as writing code, 8-bit characters are almost completely dead.
> Note that many *files* are still kept in Unicode, but that's not the same as
> programming, because you can always use an 8-bit encoding like UTF-8 to keep your
> text in 8-bit files.

Yes I am aware of the different prefixed codes for files

> But you should always "think Unicode" inside a program. It's worth the time.

Ok I will start trying immediately. Thanks again. (for everything)
Later.........RB