Unicode again [Win32 API]


From: Jongware on 26 Apr 2010 10:05

On 26-Apr-10 15:43 PM, Jean wrote:
>> Your error is you take the environment requirement "all text strings
>> should be converted to TCHAR" to mean "all /unsigned char/ strings ..."
>> The environment only needs this for its own data.
>
> OK, i understand (well, i think i do :-) )
>
> what about a file with embedded unicode text ?
> in that case the content is not a simple unsigned byte list, correct ?

Yes, you got it. You might want to think about how you would read a
Unicode data file in a non-Unicode environment -- sort of the opposite
of what you have now.

(Even 'reading Unicode text' *in* a Unicode environment needs some
attention, as there is not really such a thing as 'plain Unicode text';
it might be UTF-8 encoded, which has no byte-ordering issue, or it might
be UTF-16 starting with the magic BOM value, U+FEFF, whose two bytes
appear in the file as FF FE or FE FF and so indicate in which order the
high-byte/low-byte pairs are.)
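
As a rough illustration of the BOM check described above, here is a minimal sketch in C. The names text_encoding and detect_bom are made up for this example and are not part of the Win32 API or the CRT. It assumes the first bytes of the file were read in binary mode into an unsigned char buffer, and it deliberately ignores UTF-32 (whose little-endian BOM also starts with FF FE).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names for this sketch; not a Win32 or CRT API. */
typedef enum {
    ENC_UNKNOWN,   /* no BOM: could be ANSI, BOM-less UTF-8, binary data... */
    ENC_UTF8,      /* bytes EF BB BF */
    ENC_UTF16_LE,  /* bytes FF FE */
    ENC_UTF16_BE   /* bytes FE FF */
} text_encoding;

/* Inspect the first bytes of a file that was read in binary mode. */
text_encoding detect_bom(const unsigned char *data, size_t len)
{
    assert(data != NULL || len == 0);

    if (len >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
        return ENC_UTF8;
    if (len >= 2 && data[0] == 0xFF && data[1] == 0xFE)
        return ENC_UTF16_LE;
    if (len >= 2 && data[0] == 0xFE && data[1] == 0xFF)
        return ENC_UTF16_BE;
    return ENC_UNKNOWN;  /* absence of a BOM proves nothing either way */
}
```

A file without a BOM comes back as ENC_UNKNOWN, so the caller still has to decide on a default encoding.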

Theoretically, you can write your entire program as a *non* Unicode
version and test that. When upgrading to a Unicode version, all you need
to change are the I/O strings -- the filename in your original code
snippet, and any text strings that are to be communicated to the user.
The code itself should not change.

By way of a challenge:
Using MSVC's text macro _T("your text") you can make your code Unicode
*unaware* -- that is, it should compile and run the same, whether you
have defined UNICODE or not. You should only bracket the type of strings
I mentioned above, and not those in 'binary' comparisons, such as

if (strncmp (magicstring, "BM", 2) == 0)
..

where you would *not* use the automatically translated "_tcsncmp" because
in this case you are looking for an exact byte-for-byte match.
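
A minimal sketch of such a 'binary' comparison, which stays the same whether or not UNICODE is defined. The helper name has_bmp_magic is invented for this example; it assumes the file header was read into an unsigned char buffer. memcmp is used here instead of a string function because file data need not be NUL-terminated.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 if the buffer starts with the two
   bytes 'B','M' (the BMP file signature), 0 otherwise. */
int has_bmp_magic(const unsigned char *data, size_t len)
{
    /* The narrow literal "BM" is intentional: this is file content,
       not text for the environment, so it gets no _T() or L prefix. */
    return len >= 2 && memcmp(data, "BM", 2) == 0;
}
```

Only the file name passed to the open call and any strings shown to the user would be candidates for _T(); the signature check itself is untouched.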

Happy coding,
[Jw]

From: Dee Earley on 26 Apr 2010 11:10

On 26/04/2010 14:43, Jean wrote:
>> Your error is you take the environment requirement "all text strings
>> should be converted to TCHAR" to mean "all /unsigned char/ strings ..."
>> The environment only needs this for its own data.
>
> OK, i understand (well, i think i do :-) )
>
> what about a file with embedded unicode text ?
> in that case the content is not a simple unsigned byte list, correct ?

Incorrect. It will still be a stream of bytes.
It just won't be "plain text".

--
Dee Earley (dee.earley(a)icode.co.uk)
i-Catcher Development Team

iCode Systems

(Replies direct to my email address will be ignored.
Please reply to the group.)

From: r_z_aret on 26 Apr 2010 17:58

On Sun, 25 Apr 2010 16:32:52 +0200, "Jean" <nosp-jean(a)free.fr> wrote:

My comments may be a paraphrase of Jongware's comments. See below
(inline).

>Hello
>
>this code works:
>
>unsigned char buffer[1025];
>FILE *pf = fopen("toto.bmp", "rb");
>fread(buffer, sizeof(unsigned char),1024, pf);
>fclose(pf);
>
>if(buffer[0] == 'B' && buffer[1] == 'M')
> ...
>
>this code does not work: (compiled with UNICODE and _UNICODE)
>
>WCHAR buffer[1025];
>FILE *pf = _wfopen(L"toto.bmp", L"rb");
>fread(buffer,sizeof(WCHAR),1024,pf);

This line will read two bytes into each of the two-byte elements of
buffer, so the first two bytes of the file ('B' and 'M') both end up
in buffer[0].

>fclose(pf);
>
>if(buffer[0] == 'B' && buffer[1] == 'M')
> ...
>
>it's the comparison that does not work.
>(i tried if(buffer[0] == L'B' && buffer[1] == L'M') too)
>any idea ?

This will compare the two bytes in buffer[0] with the two-byte
character L'B' and the two bytes in buffer[1] with the two-byte
character L'M'. This will not work as you expect unless the input file
is Unicode text (so each character takes up two bytes in the file).
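
To make that concrete, here is a small sketch of what ends up in the first two-byte element. It uses uint16_t as a portable stand-in for WCHAR (an assumption for this example, since WCHAR is a Windows type), and the function name first_element is made up. memcpy mimics what fread does when it fills a buffer of two-byte elements.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy the first two bytes of a file image into one
   16-bit element, the way fread into a two-byte-element buffer fills
   buffer[0]. */
uint16_t first_element(const unsigned char *data)
{
    uint16_t element;
    memcpy(&element, data, sizeof element);
    return element;
}
```

For a file starting with the bytes 'B' (0x42) and 'M' (0x4D), a little-endian machine such as x86 gets 0x4D42 in that element, which equals neither 'B' nor L'B'. Reading into an unsigned char buffer keeps the bytes separate, so buffer[0] == 'B' and buffer[1] == 'M' hold.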

I _think_ you are trying to support UNICODE and ASCII files in one
program. I don't think you can do that unless you have a separate
section for each, and your program determines which type of file
you're reading and chooses the right code. I believe it is very tricky
to determine by looking at a file's contents whether it is UNICODE or
ASCII. That is why UNICODE files are usually marked by a preceding BOM
(Byte Order Mark). For more info about BOM, use Google to look it up
in this newsgroup.

Something of a nit pick:
When I first read your note, I assumed "does not work" meant "does not
compile". You might try to be more explicit in the future.

>
>jean
>

-----------------------------------------
To reply to me, remove the underscores (_) from my email address (and please indicate which newsgroup and message).

Robert E. Zaret, MVP
PenFact, Inc.
20 Park Plaza, Suite 400
Boston, MA 02116
www.penfact.com
Useful reading (be sure to read its disclaimer first):
http://catb.org/~esr/faqs/smart-questions.html

From: r_z_aret on 26 Apr 2010 17:58

On Mon, 26 Apr 2010 14:08:56 +0200, Jongware <jongware(a)no-spam.plz>
wrote:

>On 26-Apr-10 6:47 AM, Jean wrote:
> >
> > "ScottMcP [MVP]"<scottmcp(a)mvps.org> wrote in message news:
> > 45b16a39-252d-403a-80e4-1d1e38b57f52(a)u32g2000vbc.googlegroups.com...
> >> This is comparing unicode data with ANSI characters:
> >>
> >> if(buffer[0] == 'B'&& buffer[1] == 'M')
> >>
> >> Try it this way:
> >>
> >> if(buffer[0] == L'B'&& buffer[1] == L'M')
> >
> >
>>>> if(buffer[0] == L'B'&& buffer[1] == L'M')
>> same effect
>
>For the exact same reason.
>
>Your buffer is a TCHAR, and in a Unicode environment it will use 2 bytes
>per UC character.

No, buffer is explicitly WCHAR. The definition of TCHAR depends on
whether UNICODE is defined. The definition of WCHAR does not. None of
the original code uses TCHAR, so TCHAR has no relevance here, and is a
distraction.

>You read a single-byte array into a double-byte destination. What will
>the contents of buffer look like? Use your debugger! This is from memory:

This statement made me think about the line using fread. See my reply
to the original post.


From: Jean on 27 Apr 2010 01:05

Hi Robert

>> I _think_ you are trying to support UNICODE and ASCII files in one
> program
yes, correct

>I assumed "does not work" meant "does not compile".
no, it compiles correctly; it is the comparison (==) that fails

I use VC6 and the C SDK, targeting XP, Vista and 7. Following all the
previous advice, I now compile with UNICODE and _UNICODE.
All my file accesses are made with _wfopen, and all the reads with fread
into an unsigned char buffer.
It works fine with Western, Greek, Chinese and Russian file names :-)
For listing the files I use a _wfinddata_t structure with _wfindfirst and
_wfindnext; that works too.
For displaying those file names in listviews or status bars, window titles
and so on, I use WCHAR everywhere.

Jean

