From: Simon on
Hi,

I am trying to read a file with some Japanese words.
(Well, it has a mix of Japanese and English words).


// --------------------------------
// _UNICODE is defined
//
FILE* fp = 0;
errno_t err = _tfopen_s( &fp, _T("name.txt"), _T("rb") );
....
//-- get the file length
....

TCHAR* buf = new TCHAR[ length+1 ];
memset( buf, 0, length+1 );

if( fread( buf, sizeof(TCHAR), length, file ) != length )
{
...
return
}

....
// --------------------------------

But doing that does not load the file in 'buf' properly.
Even the non Japanese characters are not loaded properly.

What am I doing wrong? (Using Notepad++ I can see that the data is as
expected).

Thanks

Simon
From: Giovanni Dicanio on
"Simon" <bad(a)example.com> wrote in message
news:#FEiyP2yKHA.4492(a)TK2MSFTNGP05.phx.gbl...

> I am trying to read a file with some Japanese words.
> (Well, it has a mix of Japanese and English words).

I think you should figure out which encoding the file uses.
The file could be Unicode UTF-16 (LE or BE), or UTF-8...
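
A minimal sketch of one way to start narrowing down the encoding: look for a byte order mark (BOM) at the start of the file. The names Bom and DetectBom are hypothetical, not from any library. A missing BOM proves nothing (UTF-8 files often have none, and Shift-JIS never does), so treat this as a first hint only.

```cpp
#include <cassert>
#include <cstddef>

// The BOM-carrying encodings mentioned in this thread.
enum class Bom { None, Utf8, Utf16LE, Utf16BE };

// Hypothetical helper: inspects the first bytes of a buffer for a BOM.
// UTF-32 BOMs are deliberately ignored to keep the sketch short
// (a UTF-32 LE file would be reported as Utf16LE here).
Bom DetectBom(const unsigned char* data, std::size_t size)
{
    assert(data != nullptr || size == 0);

    if (size >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
        return Bom::Utf8;
    if (size >= 2 && data[0] == 0xFF && data[1] == 0xFE)
        return Bom::Utf16LE;
    if (size >= 2 && data[0] == 0xFE && data[1] == 0xFF)
        return Bom::Utf16BE;
    return Bom::None;   // no BOM: could be UTF-8, Shift-JIS, plain ASCII, ...
}
```

If this returns Bom::None, the encoding still has to come from somewhere else: the program that wrote the file, or an educated guess.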

There is a useful, freely available class that lets you load text in
different encodings and convert it to Unicode UTF-16 (the native Unicode
format on Windows):

http://www.codeproject.com/KB/files/stdiofileex.aspx

HTH,
Giovanni


From: Oliver Regenfelder on
Hello,

Simon wrote:
> Hi,
>
> I am trying to read a file with some Japanese words.
> (Well, it has a mix of Japanese and English words).

As Giovanni already pointed out, you need to be aware
of the encoding of the file. Besides the various
Unicode encodings he mentioned, a Japanese text file
might just as easily be encoded in Shift-JIS or some
other non-Unicode encoding.

> FILE* fp = 0;
> errno_t err = _tfopen_s( &fp, _T("name.txt"), _T("rb") );
> ...
> //-- get the file length
> ...
>
> TCHAR* buf = new TCHAR[ length+1 ];
> memset( buf, 0, length+1 );

memset(buf, 0, sizeof(TCHAR)*(length+1));

as TCHAR is wchar_t (two bytes on Windows) if _UNICODE is
defined.

> if( fread( buf, sizeof(TCHAR), length, file ) != length )

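Be careful here as well: fread returns the number of complete items
read, not the number of bytes. Comparing against length is only right
if length counts TCHARs. If length is the file size in bytes, this call
asks for sizeof(TCHAR) * length bytes and will come up short. Either
request length / sizeof(TCHAR) items, or read with an item size of 1:

fread(buf, 1, length, fp) != length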
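
A minimal sketch of reading the whole file as raw bytes, which sidesteps the TCHAR size questions entirely. ReadAllBytes is a hypothetical helper name, and it uses plain std::fopen instead of _tfopen_s so that it is not tied to one compiler. With an item size of 1, the value fread returns is a byte count and can be compared directly against the file length.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <vector>

// Hypothetical helper: reads a whole file into 'out' as raw bytes.
// Returns false if the file cannot be opened, sized, or fully read.
bool ReadAllBytes(const char* path, std::vector<unsigned char>& out)
{
    assert(path != nullptr);

    std::FILE* fp = std::fopen(path, "rb");   // binary mode: no newline translation
    if (!fp)
        return false;

    bool ok = false;
    if (std::fseek(fp, 0, SEEK_END) == 0) {
        const long length = std::ftell(fp);   // file size in bytes
        if (length >= 0 && std::fseek(fp, 0, SEEK_SET) == 0) {
            // The vector zero-fills exactly 'length' bytes; no memset needed.
            out.assign(static_cast<std::size_t>(length), 0);
            // Item size 1, so the return value counts bytes.
            const std::size_t got =
                out.empty() ? 0 : std::fread(out.data(), 1, out.size(), fp);
            ok = (got == out.size());
        }
    }
    std::fclose(fp);
    return ok;
}
```

The bytes in 'out' are still in the file's own encoding; they are not text yet.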

> But doing that does not load the file in 'buf' properly.
> Even the non Japanese characters are not loaded properly.
>
> What am I doing wrong? (Using notpad++ I can see that the data is as
> expected).

Well, you are reading the file as a bunch of bytes. You first have to
convert that data from the encoding used in the file to Unicode before
you can make real sense of the content.

Best regards,

Oliver
From: Mihai N. on

> But doing that does not load the file in 'buf' properly.
> Even the non Japanese characters are not loaded properly.
>
> What am I doing wrong? (Using notpad++ I can see that the data is as
> expected).

As pointed out already, you have to know the encoding of the file.
But based on the above my best guess is UTF-16.

Then you have to define both _UNICODE and UNICODE.

memset( buf, 0, length+1 );
should be memset( buf, 0, (length+1)*sizeof(TCHAR) );




--
Mihai Nita [Microsoft MVP, Visual C++]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email

From: Mihai N. on
>> But doing that does not load the file in 'buf' properly.
>> Even the non Japanese characters are not loaded properly.
>>
>> What am I doing wrong? (Using notpad++ I can see that the data is as
>> expected).
>
> As pointed out already, you have to know the encoding of the file.
> But based on the above my best guess is UTF-16.
>
> Then you have to define both _UNICODE and UNICODE.
>
> memset( buf, 0, length+1 );
> should be memset( buf, 0, (length+1)*sizeof(TCHAR) );

My bad!
Since you have _UNICODE defined and you don't even see the English
right, then the file is anything but UTF-16.

if you are on a Japanese system
    probably UTF-8 or Shift-JIS (cp932)
else
    probably UTF-8

So load the file as bytes, then use MultiByteToWideChar (with CP_UTF8, or
code page 932 for Shift-JIS) to convert them to UTF-16.
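
A minimal sketch of that conversion step, assuming the file turned out to be UTF-8. Utf8ToWide is a hypothetical helper name. On Windows it calls MultiByteToWideChar twice, once to size the buffer and once to convert. The non-Windows branch is only a simplified stand-in so the sketch compiles elsewhere: it assumes a 32-bit wchar_t and does not reject every malformed sequence.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <stdexcept>
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: converts UTF-8 bytes to a wide string.
std::wstring Utf8ToWide(const std::string& bytes)
{
    if (bytes.empty())
        return std::wstring();
#ifdef _WIN32
    if (bytes.size() > static_cast<std::size_t>(INT_MAX))
        throw std::runtime_error("input too large");
    const int count = static_cast<int>(bytes.size());
    // First call: ask how many wchar_t units the result needs.
    const int needed = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                           bytes.data(), count, NULL, 0);
    if (needed <= 0)
        throw std::runtime_error("input is not valid UTF-8");
    std::wstring wide(static_cast<std::size_t>(needed), L'\0');
    // Second call: convert into the buffer. For Shift-JIS, pass 932
    // instead of CP_UTF8.
    const int written = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                            bytes.data(), count, &wide[0], needed);
    assert(written == needed);
    (void)written;
    return wide;
#else
    static_assert(sizeof(wchar_t) >= 4, "stand-in assumes a 32-bit wchar_t");
    std::wstring wide;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        unsigned long cp = 0;
        std::size_t extra = 0;   // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::runtime_error("input is not valid UTF-8");
        if (extra >= bytes.size() - i)
            throw std::runtime_error("truncated UTF-8 sequence");
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("input is not valid UTF-8");
            cp = (cp << 6) | (c & 0x3F);
        }
        wide.push_back(static_cast<wchar_t>(cp));
        i += extra + 1;
    }
    return wide;
#endif
}
```

With _UNICODE defined, TCHAR is wchar_t, so on Windows the result can be used wherever the original code expected a TCHAR buffer. If the file starts with a UTF-8 BOM, skip those three bytes before converting.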
