From: Andreas Leitgeb
Georgios Petasis <petasis(a)iit.demokritos.gr> wrote:
> On 3/8/2010 13:19, eugene wrote:
>> On Aug 3, 12:03 pm, Georgios Petasis<peta...(a)iit.demokritos.gr> wrote:
>>> Hm, I have found a page that states that the DROPFILES structure
>>> will never contain data in utf-8 format:
>>> http://www.eggheadcafe.com/software/aspnet/33812038/-copy-paste-with-...
>>> It states "Note that file names in DROPFILES structure are never in
>>> UTF-8. They are either in UTF-16, or in system default ANSI code page."

Would it be possible to get a byte-dump of the relevant part of that
DROPFILES structure?

> I am not sure I will manage to fix it. I compiled tkdnd with UNICODE &
> _UNICODE defined instead of _MBCS, and treated the data as both unicode
> (using Tcl_UniCharToUtfDString) and UTF-16 (using WideCharToMultiByte).
> The result was the same in both cases, a wrong one. Dropping the
> filename "english, русский, العربية, Ελληνικά.txt" results in "english,
> @CAA:89, 'D91(J), •»»·½ΉΊ¬.txt".

I pasted these two strings to "recode u8..utf16 | hd" and got:

english, русский, العربية, Ελληνικά.txt
00000000: FE FF 00 65 00 6E 00 67 - 00 6C 00 69 00 73 00 68 | e n g l i s h|
00000010: 00 2C 00 20 04 40 04 43 - 04 41 04 41 04 3A 04 38 | , @ C A A : 8|
00000020: 04 39 00 2C 00 20 06 27 - 06 44 06 39 06 31 06 28 | 9 , ' D 9 1 (|
00000030: 06 4A 06 29 00 2C 00 20 - 03 95 03 BB 03 BB 03 B7 | J ) , |
00000040: 03 BD 03 B9 03 BA 03 AC - 00 2E 00 74 00 78 00 74 | . t x t|

english, @CAA:89, 'D91(J), •»»·½ΉΊ¬.txt
00000000: FE FF 00 65 00 6E 00 67 - 00 6C 00 69 00 73 00 68 | e n g l i s h|
00000010: 00 2C 00 20 00 40 00 43 - 00 41 00 41 00 3A 00 38 | , @ C A A : 8|
00000020: 00 39 00 2C 00 20 00 27 - 00 44 00 39 00 31 00 28 | 9 , ' D 9 1 (|
00000030: 00 4A 00 29 00 2C 00 20 - 20 22 00 BB 00 BB 00 B7 | J ) , " |
00000040: 00 BD 03 89 03 8A 00 AC - 00 2E 00 74 00 78 00 74 | . t x t|

There does seem to be some pattern... (and some exceptions to it, too)
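
One reading of the two dumps, sketched in Python: every UTF-16 code unit loses its high byte, and the remaining byte is then displayed through a Windows code page. The choice of cp1253 (Greek) is an assumption made only because it reproduces the "exceptions" (20 22, 03 89, 03 8A) in the second dump; the function name low_bytes is hypothetical.

```python
def low_bytes(text):
    """Keep only the low 8 bits of each UTF-16 code unit in text."""
    units = text.encode("utf-16-be")
    return units[1::2]  # big-endian: odd offsets hold the low bytes


# Code points copied from the first hex dump above.
name = (
    "english, "
    "\u0440\u0443\u0441\u0441\u043a\u0438\u0439, "        # Russian word
    "\u0627\u0644\u0639\u0631\u0628\u064a\u0629, "        # Arabic word
    "\u0395\u03bb\u03bb\u03b7\u03bd\u03b9\u03ba\u03ac"    # Greek word
    ".txt"
)

truncated = low_bytes(name)

# Assumption: the truncated bytes were shown through cp1253.
# 0x95 -> U+2022, 0xB9 -> U+0389, 0xBA -> U+038A, as in the second dump.
as_cp1253 = truncated.decode("cp1253")

# The same bytes through cp1252 give U+00B9 and U+00BA instead.
as_cp1252 = truncated.decode("cp1252")

print(low_bytes("\u0440\u0443\u0441\u0441\u043a\u0438\u0439"))  # b'@CAA:89'
```

Under that assumption the Russian and Arabic parts collapse to plain ASCII because their low bytes fall below 0x80, while the Greek low bytes (0x95 to 0xBD) land in the code-page-dependent range.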
From: Jeff Hobbs
On Aug 3, 8:10 am, Georgios Petasis <peta...(a)iit.demokritos.gr> wrote:
> On 3/8/2010 13:19, eugene wrote:
>
> > On Aug 3, 12:03 pm, Georgios Petasis<peta...(a)iit.demokritos.gr>
> > wrote:
>
> >> Hm, I have found a page that states that the DROPFILES structure
> >> will never contain data in utf-8 format:
>
> >>http://www.eggheadcafe.com/software/aspnet/33812038/-copy-paste-with-....
>
> >> It states "Note that file names in DROPFILES structure are never in
> >> UTF-8. They are either in UTF-16, or in system default ANSI code page."
>
> >> So my assumption that defining _MBCS would give them in UTF-8 is not
> >> valid. Windows uses the default ANSI code page. I will define _UNICODE
> >> and handle the data as a Unicode string.
>
> >> George
>
> > So then can we expect a patch any time soon? :)
>
> I am not sure I will manage to fix it. I compiled tkdnd with UNICODE &
> _UNICODE defined instead of _MBCS, and treated the data as both unicode
> (using Tcl_UniCharToUtfDString) and UTF-16 (using WideCharToMultiByte).
> The result was the same in both cases, a wrong one. Dropping the
> filename "english, русский, العربية, Ελληνικά.txt" results in "english,
> @CAA:89, 'D91(J), •»»·½ΉΊ¬.txt".

How did you get it to compile with UNICODE and _UNICODE defined? I
added these to OleDND.h and get a lot of code errors about it not
being really wchar-aware.

Jeff
From: Georgios Petasis
On 3/8/2010 22:16, Jeff Hobbs wrote:
> How did you get it to compile with UNICODE and _UNICODE defined? I
> added these to OleDND.h and get a lot of code errors about it not
> being really wchar-aware.
>
> Jeff

Yes. I have corrected all these :-)
I have just committed the changes. I haven't updated TEA though, only cmake.

George
From: Jeff Hobbs
On Aug 3, 12:29 pm, Georgios Petasis <peta...(a)iit.demokritos.gr>
wrote:
> Yes. I have corrected all these :-)
> I have just committed the changes. I haven't updated TEA though, only cmake.

OK, using those changes I see that there is improvement after adding
#define UNICODE to the sources, but not correctness yet. With the
Greek text, what was ???? is now •»»·½¹º¬, which happens to equate to:

(demos) 62 % encoding convertfrom utf-8 Ελληνικά
•»»·½¹º¬

I suspect a conversion is occurring that shouldn't happen.

Jeff
From: Georgios Petasis
On 3/8/2010 23:22, Jeff Hobbs wrote:
> OK, using those changes I see that there is improvement after adding
> #define UNICODE to the sources, but not correctness yet. With the
> Greek text, what was ???? is now •»»·½¹º¬, which happens to equate to:
>
> (demos) 62 % encoding convertfrom utf-8 Ελληνικά
> •»»·½¹º¬
>
> I suspect a conversion is occurring that shouldn't happen.
>
> Jeff

Which is absolutely correct. The library file tkdnd_windows.tcl had, in
olednd::_normalise_data, a call to "encoding convertfrom $data" for the
CF_HDROP type, which I had completely forgotten about.
Just fixed in the latest SVN HEAD.

Many thanks,

George