Conversion from UTF32 to UTF8 for review [MFC]

From: Paul Bibbings on 1 Jun 2010 14:38

Peter Olcott <NoSpam(a)OCR4Screen.com> writes:

> On 6/1/2010 5:52 AM, Oliver Regenfelder wrote:
>> Hello,
>>
>> Leigh Johnston wrote:
>>> Also printf sucks, this is a C++ newsgroup not a C newsgroup.
>>
>> This is not even a general C++ newsgroup but an MFC one. So
>> strictly there is zero relevance of his posting to this
>> newsgroup.
>>
>> Best regards,
>>
>> Oliver
>
> So no one using MFC (such as I) would ever need to decode UTF-8?

People using MFC also wear coats. Are you going to, then, start talking
about clothing and consider /that/ on-topic?

Regards

Paul Bibbings

From: Peter Olcott on 1 Jun 2010 14:53

On 6/1/2010 1:38 PM, Paul Bibbings wrote:
> Peter Olcott<NoSpam(a)OCR4Screen.com> writes:
>
>> On 6/1/2010 5:52 AM, Oliver Regenfelder wrote:
>>> Hello,
>>>
>>> Leigh Johnston wrote:
>>>> Also printf sucks, this is a C++ newsgroup not a C newsgroup.
>>>
>>> This is not even a general C++ newsgroup but an MFC one. So
>>> strictly there is zero relevance of his posting to this
>>> newsgroup.
>>>
>>> Best regards,
>>>
>>> Oliver
>>
>> So no one using MFC (such as I) would ever need to decode UTF-8?
>
> People using MFC also wear coats. Are you going to, then, start talking
> about clothing and consider /that/ on-topic?
>
> Regards
>
> Paul Bibbings

Unless and until there is a comp.unicode.programmer newsgroup what
alternative do I have to ask Unicode programmer questions?

From: Peter Olcott on 1 Jun 2010 14:59

On 6/1/2010 1:04 PM, Joseph M. Newcomer wrote:
> See below...
> On Tue, 01 Jun 2010 10:34:40 -0500, Peter Olcott<NoSpam(a)OCR4Screen.com> wrote:
>
>> On 6/1/2010 9:34 AM, Joseph M. Newcomer wrote:
>>> See below...
>>> On Mon, 31 May 2010 13:49:07 -0500, Peter Olcott<NoSpam(a)OCR4Screen.com> wrote:
>>>
>>>> On 5/31/2010 1:16 PM, Giovanni Dicanio wrote:
>>>>> "Joseph M. Newcomer"<newcomer(a)flounder.com> wrote:
>>>>>
>>>>>>> UTF8.reserve(UTF32.size() * 4); // worst case
>>>>>> ****
>>>>>> Note that this will call malloc(), which will involve setting a lock,
>>>>>> then searching for a
>>>>>> block to allocate, then releasing the lock. Since you have been a
>>>>>> fanatic about
>>>>>> performance, why is it you put a very expensive operation like
>>>>>> 'reserve' in your code?
>>>>>>
>>>>>> While it is perfectly reasonable, it seems inconsistent with your
>>>>>> previously-stated goals.
>>>>>
>>>>> Joe: I'm not sure if you are ironic or something :) ... but I believe
>>>>> that std::vector::reserve() with a proper capacity value, followed by
>>>>> several push_back()s, is very efficient.
>>>>> Sure, not as efficient as a static stack-allocated array, but very
>>>>> efficient.
>>>>
>>>> He needed to find some excuse to denigrate my code. He has had a
>>>> personal grudge against me for several months. I don't really know what
>>>> I said to offend him, but, it must have occurred sometime after he sang
>>>> very high praises about my patent a few months ago.
>>> ***
>>> I do not have a "personal grudge against you"; what I dislike are people who are
>>> pretentious, who make statements they can't back up, and present code that is inconsistent
>>> with their loudly-touted goals and try to make claims that it is the best possible code
>>> when it is not.
>>>
>>> I defended you against what I thought was an *unfair* accusation, that of being a Patent
>>> Troll. If there are unjust accusations, I will object. But when you batter us to
>>> insensibility about how critical performance is, and talk about presenting the "fastest
>>> possible design", then I am equally offended; designs cannot be executed and therefore
>>> cannot have speed. Code has measurable performance. And the code presented was bad code,
>>
>> Sure it can and indeed it does. Many designs are inherently
>> substantially faster than specific alternatives. Your black and white
>> all or none thinking indicates a perspective that is out of balance.
>>
>> A design based on the query of a specific customer using customer number
>> within a very large database using a linear search is obviously very
>> much slower than a design based on using a B+ tree index. There are
>> countless other examples.
> ***
> No, a "design" is not executable. A choice to use a B-tree or a linear search can only be
> measured in terms of actual code. A design would state that there was a way to map a
> customer number to a record. An implementation decides whether or not a B-tree is used.
> ****
>>
>>> for all the reasons I stated. It has nothing to do with a personal grudge; it has
>>> entirely to do with the fact that you state one thing, then present as evidence of your
>>> correctness something which contradicts your own statement. This is not consistent.
>>> Therefore, it is a target of opportunity to point out that you are not making sense. I
>>> also have to judge code for its correctness not just in the core algorithm, but in the
>>> overall implementation; utility code which uses printf or which even interacts with the
>>> user is not correct code, because it either will not work at all or will produce
>>> meaningless output to the user, and neither of these represent an acceptable design.
>>>
>>> If you make sense, I will defend you. If you prove me wrong with actual numbers, I will
>>> accept your numbers and agree that you are actually right. I did once before. But if you
>>> offer opinions on the performance of artifacts that are measurable (code, not designs),
>>> without the data to back them, then you are not making sense, and you need to be told
>>> this.
>>> joe
>>
>> If you measure my code against the incorrect standard that it is
>> specifically encoded to be the fastest possible encoding, even then it
>> is not abysmal. All of the performance improvements that you suggested
>> don't result in as much as a doubling in speed.
>> http://www.ocr4screen.com/UTF8.cpp
>>
>> From benchmarking my code against the code that Hector posted a link to
>> http://bjoern.hoehrmann.de:80/utf-8/decoder/dfa/
>> This other code was only 37% faster.
> *****
> "Only" 37% faster? Actually 37% is a pretty big number in terms of performance! Most
> attempts to "improve" performance are lucky if they get single-digit percentage
> improvement. As someone who spent a nontrivial amount of his life worrying about these
> issues, I can say that 37% is a SUBSTANTIAL performance improvement!
>
> And if it were 1% faster, it would still prove your code was not the fastest possible. But
> 37%? You aren't even in the running in this contest!
> joe
> ****
>>
>> The specific test was to generate 100 instances of every codepoint
>> (skipping the 0xD800-0xDFFF range) and then decode these 100 instances.
>> The instances were generated with the code posted in this thread. All
>> memory was allocated in advance so that only the decode speed would be
>> measured.
>>
>> You are certainly smart and educated enough to be able to estimate these
>> results in advance. To call code abysmal merely because it takes 50%
>> more time is certainly not an objective assessment of the actual code
>> quality.
> ****
> 50% more time? Wow! I'd call that "abysmal". Now if it were only 3% slower, I would
> have been guilty of overexaggeration. By objective measure, 50% more time is REALLY BAD!
> ****

Not at all when modern measures of code quality are weighed. Code that
is twice as fast, yet, tenfold more difficult to maintain is inferior
code in all but the most time critical applications.

>>
>> If the code took 50-fold more time and the design goal was maximum
>> performance, then this would surely be abysmal. Since the design goal
>> was not to produce the fastest possible encoding and the speed
>> difference is only 50%, an "abysmal" assessment of code quality is
>> clearly dishonest.
> ****
> No, if it was more than an order of magnitude slower, it would be laughably slower. I
> guess we are discussing the meaning of "abysmal". By my standards of code performance,
> 50% more time is "abysmal". That is not dishonest. We used to think a 10% improvement
> was substantial. But then, we were all highly experienced programmers (more than half of
> the project team had PhDs), so we knew what to expect.
> joe
> ****
>>
>>
>>> .
>>> ****
>>>>
>>>>>> No, the CORRECT way to write such code is to either throw an exception
>>>>>> (if you are in C++,
>>>>>> which you clearly are) or return a value indicating the error (for
>>>>>> example, in C, an
>>>>>
>>>>
>>>> The "correct" way to handle an error when testing code for the first
>>>> time is to use a printf() statement, or other easy to use debugging
>>>> construct. When the code moves to production, then either of the other
>>>> two suggestions may be appropriate.
>>>>
>>>>> In this case, I'm for exception.
>>>>> Thanks to exception, you could use the precious function return value to
>>>>> actually return the resulting buffer (UTF8 string), instead of passing
>>>>> it as a reference to the function:
>>>>>
>>>>> // Updated prototype:
>>>>> // - use 'const' correctness for utf32
>>>>> // - return resulting utf8
>>>>> // - may throw on error
>>>>> std::vector<uint8_t> toUTF8(const std::vector<uint32_t> & utf32);
>>>>
>>>> For most compilers this requires making an extra copy.
>>>>
>>>>>
>>>>> Note that thanks to the move semantics (i.e. the new "&&" thing of
>>>>> C++0x, available in VC10 a.k.a. VS2010), you don't pay for extra useless
>>>>> copies in returning potentially big objects.
>>>>>
>>>>> Giovanni
>>>>>
>>>>>
>>>>>
>>>> Counting on this results in code that does not have the same performance
>>>> characteristics across multiple platforms.
>>> Joseph M. Newcomer [MVP]
>>> email: newcomer(a)flounder.com
>>> Web: http://www.flounder.com
>>> MVP Tips: http://www.flounder.com/mvp_tips.htm
> Joseph M. Newcomer [MVP]
> email: newcomer(a)flounder.com
> Web: http://www.flounder.com
> MVP Tips: http://www.flounder.com/mvp_tips.htm
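
For readers following the reserve()/push_back() and exception exchange quoted above, here is a minimal sketch of an encoder with the prototype Giovanni proposed. It is not the code at the UTF8.cpp link and not anyone's posted implementation. The error policy (throwing std::invalid_argument on a surrogate or on a value above 0x10FFFF) is an assumption made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Sketch only: encode UTF-32 code points as UTF-8 and return the result by value.
// Assumed error policy: throw std::invalid_argument on invalid input.
std::vector<std::uint8_t> toUTF8(const std::vector<std::uint32_t>& utf32)
{
    std::vector<std::uint8_t> utf8;
    utf8.reserve(utf32.size() * 4);  // worst case: 4 bytes per code point, one allocation

    for (std::uint32_t cp : utf32) {
        if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
            utf8.push_back(static_cast<std::uint8_t>(cp));
        } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
            utf8.push_back(static_cast<std::uint8_t>(0xC0 | (cp >> 6)));
            utf8.push_back(static_cast<std::uint8_t>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            if (cp >= 0xD800 && cp <= 0xDFFF)
                throw std::invalid_argument("surrogate in UTF-32 input");
            utf8.push_back(static_cast<std::uint8_t>(0xE0 | (cp >> 12)));
            utf8.push_back(static_cast<std::uint8_t>(0x80 | ((cp >> 6) & 0x3F)));
            utf8.push_back(static_cast<std::uint8_t>(0x80 | (cp & 0x3F)));
        } else if (cp <= 0x10FFFF) {           // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            utf8.push_back(static_cast<std::uint8_t>(0xF0 | (cp >> 18)));
            utf8.push_back(static_cast<std::uint8_t>(0x80 | ((cp >> 12) & 0x3F)));
            utf8.push_back(static_cast<std::uint8_t>(0x80 | ((cp >> 6) & 0x3F)));
            utf8.push_back(static_cast<std::uint8_t>(0x80 | (cp & 0x3F)));
        } else {
            throw std::invalid_argument("code point above 0x10FFFF");
        }
    }

    assert(utf8.size() <= utf32.size() * 4);  // reserve() was sufficient, so no regrowth
    return utf8;  // a local returned by value: typically elided, or moved under C++0x rules
}
```

Because the worst case is reserved up front, the push_back() calls never reallocate. Whether the return costs a copy depends on the compiler, which is the portability point raised in the quoted text.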

From: Joseph M. Newcomer on 1 Jun 2010 15:29

See below....
On Tue, 01 Jun 2010 13:59:06 -0500, Peter Olcott <NoSpam(a)OCR4Screen.com> wrote:

>On 6/1/2010 1:04 PM, Joseph M. Newcomer wrote:
>> See below...
>> On Tue, 01 Jun 2010 10:34:40 -0500, Peter Olcott<NoSpam(a)OCR4Screen.com> wrote:
>>
>>> On 6/1/2010 9:34 AM, Joseph M. Newcomer wrote:
>>>> See below...
>>>> On Mon, 31 May 2010 13:49:07 -0500, Peter Olcott<NoSpam(a)OCR4Screen.com> wrote:
>>>>
>>>>> On 5/31/2010 1:16 PM, Giovanni Dicanio wrote:
>>>>>> "Joseph M. Newcomer"<newcomer(a)flounder.com> wrote:
>>>>>>
>>>>>>>> UTF8.reserve(UTF32.size() * 4); // worst case
>>>>>>> ****
>>>>>>> Note that this will call malloc(), which will involve setting a lock,
>>>>>>> then searching for a
>>>>>>> block to allocate, then releasing the lock. Since you have been a
>>>>>>> fanatic about
>>>>>>> performance, why is it you put a very expensive operation like
>>>>>>> 'reserve' in your code?
>>>>>>>
>>>>>>> While it is perfectly reasonable, it seems inconsistent with your
>>>>>>> previously-stated goals.
>>>>>>
>>>>>> Joe: I'm not sure if you are ironic or something :) ... but I believe
>>>>>> that std::vector::reserve() with a proper capacity value, followed by
>>>>>> several push_back()s, is very efficient.
>>>>>> Sure, not as efficient as a static stack-allocated array, but very
>>>>>> efficient.
>>>>>
>>>>> He needed to find some excuse to denigrate my code. He has had a
>>>>> personal grudge against me for several months. I don't really know what
>>>>> I said to offend him, but, it must have occurred sometime after he sang
>>>>> very high praises about my patent a few months ago.
>>>> ***
>>>> I do not have a "personal grudge against you"; what I dislike are people who are
>>>> pretentious, who make statements they can't back up, and present code that is inconsistent
>>>> with their loudly-touted goals and try to make claims that it is the best possible code
>>>> when it is not.
>>>>
>>>> I defended you against what I thought was an *unfair* accusation, that of being a Patent
>>>> Troll. If there are unjust accusations, I will object. But when you batter us to
>>>> insensibility about how critical performance is, and talk about presenting the "fastest
>>>> possible design", then I am equally offended; designs cannot be executed and therefore
>>>> cannot have speed. Code has measurable performance. And the code presented was bad code,
>>>
>>> Sure it can and indeed it does. Many designs are inherently
>>> substantially faster than specific alternatives. Your black and white
>>> all or none thinking indicates a perspective that is out of balance.
>>>
>>> A design based on the query of a specific customer using customer number
>>> within a very large database using a linear search is obviously very
>>> much slower than a design based on using a B+ tree index. There are
>>> countless other examples.
>> ***
>> No, a "design" is not executable. A choice to use a B-tree or a linear search can only be
>> measured in terms of actual code. A design would state that there was a way to map a
>> customer number to a record. An implementation decides whether or not a B-tree is used.
>> ****
>>>
>>>> for all the reasons I stated. It has nothing to do with a personal grudge; it has
>>>> entirely to do with the fact that you state one thing, then present as evidence of your
>>>> correctness something which contradicts your own statement. This is not consistent.
>>>> Therefore, it is a target of opportunity to point out that you are not making sense. I
>>>> also have to judge code for its correctness not just in the core algorithm, but in the
>>>> overall implementation; utility code which uses printf or which even interacts with the
>>>> user is not correct code, because it either will not work at all or will produce
>>>> meaningless output to the user, and neither of these represent an acceptable design.
>>>>
>>>> If you make sense, I will defend you. If you prove me wrong with actual numbers, I will
>>>> accept your numbers and agree that you are actually right. I did once before. But if you
>>>> offer opinions on the performance of artifacts that are measurable (code, not designs),
>>>> without the data to back them, then you are not making sense, and you need to be told
>>>> this.
>>>> joe
>>>
>>> If you measure my code against the incorrect standard that it is
>>> specifically encoded to be the fastest possible encoding, even then it
>>> is not abysmal. All of the performance improvements that you suggested
>>> don't result in as much as a doubling in speed.
>>> http://www.ocr4screen.com/UTF8.cpp
>>>
>>> From benchmarking my code against the code that Hector posted a link to
>>> http://bjoern.hoehrmann.de:80/utf-8/decoder/dfa/
>>> This other code was only 37% faster.
>> *****
>> "Only" 37% faster? Actually 37% is a pretty big number in terms of performance! Most
>> attempts to "improve" performance are lucky if they get single-digit percentage
>> improvement. As someone who spent a nontrivial amount of his life worrying about these
>> issues, I can say that 37% is a SUBSTANTIAL performance improvement!
>>
>> And if it were 1% faster, it would still prove your code was not the fastest possible. But
>> 37%? You aren't even in the running in this contest!
>> joe
>> ****
>>>
>>> The specific test was to generate 100 instances of every codepoint
>>> (skipping the 0xD800-0xDFFF range) and then decode these 100 instances.
>>> The instances were generated with the code posted in this thread. All
>>> memory was allocated in advance so that only the decode speed would be
>>> measured.
>>>
>>> You are certainly smart and educated enough to be able to estimate these
>>> results in advance. To call code abysmal merely because it takes 50%
>>> more time is certainly not an objective assessment of the actual code
>>> quality.
>> ****
>> 50% more time? Wow! I'd call that "abysmal". Now if it were only 3% slower, I would
>> have been guilty of overexaggeration. By objective measure, 50% more time is REALLY BAD!
>> ****
>
>Not at all when modern measures of code quality are weighed. Code that
>is twice as fast, yet, tenfold more difficult to maintain is inferior
>code in all but the most time critical applications.
****
But that is not what you kept claiming; you kept claiming "fastest possible design" (never
mind the speed of a design cannot be measured). And do you have any evidence that code
that you cited as much faster is in fact more difficult to maintain?

You seem to keep morphing precedents to fit results: if your code is not the fastest
possible, suddenly you don't mean "fastest possible"; you mean something else, like fastest
possible given maintenance costs. But then you don't present any evidence of maintenance
costs or code complexity, even though there have been, for over 35 years, various
standards to measure code complexity, well-known and widely published, and widely debated
as to their validity. Pick any one of them, and say "According to the metric posed by
[name(s)] as published in [citation] in [year], my code measures [number] and this other
code measures [number2] and by their standards, my code is therefore better".

And I believe you had set the requirement that all such conversions were utterly
time-critical. If anyone suggested (as I had) a technique that was easier to write
and/or maintain, you would jump down our collective throats and assert that nothing
mattered but performance. Then suddenly our ideas about development cost and/or ease of
maintenance, steadfastly rejected in earlier discussions, become critical to
justifying code that is less than the fastest possible and, by your own measurements,
SUBSTANTIALLY slower. I guess I just want a consistent picture here.

Note that when I said "performance is rarely a critical issue" in several discussions I
was explicitly told by you that this was a Bad Philosophy. I tend to favor ease of coding
and ease of maintenance over raw performance, but you didn't want to hear that, and now
you're insisting that raw performance doesn't matter, that ease of maintenance and reduced
complexity are what matter. Please choose one position and stick with it.
joe
****
>
>>>
>>> If the code took 50-fold more time and the design goal was maximum
>>> performance, then this would surely be abysmal. Since the design goal
>>> was not to produce the fastest possible encoding and the speed
>>> difference is only 50%, an "abysmal" assessment of code quality is
>>> clearly dishonest.
>> ****
>> No, if it was more than an order of magnitude slower, it would be laughably slower. I
>> guess we are discussing the meaning of "abysmal". By my standards of code performance,
>> 50% more time is "abysmal". That is not dishonest. We used to think a 10% improvement
>> was substantial. But then, we were all highly experienced programmers (more than half of
>> the project team had PhDs), so we knew what to expect.
>> joe
>> ****
>>>
>>>
>>>> .
>>>> ****
>>>>>
>>>>>>> No, the CORRECT way to write such code is to either throw an exception
>>>>>>> (if you are in C++,
>>>>>>> which you clearly are) or return a value indicating the error (for
>>>>>>> example, in C, an
>>>>>>
>>>>>
>>>>> The "correct" way to handle an error when testing code for the first
>>>>> time is to use a printf() statement, or other easy to use debugging
>>>>> construct. When the code moves to production, then either of the other
>>>>> two suggestions may be appropriate.
>>>>>
>>>>>> In this case, I'm for exception.
>>>>>> Thanks to exception, you could use the precious function return value to
>>>>>> actually return the resulting buffer (UTF8 string), instead of passing
>>>>>> it as a reference to the function:
>>>>>>
>>>>>> // Updated prototype:
>>>>>> // - use 'const' correctness for utf32
>>>>>> // - return resulting utf8
>>>>>> // - may throw on error
>>>>>> std::vector<uint8_t> toUTF8(const std::vector<uint32_t> & utf32);
>>>>>
>>>>> For most compilers this requires making an extra copy.
>>>>>
>>>>>>
>>>>>> Note that thanks to the move semantics (i.e. the new "&&" thing of
>>>>>> C++0x, available in VC10 a.k.a. VS2010), you don't pay for extra useless
>>>>>> copies in returning potentially big objects.
>>>>>>
>>>>>> Giovanni
>>>>>>
>>>>>>
>>>>>>
>>>>> Counting on this results in code that does not have the same performance
>>>>> characteristics across multiple platforms.
>>>> Joseph M. Newcomer [MVP]
>>>> email: newcomer(a)flounder.com
>>>> Web: http://www.flounder.com
>>>> MVP Tips: http://www.flounder.com/mvp_tips.htm
>> Joseph M. Newcomer [MVP]
>> email: newcomer(a)flounder.com
>> Web: http://www.flounder.com
>> MVP Tips: http://www.flounder.com/mvp_tips.htm
Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
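
Since the argument above turns on measured numbers, a rough sketch of the kind of test input described in the quoted benchmark may help anyone who wants to reproduce it. It assumes the skipped range is the surrogate block 0xD800-0xDFFF. The names makeTestCodePoints and secondsTaken are hypothetical helpers, not taken from any code posted in this thread, and the routine to be timed is left to the caller.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: `repeats` copies of every code point from 0 to 0x10FFFF,
// skipping the surrogates 0xD800-0xDFFF (assumed to be the range the test skipped).
std::vector<std::uint32_t> makeTestCodePoints(std::size_t repeats)
{
    const std::size_t scalarCount = 0x110000 - 0x800;  // 1,112,064 values per pass
    std::vector<std::uint32_t> out;
    out.reserve(scalarCount * repeats);  // allocate before any timing starts

    for (std::size_t r = 0; r < repeats; ++r) {
        for (std::uint32_t cp = 0; cp <= 0x10FFFF; ++cp) {
            if (cp >= 0xD800 && cp <= 0xDFFF)
                continue;  // surrogates are not valid code points to encode
            out.push_back(cp);
        }
    }

    assert(out.size() == scalarCount * repeats);
    return out;
}

// Hypothetical helper: wall-clock seconds taken by one call to `work`.
template <typename F>
double secondsTaken(F work)
{
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

To compare two routines, build the input and preallocate any output buffer first, then pass each routine to secondsTaken() as a lambda so that only the conversion itself is inside the timed region. With repeats set to 100 the input alone is roughly 445 MB of 32-bit values, so a smaller count is more practical for a quick check.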

From: Paul Bibbings on 1 Jun 2010 17:07

Peter Olcott <NoSpam(a)OCR4Screen.com> writes:

> On 6/1/2010 1:38 PM, Paul Bibbings wrote:
>> Peter Olcott<NoSpam(a)OCR4Screen.com> writes:
>>
>>> On 6/1/2010 5:52 AM, Oliver Regenfelder wrote:
>>>> Hello,
>>>>
>>>> Leigh Johnston wrote:
>>>>> Also printf sucks, this is a C++ newsgroup not a C newsgroup.
>>>>
>>>> This is not even a general C++ newsgroup but an MFC one. So
>>>> strictly there is zero relevance of his posting to this
>>>> newsgroup.
>>>>
>>>> Best regards,
>>>>
>>>> Oliver
>>>
>>> So no one using MFC (such as I) would ever need to decode UTF-8?
>>
>> People using MFC also wear coats. Are you going to, then, start talking
>> about clothing and consider /that/ on-topic?
>>
>> Regards
>>
>> Paul Bibbings
>
> Unless and until there is a comp.unicode.programmer newsgroup what
> alternative do I have to ask Unicode programmer questions?

How about somewhere like <http://unicode.org/consortium/distlist.html>?
For some reason you seem to have chosen to ignore any possibility of
suitable fora outside of Usenet. If you allow yourself to think only a
little outside of the box you will find that the `Unicode Email
Distribution Lists'...:

- facilitate a "Discussion list for Unicode and general
internationalization issues";

- have "About 750 members world-wide" and "discuss such subjects as:
implementing the Unicode Standard, discussion of new proposals,
etc."

where:

"Everybody is welcome to join the public email list to pose questions
to the community of Unicode users."

Or... you /could/ post your question to microsoft.public.vc.mfc, a group
for people specifically programming in the Microsoft Visual C++
environment and focussing specifically on the Microsoft Foundation
Classes library components, or ... you /could/ post your question to
comp.lang.c++, a group for people learning or programming in the C++
language who want to ask questions about `the C++ programming
language' (note: *not* about particular things that people might be
doing in particular areas *with* the language, but *about* the language
itself).

Regards

Paul Bibbings
