From: Oliver Regenfelder on
Hello,

Peter Olcott wrote:
>> Peter Olcott wrote:
>>> Yes and quite often with zero percent accuracy at screen resolutions.
>>> The most accurate alternative system scored about 25% accuracy on the
>>> sample image and was 872-fold slower.
>>
>> You're sure it wasn't 872.3-fold slower?
>>
>> Best regards,
>>
>> Oliver
>
> Mine took 0.047 seconds using clock(); theirs took 41 seconds +- 1
> second using my wrist watch.

41 / 0.047 = 872.34
^

See, my crystal ball is in perfect working condition.

Seriously, if you take +-1 millisecond for clock() (and typically
clock() is less accurate) and +-1 second for your wrist watch, then the
ratio is anywhere from 833 to 913.

Although I understand that it wouldn't be Olcott-style, you would be
better off just saying "around 800 times faster".

Best regards,

Oliver
From: Peter Olcott on
On 5/21/2010 7:26 AM, Oliver Regenfelder wrote:
> Hello,
>
> Peter Olcott wrote:
>>> Peter Olcott wrote:
>>>> Yes and quite often with zero percent accuracy at screen resolutions.
>>>> The most accurate alternative system scored about 25% accuracy on the
>>>> sample image and was 872-fold slower.
>>>
>>> You're sure it wasn't 872.3-fold slower?
>>>
>>> Best regards,
>>>
>>> Oliver
>>
>> Mine took 0.047 seconds using clock(); theirs took 41 seconds +- 1
>> second using my wrist watch.
>
> 41 / 0.047 = 872.34
> ^
>
> See, my crystal ball is in perfect working condition.
>
> Seriously, if you take +-1 millisecond for clock() (and typically
> clock() is less accurate) and +-1 second for your wrist watch, then the
> ratio is anywhere from 833 to 913.
>
> Although I understand that it wouldn't be Olcott-style, you would be
> better off just saying "around 800 times faster".
>
> Best regards,
>
> Oliver

If I did that I would use normal rounding rules and round up to 900,
and I would be overstating the results. That is why I say 872-fold
faster.
From: Joseph M. Newcomer on
Thanks. It is interesting to compare the results of sorting (in my Locale Explorer)
according to the various rules, such as German Telephone Book sorting (raw sorting, such
as you might get with strcmp, is not correct, because it is based on the old ASCII-7
translation, where ÄÖÜ followed Z because in German ASCII-7, what we recognize as [\]
turned into ÄÖÜ. And it is even worse if you allow 8859-1 (the old ISO Latin-1), because
under strcmp the accented characters all sort after the unaccented ones). This can easily
be seen in my locale explorer by going to the CompareString page, and selecting "All
characters", selecting the locale, and selecting the sorting method. Note that for
_tcscmp I do not use other than the "C" locale.
joe

On Thu, 20 May 2010 23:08:38 -0700, "Mihai N." <nmihai_year_2000(a)yahoo.com> wrote:

>> Note that they do refer to ISO 10646 (see the footnote on page 19 of
>> the draft standard (30) which has already been cited in this thread).
>
>This is what I was alluding to when I wrote "changing lately." (although not
>technically Unicode, ISO 10646 is a good subset, kept in sync with Unicode
>pretty well)
>(by "subset" I don't mean fewer characters encoded, but "some parts missing",
>like all the character properties, and all the UTS-es)
>
>
>> The key document is referenced as
>> http://www.rfc-editor.org/rfc/bcp/bcp47.txt which is
>> actually RFC5646. This is a lengthy document but worth reading.
>
>And that is a bad thing, because RFC5646 is a way to tag languages, not
>locales.
>
>In most cases there is no difference, but with UTS-35 you can say:
> de-DE@collation=phonebook (German-Germany with phonebook sorting)
> or ar@calendar=islamic (Arabic with Islamic calendar)
> or even ja-JP@calendar=japanese;numbers=jpanfin (Japanese-Japan,
> using the Japanese imperial calendar and Japanese financial numerals)
>
>That's something you can't do with RFC5646 (in fact the RFC says
> "For systems and APIs, language tags form the basis for most
> implementations of locale identifiers." and they send you to UTS-35
> as an example)
Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
From: Joseph M. Newcomer on
See below...
On Thu, 20 May 2010 20:20:24 -0500, Peter Olcott <NoSpam(a)OCR4Screen.com> wrote:

>On 5/20/2010 7:31 PM, Joseph M. Newcomer wrote:
>> Actually, cache locality is only ONE of the parameters. Instruction pipe depth, and as
>> pointed out, pipe flushing; speculative execution (such as the x86s and many other
>> architectures do very well), dynamic register renaming, L2 cache vs. L1 cache, operand
>> prefetching, the depth of the operand lookahead pipe, etc. all come into play. In the
>> case of the x86, these vary widely across families of chips; lower-power (e.g., laptop)
>> chips generally have fewer of these features than server-oriented chipsets (e.g., high-end
>> Xeon, and i9). All these features involve more transistors, and higher clock speeds, and
>> both of these translate into higher power requirements. Little factors like TLB
>> collisions and TLB flush rates can change performance by integer multipliers, not just
>> single-digit percentages. The effects of network traffic and other kernel activities,
>> which impact the pipelines, TLB, caches, etc. can be quite disruptive to pretty models of
>> behavior, even if you manage to model precisely what is going on in the abstract chip set.
>> I've seen my desktop report processing 1K interrupts/second, so 1K times per second my
>> idealized model of cache management gets scrambled by code I have no control over. This
>> is reality. This is why NOTHING matters except MEASURED performance.
>
>This is a gross over-exaggeration. I once had a non-techie boss who
>wrote a program that read his data from disk fifty times because there
>were fifty different kinds of data. It could easily be known in advance
>that there is a much better way to do this.
****
But since you didn't state what the requirement was, you don't know that this
implementation didn't meet the requirement. Sure, performance sucked, but you seem to
keep confusing design with implementation. Design sets goals; implementation realizes
them.
*****
>
>A more accurate statement might be something like unmeasured performance
>estimates are most often very inaccurate. It is also probably true that
>faster methods can often be discerned from much slower (at least an
>order of magnitude) methods without measurement.
****
You are still confusing design and implementation.
****
>
>Because I am so fanatical about optimization, and I have done some
>further investigation, I am still confident that my UTF-8 recognizer has
>the fastest possible design. I would agree with you that this statement
>really doesn't count until proven with working code.
>
>> Not theoretical
>> performance, not performance under some "ideal conditions" model, but performance
>> predicted by counting instructions or guessing at memory delays, but ACTUAL, MEASURED
>> performance.
>>
>> This is why the only measure of performance is actual execution, and your numbers are
>> valid ONLY on the machine and under the conditions you measure them with, and do not
>> necessarily predict good performance on a different CPU model or different motherboard
>> chipset.
>> joe
Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
From: Peter Olcott on
On 5/21/2010 2:11 PM, Joseph M. Newcomer wrote:
> See below...
> On Thu, 20 May 2010 20:20:24 -0500, Peter Olcott<NoSpam(a)OCR4Screen.com> wrote:
>
>> A more accurate statement might be something like unmeasured performance
>> estimates are most often very inaccurate. It is also probably true that
>> faster methods can often be discerned from much slower (at least an
>> order of magnitude) methods without measurement.
> ****
> You are still confusing design and implementation.
> ****

The way that I do design, I start with broad goals that I want to
achieve and end up with nearly correct code as my most detailed level of
design. I progress from the broad goals through many levels of a
hierarchy of increasing specificity.

So I am not confusing design with implementation: implementation is the
most detailed level of design within a continuum of increasing
specificity, from broad goals to working code.

Only about 3% of my time is spent on debugging, with another 2% on
testing. The quickest way to complete any very complex system is to slow
down and carefully plan every single step.