New utf8string design may make UTF-8 the superior encoding [MFC]

Prev: UTF-8 string in MBCS project
Next: Love Potion for Miss Blandish

From: Joseph M. Newcomer on 19 May 2010 11:15

You just don't get it, do you? NOTHING can beat the performance of a state transition
matrix, because Peter says this must be true! And who are the rest of us poor people to
question him?
joe

On Wed, 19 May 2010 15:32:50 +0100, "Leigh Johnston" <leigh(a)i42.co.uk> wrote:

>
>
>"Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote in message
>news:o86dnTxKTd3Xb27WnZ2dnUVZ_g6dnZ2d(a)giganews.com...
>> On 5/19/2010 12:39 AM, Pete Delgado wrote:
>>> "Peter Olcott"<NoSpam(a)OCR4Screen.com> wrote in message
>>> news:hcednaq6ks5ml2_WnZ2dnUVZ_tqdnZ2d(a)giganews.com...
>>>>
>>>> The finite state machine's detailed design is now completed. Its state
>>>> transition matrix only takes 2048 bytes. It will be faster than any
>>>> other
>>>> possible method.
>>>
>>> So once again you find yourself with a *design* that is complete but you
>>> have not done any *coding*? Yet you claim that it will be faster than any
>>> other possible method?
>>>
>>> Is anyone else noticing a pattern here??? (no pun intended...)
>>>
>>> -Pete
>>>
>>>
>>
>> The code will be complete within a week. Also most of the coding is
>> complete for my major components.
>>
>> I would think that it would be self-evident to all who really understand
>> deterministic finite automatons that nothing can beat the speed of a state
>> transition matrix.
>
>What if such a matrix does not fit into L1 cache?
>
>/Leigh
Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm

From: Joseph M. Newcomer on 19 May 2010 11:25

Actually, sorting is not the business of the GUI; in fact, I could write a GUI-free
application that is supposed to produce, say, a phone book or a list of authors, and it
shold do the right thing for given locales (one locale that is sadly missing is "ALA
sorting", the sort order specified by the American Library Association, for sorting
accented characters in English-based U.S. libraries; I've had to simulate ALA rules, and
it is ugly). Sort order is a locale-specifc abstraction, and Windows provides an API,
CompareString, that gives the relationship of two strings. CompareStringW will apply
sorting rules even for Chinese characters so the sort in the correct order, not the order
of their codepoints.

Essentially, you really mean to say that sort order is a *presentation* issue, and it
doesn't matter if the presentation is in a GUI or a printed listing or a
properly-formatted text file which will be examined by some other means later (printed,
opened in NotePad, etc.). While a GUI is one form of presentation, it is by no means the
only form. I was implementing ALA rules back in 1972 in a batch-oriented non-GUI database
application (I may have been one of the earliest examples of a publish-from-database
application). It also used ASCII-7, and we had encodings of multibyte sequences for the
international characters. I did a similar application using ISO-8859-1 back in 1990. At
no point was a GUI involved.
joe

On Wed, 19 May 2010 01:50:55 -0700 (PDT), �� Tiib <ootiib(a)hot.ee> wrote:

>On May 19, 8:24�am, "Mihai N." <nmihai_year_2...(a)yahoo.com> wrote:
>> > I perhaps have too low experience with sophisticated text processing.
>> > Simple std::sort(), wide char literals of C++ and boost::wformat plus
>> > full set of conversion functions is all i need really.
>>
>> It depends a lot what you need.
>>
>> Sorting is locale-sensitive (German, Swedish, French, Spanish, all
>> have different sorting rules).
>> The CRT (and STL, and boost) are pretty dumb when dealing with things
>> in a locale sensitive way (meaning that they usualy don't :-)
>
>Yes, sorting in real alphabetic order for user is perhaps business of
>GUI. GUI has to display it. GUI however usually has its WxStrings or
>FooStrings anyway. I hate when someone leaks these weirdos to
>application mechanics layer. Internal application logic is often best
>made totally locale-agnostic and not caring about positioning in GUI
>and if the end-users write from up to down or from right to left.
>
>So text in electronic interfaces layer are bytes, text in application
>layer are wchar_t and text in user interface layer are whatever weirdo
>rules there. If maintainer forgets to convert in interface between
>layers he gets compiler warnings or errors. That makes life easy, but
>i suspect my problems with texts are more trivial than these of some
>others.
Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm

From: Peter Olcott on 19 May 2010 11:35

On 5/19/2010 9:32 AM, Leigh Johnston wrote:
>
>
> "Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote in message
> news:o86dnTxKTd3Xb27WnZ2dnUVZ_g6dnZ2d(a)giganews.com...
>> On 5/19/2010 12:39 AM, Pete Delgado wrote:
>>> "Peter Olcott"<NoSpam(a)OCR4Screen.com> wrote in message
>>> news:hcednaq6ks5ml2_WnZ2dnUVZ_tqdnZ2d(a)giganews.com...
>>>>
>>>> The finite state machine's detailed design is now completed. Its state
>>>> transition matrix only takes 2048 bytes. It will be faster than any
>>>> other
>>>> possible method.
>>>
>>> So once again you find yourself with a *design* that is complete but you
>>> have not done any *coding*? Yet you claim that it will be faster than
>>> any
>>> other possible method?
>>>
>>> Is anyone else noticing a pattern here??? (no pun intended...)
>>>
>>> -Pete
>>>
>>>
>>
>> The code will be complete within a week. Also most of the coding is
>> complete for my major components.
>>
>> I would think that it would be self-evident to all who really
>> understand deterministic finite automatons that nothing can beat the
>> speed of a state transition matrix.
>
> What if such a matrix does not fit into L1 cache?
>
> /Leigh
(1) 2048 bytes should fit, (of the UTF-8 recognizer) even with other
stuff there too.

(2) As far as I am aware if L1 exists then there is always L2 and/or L3

I can currently recognize a whole screen's worth of text using my
original OCR4Screen technology in 1/10 second on a 2.4 Ghz Celeron and
it uses a state transition matrix too far large to fit into any cache.

By taking Joe Newcomer's teaching about cache to heart, I upgraded my
machine to much faster RAM and larger cache and got huge improvement in
speed. Also the redesigned algorithm uses so much less memory that the
whole DFA will most often fit within cache.

From: Peter Olcott on 19 May 2010 11:36

On 5/19/2010 9:52 AM, �� Tiib wrote:
> On May 19, 5:14 pm, Peter Olcott<NoS...(a)OCR4Screen.com> wrote:
>> On 5/19/2010 12:39 AM, Pete Delgado wrote:
>>
>> I would think that it would be self-evident to all who really understand
>> deterministic finite automatons that nothing can beat the speed of a
>> state transition matrix.
>
> Rest assured, there are no silver bullets in software industry.
> Fortunately. Otherwise there was nothing to do soon.
>
> Large state transition matrixes cause large static data. Large static
> data often causes cache misses. Few additional comparisions in state
> transition algorithm are LOT better than frequent cache misses.

That is not what I have found in the DFAs that I have created.

From: Peter Olcott on 19 May 2010 11:37

On 5/19/2010 10:10 AM, �� Tiib wrote:
> On May 19, 5:52 pm, �� Tiib<oot...(a)hot.ee> wrote:
>> On May 19, 5:14 pm, Peter Olcott<NoS...(a)OCR4Screen.com> wrote:
>>
>>> On 5/19/2010 12:39 AM, Pete Delgado wrote:
>>
>>> I would think that it would be self-evident to all who really understand
>>> deterministic finite automatons that nothing can beat the speed of a
>>> state transition matrix.
>>
>> Rest assured, there are no silver bullets in software industry.
>> Fortunately. Otherwise there was nothing to do soon.
>>
>> Large state transition matrixes cause large static data. Large static
>> data often causes cache misses. Few additional comparisions in state
>> transition algorithm are LOT better than frequent cache misses.
>
> Also JK has already said elsewhere in this thread that UTF-8 is
> designed to be simple to handle without such matrixes. Branch
> prediction in most processors is most probably correct most of the
> time since people do not use Hebrew, Chinese and Creek languages in
> same sentence and matrixes are therefore pretty apparently just a
> waste of memory space and time.

I posted my detailed design here twice now. See if you can see anything
that could form a possible improvement.

First | Prev | Next | Last
Pages: 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
Prev: UTF-8 string in MBCS project
Next: Love Potion for Miss Blandish