From: Hector Santos on
On May 19, 1:39 am, "Pete Delgado" <Peter.Delg...(a)NoSpam.com> wrote:
> "Peter Olcott" <NoS...(a)OCR4Screen.com> wrote in message
>
> news:hcednaq6ks5ml2_WnZ2dnUVZ_tqdnZ2d(a)giganews.com...
>
>
>
> > The finite state machine's detailed design is now completed. Its state
> > transition matrix only takes 2048 bytes. It will be faster than any other
> > possible method.
>
> So once again you find yourself with a *design* that is complete but you
> have not done any *coding*? Yet you claim that it will be faster than any
> other possible method?
>
> Is anyone else noticing a pattern here??? (no pun intended...)
>
> -Pete

Yup, same old pathetic fantasy rhetorical claims by Peter "The Troll"
Olcott.

--
HLS

From: James Kanze on
On May 18, 8:17 pm, Peter Olcott <NoS...(a)OCR4Screen.com> wrote:
> On 5/18/2010 9:34 AM, James Kanze wrote:
> > On 17 May, 14:08, Peter Olcott<NoS...(a)OCR4Screen.com> wrote:
> >> On 5/17/2010 1:35 AM, Mihai N. wrote:

> >> a regular expression implemented as a finite state machine
> >> is the fastest and simplest possible way of every way that
> >> can possibly exist to validate a UTF-8 sequence and divide
> >> it into its constituent parts.

> > It all depends on the formal specification; one of the
> > characteristics of UTF-8 is that you don't have to look at
> > every character to find the length of a sequence. And
> > a regular expression generally will have to look at every
> > character.

> Validation and translation to UTF-32 concurrently can not be
> done faster than a DFA recognizer, in fact it must always be
> slower.

UTF-8 was intentionally designed so that it doesn't require
a complete DFA to handle, and can be handled faster without one.
Complete DFAs are usually slower than calculations on modern
processors, since they require memory accesses, and memory is
often the limiting factor.

In fact, there is no "must always be slower". There are too
many variables involved to be able to make such statements.
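
To illustrate the "calculation" alternative: a minimal sketch, assuming
input that has already been validated, of finding sequence lengths from
the lead byte alone. The names utf8_sequence_length and
count_code_points are made up for this example; they are not from any
library or from the code discussed in this thread.

```cpp
#include <cstddef>
#include <string>

// Length in bytes of the UTF-8 sequence introduced by `lead`, or 0 if
// `lead` cannot start a sequence (continuation byte or invalid lead).
inline std::size_t utf8_sequence_length(unsigned char lead) {
    if (lead < 0x80) return 1;                    // 0xxxxxxx
    if (lead >= 0xC2 && lead <= 0xDF) return 2;   // 110xxxxx (C0, C1 are overlong)
    if ((lead & 0xF0) == 0xE0) return 3;          // 1110xxxx
    if (lead >= 0xF0 && lead <= 0xF4) return 4;   // 11110xxx, up to U+10FFFF
    return 0;
}

// Counts code points by hopping from lead byte to lead byte.
// Continuation bytes are never examined, which is only safe because
// the input is assumed to be valid UTF-8 already.
inline std::size_t count_code_points(const std::string& s) {
    std::size_t count = 0;
    std::size_t i = 0;
    while (i < s.size()) {
        std::size_t n = utf8_sequence_length(static_cast<unsigned char>(s[i]));
        if (n == 0) n = 1;  // defensive: step over an unexpected byte
        i += n;
        ++count;
    }
    return count;
}
```

Whether this beats a table lookup on a given machine is exactly the
kind of thing that has to be measured rather than assumed.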

--
James Kanze

From: Öö Tiib on
On May 19, 1:21 pm, James Kanze <james.ka...(a)gmail.com> wrote:
> On May 19, 12:01 am, Öö Tiib <oot...(a)hot.ee> wrote:
>
> > On 18 mai, 17:18, James Kanze <james.ka...(a)gmail.com> wrote:
>
>     [...]
>
> > > But the trade-offs only concern internal representation.
> > > Externally, the world is 8 bits, and UTF-8 is the only solution.
> > I would be honestly extremely glad if it was the only solution. Real
> > life applications throw in texts in all possible forms also they await
> > responses in all possible forms.
>
> Yes.  I meant it is the only solution if you are choosing
> yourself.  In practice, there are a lot of other solutions being
> used; they don't work, except in limited environments, but they
> are being widely used.
>
> > For example texts in financial transactions done in most
> > Northern Europe assume that  "/\{}[]" means something like
> > "ÄäÅåÖö" (i do not remember correct order, but something like
> > that).
> > I prefer to convert incoming texts into std::wstring. Outgoing
> > texts i convert back to whatever they await (UTF-8 is really
> > relaxing news there, true). All what i need is a set of
> > conversion functions. If it is going to user interface then
> > std::wstring goes and it is business of UI to convert it
> > further into CString or QString or whatever they enjoy there
> > and sort it out for user.
>
> In theory, the conversion should take place in the filebuf,
> using the imbued locale.

Yes, with a good wfilebuf my problems simply do not exist.
In practice that is often not what you get; instead there are strange
protocol layers and security by obscurity.

> > I perhaps have too low experience with sophisticated text processing.
> > Simple std::sort(), wide char literals of C++ and boost::wformat plus
> > full set of conversion functions is all i need really. Peter Olcott
> > raises lot of noise around it and so it makes me a bit
> > interested.  :)
>
> There can be advantages to using UTF-8 internally, as well as at
> the interface level, and if you're not doing too complicated
> things, it can work quite nicely.  But only as long as your
> manipulations aren't too complicated.

My major advantage from using wstring is that ...

Bytes are often too ambiguous as information, even if in an exceptional
case like UTF-8 the information is fully sufficient. The compiler does
not distinguish between a byte (char) in a UTF-8 string and a byte in a
string in some other encoding. wstring ensures that compilers and tools
can easily frown upon such bytes that sneak into the application layer,
whatever encoding they are in and wherever they come from. That draws
attention to the right place, for the right reason.

For example there is:
basic_fstream::basic_fstream(const char* s, ios_base::openmode mode);

If I pass the result of wstring::c_str() as parameter s to that
constructor, I get an error, so the compiler draws my attention to the
right place. If I get no error, then there is most likely an extension
to the STL that most likely works correctly. Passing the result of
string::c_str() (containing UTF-8) most likely creates a file with a
garbage-filled name.
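
The type-level distinction can be sketched without touching the file
system. This is only an illustration: narrow_name_length is a
hypothetical stand-in for any API that takes a const char* name, the way
the constructor above does.

```cpp
#include <cstddef>
#include <cstring>
#include <string>
#include <type_traits>
#include <utility>

// Hypothetical stand-in for an API that takes a narrow (char) name.
inline std::size_t narrow_name_length(const char* name) {
    return std::strlen(name);
}

// Wide text cannot silently turn into narrow text: the compiler
// rejects the call, which is the "frown" at the right place.
static_assert(!std::is_convertible<const wchar_t*, const char*>::value,
              "wchar_t* must not convert to char*");

// Any narrow string is accepted, whatever encoding its bytes are in.
// Nothing in the type says "this is UTF-8".
static_assert(std::is_convertible<
                  decltype(std::declval<const std::string&>().c_str()),
                  const char*>::value,
              "string::c_str() is always accepted");
```

So narrow_name_length(std::wstring(L"x").c_str()) fails to compile,
while a std::string holding UTF-8, Latin-1 or anything else goes
through unchecked.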

From: Peter Olcott on
On 5/19/2010 12:39 AM, Pete Delgado wrote:
> "Peter Olcott"<NoSpam(a)OCR4Screen.com> wrote in message
> news:hcednaq6ks5ml2_WnZ2dnUVZ_tqdnZ2d(a)giganews.com...
>>
>> The finite state machine's detailed design is now completed. Its state
>> transition matrix only takes 2048 bytes. It will be faster than any other
>> possible method.
>
> So once again you find yourself with a *design* that is complete but you
> have not done any *coding*? Yet you claim that it will be faster than any
> other possible method?
>
> Is anyone else noticing a pattern here??? (no pun intended...)
>
> -Pete
>
>

The code will be complete within a week. Also most of the coding is
complete for my major components.

I would think that it would be self-evident to all who really understand
deterministic finite automatons that nothing can beat the speed of a
state transition matrix.

From: Leigh Johnston on


"Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote in message
news:o86dnTxKTd3Xb27WnZ2dnUVZ_g6dnZ2d(a)giganews.com...
> On 5/19/2010 12:39 AM, Pete Delgado wrote:
>> "Peter Olcott"<NoSpam(a)OCR4Screen.com> wrote in message
>> news:hcednaq6ks5ml2_WnZ2dnUVZ_tqdnZ2d(a)giganews.com...
>>>
>>> The finite state machine's detailed design is now completed. Its state
>>> transition matrix only takes 2048 bytes. It will be faster than any
>>> other
>>> possible method.
>>
>> So once again you find yourself with a *design* that is complete but you
>> have not done any *coding*? Yet you claim that it will be faster than any
>> other possible method?
>>
>> Is anyone else noticing a pattern here??? (no pun intended...)
>>
>> -Pete
>>
>>
>
> The code will be complete within a week. Also most of the coding is
> complete for my major components.
>
> I would think that it would be self-evident to all who really understand
> deterministic finite automatons that nothing can beat the speed of a state
> transition matrix.

What if such a matrix does not fit into L1 cache?
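
To put a number on it, here is a minimal sketch of a table-driven
UTF-8 validator. It is not the design quoted above, which has not been
posted; the state names, build_table and is_valid_utf8 are assumptions
made for illustration. It uses one 256-entry row per state with no
byte-class compression, so nine states cost 9 * 256 = 2304 bytes.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <string>

// States of a validator for well-formed UTF-8 (no overlongs, no
// surrogates, nothing above U+10FFFF). Names are illustrative.
enum State : std::uint8_t {
    Accept, Reject,
    Cont1,    // one continuation byte still expected
    Cont2,    // two continuation bytes still expected
    AfterE0,  // next byte must be A0..BF (rejects overlong 3-byte forms)
    AfterED,  // next byte must be 80..9F (rejects surrogates)
    Cont3,    // three continuation bytes still expected
    AfterF0,  // next byte must be 90..BF (rejects overlong 4-byte forms)
    AfterF4,  // next byte must be 80..8F (rejects > U+10FFFF)
    StateCount
};

using Table = std::array<std::array<std::uint8_t, 256>, StateCount>;

inline void set_range(Table& t, State from, int lo, int hi, State to) {
    for (int b = lo; b <= hi; ++b) t[from][b] = to;
}

inline Table build_table() {
    Table t;
    for (auto& row : t) row.fill(Reject);  // anything not listed is an error
    set_range(t, Accept,  0x00, 0x7F, Accept);
    set_range(t, Accept,  0xC2, 0xDF, Cont1);
    set_range(t, Accept,  0xE0, 0xE0, AfterE0);
    set_range(t, Accept,  0xE1, 0xEC, Cont2);
    set_range(t, Accept,  0xED, 0xED, AfterED);
    set_range(t, Accept,  0xEE, 0xEF, Cont2);
    set_range(t, Accept,  0xF0, 0xF0, AfterF0);
    set_range(t, Accept,  0xF1, 0xF3, Cont3);
    set_range(t, Accept,  0xF4, 0xF4, AfterF4);
    set_range(t, Cont1,   0x80, 0xBF, Accept);
    set_range(t, Cont2,   0x80, 0xBF, Cont1);
    set_range(t, AfterE0, 0xA0, 0xBF, Cont1);
    set_range(t, AfterED, 0x80, 0x9F, Cont1);
    set_range(t, Cont3,   0x80, 0xBF, Cont2);
    set_range(t, AfterF0, 0x90, 0xBF, Cont2);
    set_range(t, AfterF4, 0x80, 0x8F, Cont2);
    return t;
}

// One dependent memory load per input byte: the next lookup cannot
// start until the previous state is known.
inline bool is_valid_utf8(const std::string& s) {
    static const Table table = build_table();
    std::uint8_t state = Accept;
    for (unsigned char b : s) state = table[state][b];
    return state == Accept;
}
```

A table of this size is small next to a typical 32 KiB L1 data cache.
The question starts to bite once the table grows (more states, wider
entries, actions stored per transition) or has to share the cache with
the rest of the program's working set.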

/Leigh