Designing a Finite State Machine DFA Recognizer for UTF-8 [MFC]

From: Leigh Johnston on 19 May 2010 14:42

"Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote in message
news:P6idnX4azPfvs2nWnZ2dnUVZ_oudnZ2d(a)giganews.com...
>>
>> Whilst what you say is technically correct I try to avoid writing code
>> which does not check against an end iterator when iterating over a
>> sequence, just personal preference (due to a slight concern re safety).
>> We are probably only talking about an extra CPU instruction or two to
>> check for end of sequence in the main loop along with the O(1) check of
>> the final state when the main loop is exited. Your solution would also
>> require making a copy of the input sequence to allow appending of the
>> sentinel unless you consider mutating input parameters to be OK. My
>
> The main purpose of this is to read in a file of UTF-8 to be converted to
> UTF-32. I don't have to mutate the input at all, the user must know to
> append the 0xFF byte.

Are you for real? That sounds like a really stupid idea.

>
>> utf8_to_wide function caters for the optional mixing of valid UTF-8
>> sequence along with non UTF-8 characters in the 0x80-0xFF range which
>> would be problematic for a 0xFF sentinel but different projects have
>> different requirements.
>
> Any UTF-8 to UTF-32 converter would not have 0xFF in its input unless the
> data is corrupted.
>

I said it depends on the project. Firstly my requirement is for conversion
to UTF-16 not UTF-32. Secondly one of my requirements is for the support of
a mixture of UTF-8 and "raw" characters which is not corruption but
real-world data.
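
A minimal sketch of that kind of lenient conversion, for illustration only. The utf8_to_wide function itself is not shown in this thread, so the names decode_one and utf8_or_raw_to_utf16 are hypothetical, and the policy used here (any byte that does not start a well-formed UTF-8 sequence is copied through as the UTF-16 code unit of the same value) is an assumption about what "raw" means. It shows why 0xFF cannot double as a sentinel under such a requirement: it is legitimate input.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: tries to decode one well-formed UTF-8 sequence
// starting at s[i]. Returns its length (1..4), or 0 if the bytes at i do
// not form a valid sequence (bad lead, truncation, overlong, surrogate).
std::size_t decode_one(const std::string& s, std::size_t i, char32_t& cp) {
    auto byte = [&](std::size_t k) { return static_cast<unsigned char>(s[k]); };
    const unsigned char b0 = byte(i);
    std::size_t len = 0;
    char32_t lowest = 0;  // smallest code point allowed for this length
    if (b0 < 0x80) { cp = b0; return 1; }
    else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; lowest = 0x80; }
    else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; lowest = 0x800; }
    else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; lowest = 0x10000; }
    else return 0;  // 0x80..0xBF or 0xF8..0xFF cannot start a sequence
    if (s.size() - i < len) return 0;  // truncated at end of input
    for (std::size_t k = 1; k < len; ++k) {
        const unsigned char b = byte(i + k);
        if ((b & 0xC0) != 0x80) return 0;  // not a continuation byte
        cp = (cp << 6) | (b & 0x3F);
    }
    if (cp < lowest || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return 0;
    return len;
}

// Assumed policy: valid UTF-8 is decoded; any other byte is passed through
// unchanged as a single UTF-16 code unit.
std::u16string utf8_or_raw_to_utf16(const std::string& in) {
    std::u16string out;
    for (std::size_t i = 0; i < in.size();) {
        char32_t cp = 0;
        const std::size_t len = decode_one(in, i, cp);
        if (len == 0) {
            out.push_back(static_cast<char16_t>(static_cast<unsigned char>(in[i])));
            ++i;
        } else {
            if (cp >= 0x10000) {  // encode as a surrogate pair
                cp -= 0x10000;
                out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
                out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
            } else {
                out.push_back(static_cast<char16_t>(cp));
            }
            i += len;
        }
    }
    return out;
}
```

Every index is checked against in.size(), so no terminator byte is reserved and a lone 0xFF simply comes out as U+00FF.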

>>
>> Please show me where it says that "swearing on the Internet is
>> unprofessional" is a universal rule?
>>
>> /Leigh
>
> It is not a universal rule, it is a very commonly accepted norm. It is not
> the swearing that really counts, it is the irate demeanor that is
> indicated by the swearing (and other things) that would be intolerable in
> the typical office setting.
>
> There may be many places where a little friendly swearing is OK. Far fewer
> places would tolerate much more than the tiniest trace of hostility before
> the hostile individual is escorted to the door.

Your recent activity in this newsgroup has been rather troll-like, resulting
in what you perceive as hostility but which is in fact an understandable
response to off-topic spam. Using tables to improve the performance of an
algorithm is not a new thing; people have been doing it for years (including
me). People have also been using table-based finite state machines for
years. You keep banging on about something which is neither particularly
interesting nor on-topic.

/Leigh

From: Peter Olcott on 19 May 2010 15:12

On 5/19/2010 1:42 PM, Leigh Johnston wrote:
>
>
> "Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote in message
> news:P6idnX4azPfvs2nWnZ2dnUVZ_oudnZ2d(a)giganews.com...
>>>
>>> Whilst what you say is technically correct I try to avoid writing code
>>> which does not check against an end iterator when iterating over a
>>> sequence, just personal preference (due to a slight concern re safety).
>>> We are probably only talking about an extra CPU instruction or two to
>>> check for end of sequence in the main loop along with the O(1) check of
>>> the final state when the main loop is exited. Your solution would also
>>> require making a copy of the input sequence to allow appending of the
>>> sentinel unless you consider mutating input parameters to be OK. My
>>
>> The main purpose of this is to read in a file of UTF-8 to be converted
>> to UTF-32. I don't have to mutate the input at all, the user must know
>> to append the 0xFF byte.
>
> Are you for real? That sounds like a really stupid idea.

The goal is to make the fastest possible validation of UTF-8 and
translation to UTF-32. Within this binding constraint there are few
options. Copying the input data is not one of them. What else does that
leave? Mutating the input and then changing it back?
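
A minimal sketch of the sentinel contract described above, under stated assumptions. The name scan_until_sentinel is hypothetical and not from this thread. To stay short it checks only lead/continuation structure (it does not reject overlong forms, surrogates, or values above U+10FFFF) and it validates without producing UTF-32. The point it illustrates is that the loop never compares against an end pointer: the 0xFF byte the caller appended is what stops it.

```cpp
#include <cstddef>

// Precondition (the contract from the post): the caller has appended a
// 0xFF byte after the last byte of real data. Without it, this function
// reads past the end of the buffer.
//
// Returns true if every byte before the first 0xFF is structurally
// well-formed UTF-8 (simplified check, see the note above).
bool scan_until_sentinel(const unsigned char* p) {
    int pending = 0;  // continuation bytes still expected
    for (;;) {
        const unsigned char b = *p++;
        if (pending > 0) {
            // 0xFF is not a continuation byte, so a sentinel in the middle
            // of a sequence is reported as truncated input.
            if ((b & 0xC0) != 0x80) return false;
            --pending;
        } else if (b < 0x80) {
            // ASCII: nothing to do
        } else if ((b & 0xE0) == 0xC0) {
            pending = 1;
        } else if ((b & 0xF0) == 0xE0) {
            pending = 2;
        } else if ((b & 0xF8) == 0xF0) {
            pending = 3;
        } else {
            // Any byte that cannot start a sequence ends the scan. Only the
            // sentinel counts as a clean end of input.
            return b == 0xFF;
        }
    }
}
```

The sentinel test sits on the exit path only, which is where the claimed saving comes from. The cost is that correctness depends entirely on the caller honouring the precondition.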

From: Leigh Johnston on 19 May 2010 15:24

"Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote in message
news:69ydnUm-AOC_pWnWnZ2dnUVZ_u2dnZ2d(a)giganews.com...
> On 5/19/2010 1:42 PM, Leigh Johnston wrote:
>>
>>
>> "Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote in message
>> news:P6idnX4azPfvs2nWnZ2dnUVZ_oudnZ2d(a)giganews.com...
>>>>
>>>> Whilst what you say is technically correct I try to avoid writing code
>>>> which does not check against an end iterator when iterating over a
>>>> sequence, just personal preference (due to a slight concern re safety).
>>>> We are probably only talking about an extra CPU instruction or two to
>>>> check for end of sequence in the main loop along with the O(1) check of
>>>> the final state when the main loop is exited. Your solution would also
>>>> require making a copy of the input sequence to allow appending of the
>>>> sentinel unless you consider mutating input parameters to be OK. My
>>>
>>> The main purpose of this is to read in a file of UTF-8 to be converted
>>> to UTF-32. I don't have to mutate the input at all, the user must know
>>> to append the 0xFF byte.
>>
>> Are you for real? That sounds like a really stupid idea.
>
> The goal is to make the fastest possible validation of UTF-8 and
> translation to UTF-32. Within this binding constraint there are few
> options. Copying the input data is not one of them. What else does that
> leave? Mutating the input and then changing it back?
>

Either you are holding the entire file in memory or performing a buffered
read; either way you can append the sentinel to the data in memory unless
you are using memory mapped I/O. The only use-case that benefits from
having a sentinel is if the input is in memory, and you have indicated this
is not a primary use-case, so why bother with a sentinel at all? When
performing file I/O, your algorithm is unlikely to be the bottleneck, sentinel
or no sentinel. As I would not use a sentinel for this, I would not have the
dilemma of mutating the input that you face, and it would work for any
use-case (input in a file, network or memory).
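
A minimal sketch of the end-checked alternative, assuming a validating DFA over raw bytes. The state names and the functions step and is_valid_utf8 are hypothetical and not taken from this thread. The transition logic is written as a switch for readability; the same transitions could be precomputed into a lookup table.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical state names for a UTF-8 validating DFA.
enum class State : std::uint8_t {
    Accept,   // at a code point boundary
    Tail1,    // expect one more continuation byte 0x80..0xBF
    Tail2,    // expect two more continuation bytes
    Tail3,    // expect three more continuation bytes
    AfterE0,  // second byte must be 0xA0..0xBF (rejects overlong forms)
    AfterED,  // second byte must be 0x80..0x9F (rejects surrogates)
    AfterF0,  // second byte must be 0x90..0xBF (rejects overlong forms)
    AfterF4,  // second byte must be 0x80..0x8F (rejects > U+10FFFF)
    Reject
};

inline bool in_range(std::uint8_t b, std::uint8_t lo, std::uint8_t hi) {
    return b >= lo && b <= hi;
}

// One DFA transition: current state plus input byte gives the next state.
State step(State s, std::uint8_t b) {
    switch (s) {
    case State::Accept:
        if (b <= 0x7F) return State::Accept;
        if (in_range(b, 0xC2, 0xDF)) return State::Tail1;
        if (b == 0xE0) return State::AfterE0;
        if (b == 0xED) return State::AfterED;
        if (in_range(b, 0xE1, 0xEF)) return State::Tail2;  // 0xED handled above
        if (b == 0xF0) return State::AfterF0;
        if (in_range(b, 0xF1, 0xF3)) return State::Tail3;
        if (b == 0xF4) return State::AfterF4;
        return State::Reject;  // 0x80..0xC1 and 0xF5..0xFF
    case State::Tail1:   return in_range(b, 0x80, 0xBF) ? State::Accept : State::Reject;
    case State::Tail2:   return in_range(b, 0x80, 0xBF) ? State::Tail1 : State::Reject;
    case State::Tail3:   return in_range(b, 0x80, 0xBF) ? State::Tail2 : State::Reject;
    case State::AfterE0: return in_range(b, 0xA0, 0xBF) ? State::Tail1 : State::Reject;
    case State::AfterED: return in_range(b, 0x80, 0x9F) ? State::Tail1 : State::Reject;
    case State::AfterF0: return in_range(b, 0x90, 0xBF) ? State::Tail2 : State::Reject;
    case State::AfterF4: return in_range(b, 0x80, 0x8F) ? State::Tail2 : State::Reject;
    default:             return State::Reject;
    }
}

// The main loop checks the end iterator on every byte; the O(1) check of
// the final state catches a sequence truncated by the end of input.
bool is_valid_utf8(const unsigned char* first, const unsigned char* last) {
    State s = State::Accept;
    for (; first != last && s != State::Reject; ++first)
        s = step(s, *first);
    return s == State::Accept;
}
```

Because the range is passed as a pair of pointers, the same function works on a whole file held in memory, on one buffer of a buffered read, or on a memory-mapped file, and the input is never modified.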

/Leigh

From: Leigh Johnston on 19 May 2010 15:28

"Leigh Johnston" <leigh(a)i42.co.uk> wrote in message
news:r-GdnWFg9qxjp2nWnZ2dnUVZ8oOdnZ2d(a)giganews.com...
>
>
> "Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote in message
> news:69ydnUm-AOC_pWnWnZ2dnUVZ_u2dnZ2d(a)giganews.com...
>> On 5/19/2010 1:42 PM, Leigh Johnston wrote:
>>>
>>>
>>> "Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote in message
>>> news:P6idnX4azPfvs2nWnZ2dnUVZ_oudnZ2d(a)giganews.com...
>>>>>
>>>>> Whilst what you say is technically correct I try to avoid writing code
>>>>> which does not check against an end iterator when iterating over a
>>>>> sequence, just personal preference (due to a slight concern re
>>>>> safety).
>>>>> We are probably only talking about an extra CPU instruction or two to
>>>>> check for end of sequence in the main loop along with the O(1) check
>>>>> of
>>>>> the final state when the main loop is exited. Your solution would also
>>>>> require making a copy of the input sequence to allow appending of the
>>>>> sentinel unless you consider mutating input parameters to be OK. My
>>>>
>>>> The main purpose of this is to read in a file of UTF-8 to be converted
>>>> to UTF-32. I don't have to mutate the input at all, the user must know
>>>> to append the 0xFF byte.
>>>
>>> Are you for real? That sounds like a really stupid idea.
>>
>> The goal is to make the fastest possible validation of UTF-8 and
>> translation to UTF-32. Within this binding constraint there are few
>> options. Copying the input data is not one of them. What else does that
>> leave? Mutating the input and then changing it back?
>>
>
> Either you are holding the entire file in memory or performing a buffered
> read; either way you can append the sentinel to the data in memory unless
> you are using memory mapped I/O. The only use-case that benefits from
> having a sentinel is if the input is in memory, and you have indicated this
> is not a primary use-case, so why bother with a sentinel at all? When
> performing file I/O, your algorithm is unlikely to be the bottleneck,
> sentinel or no sentinel. As I would not use a sentinel for this, I would
> not have the dilemma of mutating the input that you face, and it would work
> for any use-case (input in a file, network or memory).
>
> /Leigh

Of course I meant "memory-mapped file" rather than "memory mapped I/O".

/Leigh

From: Paul Bibbings on 19 May 2010 15:40

Peter Olcott <NoSpam(a)OCR4Screen.com> writes:

> The goal is to make the fastest possible validation of UTF-8 and
> translation to UTF-32. Within this binding constraint there are few
> options. Copying the input data is not one of them. What else does
> that leave? Mutating the input and then changing it back?

Given what we have had to endure already, and given what can easily be
discovered about the OP from a simple Google `"Peter Olcott" troll', I
was about to suggest that, if any such further provocative question as
the above actually generated *any* hint of an attempted answer, then I
would happily believe that anything at all is possible on Usenet. I see
that I was too late.

Bring on the pixies/unicorns/whatever.
