Designing a Finite State Machine DFA Recognizer for UTF-8 [MFC]


From: Joseph M. Newcomer on 20 May 2010 13:49

See below...
On Thu, 20 May 2010 11:43:51 -0500, Peter Olcott <NoSpam(a)OCR4Screen.com> wrote:

>On 5/20/2010 11:33 AM, Leigh Johnston wrote:
>>
>>
>> "Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote in message
>> news:wtadnXf96NMo_2jWnZ2dnUVZ_r6dnZ2d(a)giganews.com...
>>> On 5/20/2010 11:11 AM, James Kanze wrote:
>>>> On May 19, 7:31 pm, Peter Olcott<NoS...(a)OCR4Screen.com> wrote:
>>>>> On 5/19/2010 1:00 PM, Leigh Johnston wrote:
>>>>
>>>> [...]
>>>>> The main purpose of this is to read in a file of UTF-8 to be converted
>>>>> to UTF-32. I don't have to mutate the input at all, the user must know
>>>>> to append the 0xFF byte.
>>>>
>>>> In the file?
>>>>
>>>> [...]
>>>>> Any UTF-8 to UTF-32 converter would not have 0xFF in its input unless
>>>>> the data is corrupted.
>>>>
>>>> Am I the only one who senses a problem here? If you're reading
>>>> from an external source (a file), then you have to assume that
>>>> the file might contain anything; people do pass in the wrong
>>>> filename, and your program has to handle that gracefully.
>>>> (Error message, etc.)
>>>>
>>>> --
>>>> James Kanze
>>>
>>> I must be validating UTF-8 as well as converting it to UTF-32. Only a
>>> DFA can do this very quickly.
>>
>> You didn't respond to JK's point. If you require the file to contain
>> 0xFF as the last byte
>
>I do not require this. Probably the best tradeoff of the various design
>alternatives keeping maximum speed as the binding constraint is that the
>user passes me a mutable std::vector<unsigned char>. My code both
>appends and then later removes the required 0xFF.
****
But this is also a stupid design, because it tends to impose the requirement of a complete
copy of the string. For someone who is as compulsive about performance as you are,
requiring a string copy seems remarkably silly. Particularly because it can be done at
ZERO cost when the file is read. But I will leave this complex and subtle technique as an
Exercise For The Reader.
****
>
>I could also provide an overloaded immutable function that is slower
>because it must copy all of the data.
****
What are mutable and immutable functions? Perhaps you mean functions with const and
non-const arguments? If so, please use the proper technical terms. A const function is
NOT the same as a function that takes a const argument, and the technical term "mutable
function" does not exist in the C or C++ language. In the new C++ standard, the qualifier
"mutable" can be used for data members of a const declaration and for lambda functions
which are not const (implicitly, a lambda function is declared as a const function, so if
it is not there must be a declaration to say it is not). Data (including iterators) can
be specified as mutable, functions do not have this description except for lambda
expressions. But then, I took five minutes to search the C++ standard draft (30). I did
not invent a new technical term that represents something that is undefined.

Note that there is no need to ever copy the input string, because you can impose the
requirement that the input string already have the sentinel appended, and you can append
it at ZERO cost when you read the data from the file, so there is no need for a function
with a non-const parameter.

Also, where in the world did you get the idea that if you pass a non-const string that an
append does not perform a copy? In fact, it frequently will. Nowhere does the
std::string promise a copy is not required!
joe
****
>
> > then if a wrong file is given by mistake your
>> algorithm will perform a buffer overrun as you only rely on the sentinel
>> to check for end. This is a crash waiting to happen. Better to not rely
>> on a sentinel at all and check if end has been reached each iteration,
>> we are only talking about an extra CPU instruction per iteration
>> (compare and conditional jump versus unconditional jump).
>>
>> /Leigh
Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
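
Leigh's suggestion above (test for the end of the buffer on every iteration instead of trusting a sentinel) can be sketched as follows. This is only an illustration under stated assumptions: the function name `utf8_to_utf32` and its signature are hypothetical, and the code is a plain branch-based decoder, not the table-driven DFA discussed in the thread. Because it never reads past `in.size()`, it needs no appended 0xFF or 0x00, takes its input by const reference, and tolerates an embedded 0x00 byte (which standard UTF-8 uses to encode U+0000).

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not from the thread): decodes UTF-8 into UTF-32.
// Returns false at the first ill-formed sequence; `out` then holds the
// code points decoded before the error.
bool utf8_to_utf32(const std::vector<unsigned char>& in,
                   std::vector<std::uint32_t>& out)
{
    out.clear();
    const std::size_t n = in.size();
    std::size_t i = 0;
    while (i < n) {                          // explicit end check, no sentinel
        const unsigned char b0 = in[i];
        std::uint32_t cp;                    // code point being assembled
        std::size_t len;                     // total bytes in this sequence
        std::uint32_t min_cp;                // smallest value allowed for this length
        if (b0 < 0x80)                { cp = b0;        len = 1; min_cp = 0; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; len = 2; min_cp = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; len = 3; min_cp = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; len = 4; min_cp = 0x10000; }
        else return false;                   // stray continuation byte or 0xF8..0xFF

        if (n - i < len) return false;       // sequence truncated by end of input
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = in[i + k];
            if ((b & 0xC0) != 0x80) return false;   // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < min_cp) return false;                    // overlong encoding
        if (cp > 0x10FFFF) return false;                  // beyond Unicode range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;   // UTF-16 surrogate
        out.push_back(cp);
        i += len;
    }
    return true;
}
```

A wrong or corrupted file then produces a `false` return rather than a read past the end of the buffer, which is the failure mode James Kanze and Leigh Johnston were warning about.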

From: Peter Olcott on 20 May 2010 13:55

On 5/20/2010 12:30 PM, Joseph M. Newcomer wrote:
> See below...
> On Thu, 20 May 2010 17:33:53 +0100, "Leigh Johnston"<leigh(a)i42.co.uk> wrote:
>
>>
>>
>> "Peter Olcott"<NoSpam(a)OCR4Screen.com> wrote in message
>> news:wtadnXf96NMo_2jWnZ2dnUVZ_r6dnZ2d(a)giganews.com...
>>> On 5/20/2010 11:11 AM, James Kanze wrote:
>>>> On May 19, 7:31 pm, Peter Olcott<NoS...(a)OCR4Screen.com> wrote:
>>>>> On 5/19/2010 1:00 PM, Leigh Johnston wrote:
>>>>
>>>> [...]
>>>>> The main purpose of this is to read in a file of UTF-8 to be converted
>>>>> to UTF-32. I don't have to mutate the input at all, the user must know
>>>>> to append the 0xFF byte.
>>>>
>>>> In the file?
>>>>
>>>> [...]
>>>>> Any UTF-8 to UTF-32 converter would not have 0xFF in its input unless
>>>>> the data is corrupted.
> ****
> This was one of the most stupid ideas I have heard proposed in a long time.
>
> Note that every UTF-8 string will be terminated with \x00 (NUL) if it is a canonical
> representation.

I did not know that. Ignorance is not at all the same thing as
stupidity. ASCII zero was one of two alternatives that I originally
provided for my sentinel character. I changed this to 0xFF because I
thought that it might be possible to have more than one embedded ASCII
zeros in the input buffer.

Now that I know it is a [canonical representation] that makes this whole
aspect trivial.

Apparently you are referring to something like
unsigned char* Data
as an input parameter.

If the input is actually
std::vector<unsigned char> data
then can I still expect it to be NULL terminated?

If not then it must be a mutable parameter so that I can append the
required sentinel ASCII Zero.

> The typical way to read a file in Windows is to simply allocate a buffer
> of filesize+sizeof(WCHAR), read in the entire contents of the file, then, given the
> number of bytes read, append two \x00 bytes (which will be one NUL character if it is a
> UTF-16 encoding) to the buffer.

Great, that is much simpler.

> Then you can look for a BOM; if one is found, then you
> adjust the start point to be just past the BOM; if it is UTF-16BE, on Windows you then run
> through and swap the bytes of each UTF-16 character before working with the data. If it
> is UTF-8, then you treat it as UTF-8 for whatever reason you want UTF-8; if it is
> UTF-16LE, then you treat it as Windows' native UTF-16 encoding and do with it what you
> want. But because two \x00 bytes have been appended, it is already a NUL-terminated
> string. This is not Rocket Science, and it does not impose on the end user the need to
> insert a non-standard character at the end of the file. What, exactly, is the problem
> that appending a \xFF to the file solves that appending a \x00 byte after the file is read
> does not?
>
> Note this algorithm can be generalized to support the possibility of UTF-32LE and UTF-32BE
> input files. But I leave that generalization as an Exercise For The Reader.
>
> Requiring the user put some weird character at the end of the file is just a stupid
> design. No sane designer (let alone a superb designer) would impose such an unbelievably
> stupid requirement!
> joe

Not stupid at all, merely ignorant; there is a huge difference.

>
>>>>
>>>> Am I the only one who senses a problem here? If you're reading
>>>> from an external source (a file), then you have to assume that
>>>> the file might contain anything; people do pass in the wrong
>>>> filename, and your program has to handle that gracefully.
>>>> (Error message, etc.)
>>>>
>>>> --
>>>> James Kanze
>>>
>>> I must be validating UTF-8 as well as converting it to UTF-32. Only a DFA
>>> can do this very quickly.
>>
>> You didn't respond to JK's point. If you require the file to contain 0xFF
>> as the last byte then if a wrong file is given by mistake your algorithm
>> will perform a buffer overrun as you only rely on the sentinel to check for
>> end. This is a crash waiting to happen. Better to not rely on a sentinel
>> at all and check if end has been reached each iteration, we are only talking
>> about an extra CPU instruction per iteration (compare and conditional jump
>> versus unconditional jump).
>>
>> /Leigh
> Joseph M. Newcomer [MVP]
> email: newcomer(a)flounder.com
> Web: http://www.flounder.com
> MVP Tips: http://www.flounder.com/mvp_tips.htm
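
The BOM handling Joe describes in the quoted text (look for a BOM, skip past it, byte-swap UTF-16BE) might look roughly like the sketch below. The names `Encoding`, `BomResult`, `detect_bom` and `swap_utf16_bytes` are hypothetical, introduced only for illustration. The sketch covers UTF-8 and UTF-16 only; it makes no attempt at the UTF-32 generalization, so a UTF-32LE file (whose BOM starts with the same FF FE bytes) would be misreported as UTF-16LE.

```cpp
#include <cstddef>

enum class Encoding { Unknown, Utf8, Utf16LE, Utf16BE };

struct BomResult {
    Encoding encoding;   // Unknown when no BOM was recognized
    std::size_t skip;    // number of BOM bytes to step over
};

// Hypothetical helper: inspects the first bytes of a buffer for a BOM.
BomResult detect_bom(const unsigned char* data, std::size_t size)
{
    if (size >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
        return { Encoding::Utf8, 3 };
    if (size >= 2 && data[0] == 0xFF && data[1] == 0xFE)
        return { Encoding::Utf16LE, 2 };
    if (size >= 2 && data[0] == 0xFE && data[1] == 0xFF)
        return { Encoding::Utf16BE, 2 };
    return { Encoding::Unknown, 0 };
}

// Hypothetical helper: swaps the two bytes of every UTF-16 code unit in
// place, turning UTF-16BE into UTF-16LE (or the reverse). A trailing odd
// byte, if any, is left untouched.
void swap_utf16_bytes(unsigned char* data, std::size_t size)
{
    for (std::size_t i = 0; i + 1 < size; i += 2) {
        const unsigned char tmp = data[i];
        data[i] = data[i + 1];
        data[i + 1] = tmp;
    }
}
```

When no BOM is found the caller still has to decide on an encoding by some other means; the sketch simply reports `Unknown` and a skip of zero.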

From: Peter Olcott on 20 May 2010 14:05

On 5/20/2010 12:49 PM, Joseph M. Newcomer wrote:
> See below...
> On Thu, 20 May 2010 11:43:51 -0500, Peter Olcott<NoSpam(a)OCR4Screen.com> wrote:
>
>> I do not require this. Probably the best tradeoff of the various design
>> alternatives keeping maximum speed as the binding constraint is that the
>> user passes me a mutable std::vector<unsigned char>. My code both
>> appends and then later removes the required 0xFF.
> ****
> But this is also a stupid design, because it tends to impose the requirement of a complete
> copy of the string. For someone who is as compulsive about performance as you are,
> requiring a string copy seems remarkably silly. Particularly because it can be done at
> ZERO cost when the file is read. But I will leave this complex and subtle technique as an
> Exercise For The Reader.

No.

void UTF8_to_UTF32(std::vector<unsigned char>& UTF8,
                   std::vector<unsigned int>& UTF32);

> ****
>>
>> I could also provide an overloaded immutable function that is slower
>> because it must copy all of the data.
> ****
> What are mutable and immutable functions? Perhaps you mean functions with const and
> non-const arguments? If so, please use the proper technical terms. A const function is
> NOT the same as a function that takes a const argument, and the technical term "mutable
> function" does not exist in the C or C++ language. In the new C++ standard, the qualifier
> "mutable" can be used for data members of a const declaration and for lambda functions
> which are not const (implicitly, a lambda function is declared as a const function, so if
> it is not there must be a declaration to say it is not). Data (including iterators) can
> be specified as mutable, functions do not have this description except for lambda
> expressions. But then, I took five minutes to search the C++ standard draft (30). I did
> not invent a new technical term that represents something that is undefined.
>
> Note that there is no need to ever copy the input string, because you can impose the
> requirement that the input string already have the sentinel appended, and you can append
> it at ZERO cost when you read the data from the file, so there is no need for a function
> with a non-const parameter.

This is great if it is reasonable for a
std::vector<unsigned char>& UTF8
parameter as well as a
unsigned char* UTF8
parameter.

I would think that most users might think it a little clumsy to require
std::vectors to require NULL terminating bytes.

>
> Also, where in the world did you get the idea that if you pass a non-const string that an
> append does not perform a copy? In fact, it frequently will. Nowhere does the
> std::string promise a copy is not required!
> joe

Even for reference parameters? If it does this for reference parameters
then I would say that it is semantically incorrect.

> ****
>>
>>> then if a wrong file is given by mistake your
>>> algorithm will perform a buffer overrun as you only rely on the sentinel
>>> to check for end. This is a crash waiting to happen. Better to not rely
>>> on a sentinel at all and check if end has been reached each iteration,
>>> we are only talking about an extra CPU instruction per iteration
>>> (compare and conditional jump versus unconditional jump).
>>>
>>> /Leigh
> Joseph M. Newcomer [MVP]
> email: newcomer(a)flounder.com
> Web: http://www.flounder.com
> MVP Tips: http://www.flounder.com/mvp_tips.htm
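
On the question of whether an append through a reference parameter can copy: passing by reference avoids copying the container object at the call, but `push_back` on a `std::vector` whose size already equals its capacity has to allocate a larger block and move the existing elements into it. The sketch below makes that observable. The helper name `append_sentinel_reallocates` is hypothetical, and it uses a change in `capacity()` as the signal that a reallocation took place.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: appends one sentinel byte through a reference and
// reports whether the vector had to reallocate, which means every existing
// byte was copied into new storage.
bool append_sentinel_reallocates(std::vector<unsigned char>& v,
                                 unsigned char sentinel)
{
    const std::size_t capacity_before = v.capacity();
    v.push_back(sentinel);
    return v.capacity() != capacity_before;
}
```

If the caller has reserved room in advance (for example `v.reserve(file_size + 1)` before filling the vector), the append is guaranteed not to reallocate. If the vector is exactly full, as it would be after being constructed directly from the file contents on many implementations, the append costs a full copy of the data.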

From: Joseph M. Newcomer on 20 May 2010 20:19

See below....
On Thu, 20 May 2010 12:55:48 -0500, Peter Olcott <NoSpam(a)OCR4Screen.com> wrote:

>On 5/20/2010 12:30 PM, Joseph M. Newcomer wrote:
>> See below...
>> On Thu, 20 May 2010 17:33:53 +0100, "Leigh Johnston"<leigh(a)i42.co.uk> wrote:
>>
>>>
>>>
>>> "Peter Olcott"<NoSpam(a)OCR4Screen.com> wrote in message
>>> news:wtadnXf96NMo_2jWnZ2dnUVZ_r6dnZ2d(a)giganews.com...
>>>> On 5/20/2010 11:11 AM, James Kanze wrote:
>>>>> On May 19, 7:31 pm, Peter Olcott<NoS...(a)OCR4Screen.com> wrote:
>>>>>> On 5/19/2010 1:00 PM, Leigh Johnston wrote:
>>>>>
>>>>> [...]
>>>>>> The main purpose of this is to read in a file of UTF-8 to be converted
>>>>>> to UTF-32. I don't have to mutate the input at all, the user must know
>>>>>> to append the 0xFF byte.
>>>>>
>>>>> In the file?
>>>>>
>>>>> [...]
>>>>>> Any UTF-8 to UTF-32 converter would not have 0xFF in its input unless
>>>>>> the data is corrupted.
>> ****
>> This was one of the most stupid ideas I have heard proposed in a long time.
>>
>> Note that every UTF-8 string will be terminated with \x00 (NUL) if it is a canonical
>> representation.
>
>I did not know that. Ignorance is not at all the same thing as
>stupidity. ASCII zero was one of two alternatives that I originally
>provided for my sentinel character. I changed this to 0xFF because I
>thought that it might be possible to have more than one embedded ASCII
>zeros in the input buffer.
****
But the whole POINT of UTF-8 is that there are no 0 bytes anywhere in it! Why would you
think a file had embedded 0 bytes? No text editor would ever do this. In fact, many text
editors GUARANTEE that this cannot happen. If it does, the file is ill-formed, and it is
not your problem. You ignore it.

Under normal conditions, when you do a ReadFile of the entire file contents, there is NO
zero byte appended, so typically you append two by creating a buffer two bytes larger than
you need. This allows you to insert the 0 bytes before you worry about looking for a BOM
or doing any heuristic test to determine the encoding of the text in the file, and no
matter what you discover, the text you read will be a properly-NUL-terminated string!

It does not happen by magic.
joe

****

>
>Now that I know it is a [canonical representation] that makes this whole
>aspect trivial.
>
>Apparently you are referring to something like
>unsigned char* Data
>as an input parameter.
>
>If the input is actually
>std::vector<unsigned char> data
>then can I still expect it to be NULL terminated?
>
>If not then it must be a mutable parameter so that I can append the
>required sentinel ASCII Zero.
>
>> The typical way to read a file in Windows is to simply allocate a buffer
>> of filesize+sizeof(WCHAR), read in the entire contents of the file, then, given the
>> number of bytes read, append two \x00 bytes (which will be one NUL character if it is a
>> UTF-16 encoding) to the buffer.
>
>Great, that is much simpler.
>
>> Then you can look for a BOM; if one is found, then you
>> adjust the start point to be just past the BOM; if it is UTF-16BE, on Windows you then run
>> through and swap the bytes of each UTF-16 character before working with the data. If it
>> is UTF-8, then you treat it as UTF-8 for whatever reason you want UTF-8; if it is
>> UTF-16LE, then you treat it as Windows' native UTF-16 encoding and do with it what you
>> want. But because two \x00 bytes have been appended, it is already a NUL-terminated
>> string. This is not Rocket Science, and it does not impose on the end user the need to
>> insert a non-standard character at the end of the file. What, exactly, is the problem
>> that appending a \xFF to the file solves that appending a \x00 byte after the file is read
>> does not?
>>
>> Note this algorithm can be generalized to support the possibility of UTF-32LE and UTF-32BE
>> input files. But I leave that generalization as an Exercise For The Reader.
>>
>> Requiring the user put some weird character at the end of the file is just a stupid
>> design. No sane designer (let alone a superb designer) would impose such an unbelievably
>> stupid requirement!
>> joe
>
>Not stupid at all, merely ignorant; there is a huge difference.
>
>>
>>>>>
>>>>> Am I the only one who senses a problem here? If you're reading
>>>>> from an external source (a file), then you have to assume that
>>>>> the file might contain anything; people do pass in the wrong
>>>>> filename, and your program has to handle that gracefully.
>>>>> (Error message, etc.)
>>>>>
>>>>> --
>>>>> James Kanze
>>>>
>>>> I must be validating UTF-8 as well as converting it to UTF-32. Only a DFA
>>>> can do this very quickly.
>>>
>>> You didn't respond to JK's point. If you require the file to contain 0xFF
>>> as the last byte then if a wrong file is given by mistake your algorithm
>>> will perform a buffer overrun as you only rely on the sentinel to check for
>>> end. This is a crash waiting to happen. Better to not rely on a sentinel
>>> at all and check if end has been reached each iteration, we are only talking
>>> about an extra CPU instruction per iteration (compare and conditional jump
>>> versus unconditional jump).
>>>
>>> /Leigh
>> Joseph M. Newcomer [MVP]
>> email: newcomer(a)flounder.com
>> Web: http://www.flounder.com
>> MVP Tips: http://www.flounder.com/mvp_tips.htm
Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
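
The buffer technique described above (allocate two bytes more than the file size, read the file, and have the two trailing bytes be zero) can be sketched in portable C++ as follows. Joe's description is in terms of the Win32 `ReadFile` call; this sketch swaps in `std::ifstream` so that it is self-contained, and the name `read_file_nul_terminated` is hypothetical. The point it illustrates is that the terminator is part of the one and only allocation, so no later append or copy is needed.

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: reads a whole file into `buffer`, which ends up two
// bytes longer than the file, with both extra bytes equal to zero.
// Returns false if the file cannot be opened or completely read.
bool read_file_nul_terminated(const std::string& path,
                              std::vector<unsigned char>& buffer)
{
    // Open at the end so the initial position gives the file size.
    std::ifstream file(path.c_str(), std::ios::binary | std::ios::ate);
    if (!file) return false;
    const std::streamoff size = file.tellg();
    if (size < 0) return false;
    file.seekg(0, std::ios::beg);

    const std::size_t n = static_cast<std::size_t>(size);
    // Single allocation of n + 2 zero bytes; the file data overwrites the
    // first n, leaving the two-byte terminator in place.
    buffer.assign(n + 2, 0);
    if (n > 0 && !file.read(reinterpret_cast<char*>(buffer.data()),
                            static_cast<std::streamsize>(n)))
        return false;
    return true;
}
```

The two zero bytes form one NUL code unit if the data turns out to be UTF-16 and a NUL byte (plus one spare) if it is UTF-8, so BOM detection and decoding can start from `buffer.data()` without touching the buffer again. The zero-fill done by `assign` is a convenience of this sketch; a version that avoids initializing the first `n` bytes would need a raw allocation instead.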

From: Paul N on 21 May 2010 07:31

On 19 May, 19:00, "Leigh Johnston" <le...(a)i42.co.uk> wrote:
> Please show me where it says that "swearing on the Internet is
> unprofessional" is a universal rule?

It's not a universal rule, but what you put on the internet can
rebound in unexpected ways. See for example http://www.msnbc.msn.com/id/18372103/
