From: Mihai N. on

> I studied the derivation of the above regular expression in considerable
> depth. I understand UTF-8 encoding quite well. So far I have found no
> error.

> It is published on w3c.org.

Stop repeating this nonsense.

The URL is http://www.w3.org/2005/03/23-lex-U and the post states:
"It is not endorsed by the W3C members, team, or any working group."

It is a hack implemented by someone and it happens to be on the w3c server.
This is not enough to make it right. If I post something on the free blogging
space offered by Microsoft, will you take it as law and say "it is published
on microsoft.com"?


--
Mihai Nita [Microsoft MVP, Visual C++]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email

From: Joseph M. Newcomer on
See below...
On Sun, 16 May 2010 23:20:21 -0500, Peter Olcott <NoSpam(a)OCR4Screen.com> wrote:

>On 5/16/2010 10:09 PM, Joseph M. Newcomer wrote:
>> OMG! Another Peter "Anyone who tells me the answer is different than my preconceived
>> answer is an idiot, and here's the proof!" post.
>>
>> Why am I not surprised?
>>
>> Of course, the "justification" is the usual "fast and easy" and, like most anti-Unicode
>> answers, still thinks that string size actually matters in any but the most esoteric
>> situations. I wonder how people actually manage to set priorities when they have no
>> concept of costs. Space is generally a useless argument, and speed is certainly slower
>> when you have to keep checking for MBCS encodings of any sort. Plus, you can't write code
>> that passes character values around.
>>
>> More below...
>>
>> On Sun, 16 May 2010 07:34:11 -0500, "Peter Olcott"<NoSpam(a)OCR4Screen.com> wrote:
>>
>>> Since the reason for using other encodings than UTF-8 is
>>> speed and ease of use, a string that is as fast and easy to
>>> use (as the strings of other encodings) that often takes
>>> less space would be superior to alternative strings.
>>>
>>> I have derived a design for a utf8string that implements the
>>> most useful subset of std::string. I match the std::string
>>> interface to keep the learning curve to an absolute minimum.
>>>
>>> I just figured out a way to make most of utf8string
>>> operations take about the same amount of time and space as
>>> std::string operations. All of the other utf8string
>>> operations take only a minimal amount of additional time
>>> and space over std::string. These operations involve
>>> construction/validation and converting to and from Unicode
>>> CodePoints.
>>>
>>> class utf8string {
>>> unsigned int BytePerCodepoint;
>> ****
>> What does this value represent, and why is it not declared UINT?
>> ****
>>> std::vector<unsigned char> Data;
>>> std::vector<unsigned int> Index;
>> ****
>> What is an index array for? Positions in the string? Sounds to me like this is going to
>> be FAR less efficient in space than a Unicode string. Since the bytes required for each
>> code point are 1, 2, 3 or 4, and each code point determines how many bytes are required,
>> it is not clear how a single integer can encode this.
>>> };
>>>
>>> I use this regular expression found on this link:
>>> http://www.w3.org:80/2005/03/23-lex-U
>>>
>>> 1 ['\u0000'-'\u007F']
>>> 2 | (['\u00C2'-'\u00DF'] ['\u0080'-'\u00BF'])
>>> 3 | ( '\u00E0' ['\u00A0'-'\u00BF'] ['\u0080'-'\u00BF'])
>>> 4 | (['\u00E1'-'\u00EC'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
>>> 5 | ( '\u00ED' ['\u0080'-'\u009F'] ['\u0080'-'\u00BF'])
>>> 6 | (['\u00EE'-'\u00EF'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
>>> 7 | ( '\u00F0' ['\u0090'-'\u00BF'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
>>> 8 | (['\u00F1'-'\u00F3'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
>>> 9 | ( '\u00F4' ['\u0080'-'\u008F'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
>>>
>>> To build a finite state machine (DFA) recognizer for UTF-8
>>> strings. There is no faster or simpler way to validate and
>>> divide a string of bytes into their corresponding Unicode
>>> code points than a finite state machine.
>>>
>>> Since most (if not all) character sets always have a
>>> consistent number of BytePerCodepoint, this value can be
>>> used to quickly get to any specific CodePoint in the UTF-8
>>> encoded data. For ASCII strings this will have a value of
>>> One.
>> ****
>> I gues in Peter's Fantsy Character Set Encoding this must be true; it is certainly NOT
>> true in UTF-8 encoding. But don't let Reality interfere with your design!
>
>This from a guy that does not even take the time to spell "guess" or
>"Fantasy" correctly.
****
Or perhaps I am constrained in my typing because I am using a laptop in an awkward
position. And my newsgroup reader does not have realtime spelling correction. And perhaps
you are more concerned with superficial issues than deep issues.
****
>
>If you are not a liar then show an error in the above regular
>expression, I dare you.
***
I have already pointed out that it is insufficient for lexically recognizing accent marks
or invalid combinations of accent marks. So the requirement of demonstrating an error is
trivially met.

In addition, the regexp values do not account for directional changes in the parse, which
is essential, for reasons I explained in another response.

It would be easier if you had expressed it as Unicode codepoints; then it would be easy to
show the numerous failures. I'm sorry, I thought you had already applied exhaustive
categorical reasoning to this, which would have demonstrated the errors.

Unfortunately, in my answer about whether the regexp is correct, I not only said "No"
but wasted the time to explain why; yet you had insisted you wanted a simple "yes/no"
answer, and clearly you do not. You want an explanation of why "no" is the correct
answer. Try to be consistent. A little research on your part wouldn't hurt; have you
actually READ the Unicode Standard 5.0? If you haven't, you have no business criticizing
anyone who responds to your question. All the concerns you have and the answers you need
are trivially derivable from that book.

I doubt that you know enough about Chinese, Japanese or Korean to tell if it can properly
parse a sequence of lexical entities in those languages. So why do you think some random
regexp with massive ranges is smart enough to do it?
****
>
>I studied the derivation of the above regular expression in considerable
>depth. I understand UTF-8 encoding quite well. So far I have found no
>error. It is published on w3c.org.
****
So what? A few weeks ago, you had never heard of W3C and now you claim everything on
their Web site should be treated as Gospel, and must not be questioned by mere mortals?

You have not said what the original purpose of their spec is, and it probably has to do
with HTML parsing, which is not the same as parsing an input language for a compiler.

Perhaps you need to realize that a correct question always appears in a context, and as a
standalone question completely out of context it has no meaning.
****
>
>You may be right about using UTF-8 as an internal representation. This
>thread is my strongest "devil's advocate" case for using UTF-8 as an
>internal representation.
>
>> *****
>>>
>>> In those rare cases where a single utf8string has differing
>>> length byte sequences representing CodePoints, then the
>>> std::vector<unsigned int> Index;
>>> is derived. This std::vector stores the subscript within
>>> Data where each CodePoint begins. It is derived once during
>>> construction, which is when validation occurs. A flag value
>>> of Zero assigned to BytePerCodepoint indicates that the
>>> Index is needed.
>> ****
>> And this is "faster" exactly HOW? It uses "less space" exactly HOW?
>
>ASCII only needs a single byte per code point.
>utf8string::BytePerCodepoint = 1; // for ASCII
>ASCII does not need to use the Index data member.
*****
But any blend of 7-bit ASCII (you forgot to qualify that by limiting it to the 7-bit
subset) and 8-bit ASCII or Unicode < U+FFFF requires a MIX of 1-byte and 2-byte encodings,
so you immediately have to use the index member as soon as the string mixes characters
inside and outside U+0000-U+007F! Which is going to be a fairly common scenario outside
the U.S., even in Canada and Mexico, not to mention Europe, even if they use only their
native code pages!

So the representation is going to require the index vector for the variable

mädchen

so you have imposed a complex solution to what should have been a simple problem!
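
To make that concrete, here is a rough illustrative sketch (toy code, not the utf8string
class under discussion, and it assumes the bytes have already been validated) of the
per-codepoint byte index that such a mixed string forces you to build:

#include <string>
#include <vector>

// Byte offset of each code point in an already-validated UTF-8 string.
std::vector<unsigned int> BuildCodepointIndex(const std::string& utf8)
{
    std::vector<unsigned int> index;
    for (unsigned int i = 0; i < utf8.size(); )
    {
        index.push_back(i);                         // this byte starts a code point
        unsigned char lead = (unsigned char)utf8[i];
        if      (lead < 0x80) i += 1;               // U+0000..U+007F, one byte
        else if (lead < 0xE0) i += 2;               // two-byte sequence
        else if (lead < 0xF0) i += 3;               // three-byte sequence
        else                  i += 4;               // four-byte sequence
    }
    return index;   // "m\xC3\xA4dchen" -> 0, 1, 3, 4, 5, 6, 7
}

Seven characters, eight bytes: a single accented letter already makes every fixed
BytePerCodepoint value wrong for the whole string.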
joe

>
>>
>> Sounds to me like a horrible kludge that solves a problem that should not need to exist.
>>
>> And in looking at the encodings, I have found more exceptions to the encodings than the
>> number of regular expressions given, so either some letters are not being included or some
>> non-letters ARE being included. But hey, apparently this regexp was found "On the
>> Internet" so it MUST be correct! I already pointed out in an earlier post why it is
>> almost certainly deeply flawed.
>>>
>>> For the ASCII character set the use of utf8string is just as
>>> fast and uses hardly any more space than std::string. For
>>> other character sets utf8string is most often just as fast
>>> as std::string, and uses a minimal increment of
>>> additional space only when needed. Even Chinese most often
>>> takes only three bytes.
>> ****
>> And this handles these encodings HOW? by assuming they all take 3 bytes?
>> joe
>> ****
>>>
>> Joseph M. Newcomer [MVP]
>> email: newcomer(a)flounder.com
>> Web: http://www.flounder.com
>> MVP Tips: http://www.flounder.com/mvp_tips.htm
Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
From: Oliver Regenfelder on
Hello,

Joseph M. Newcomer wrote:
>>> utf8string handles all of the conversions needed transparently. Most often
>>> no conversion is needed. Because of this it is easier to use than the
>>> methods that you propose. It always works for any character set with
>>> maximum speed, and less space.
>> Like I said, your solution is suboptimal; working with std::string and
>> std::wstring and providing free functions to convert between different
>> encoding formats is not suboptimal, especially when the host operating
>> system's native Unicode encoding is not UTF-8.
> ****
> Now, now, you are trying to be RATIONAL! This never works.
> joe

I think you tried that with Peter yourself for a very long time.

Best regards,

Oliver
From: Oliver Regenfelder on
Hello,

Joseph M. Newcomer wrote:
> This one makes no
> sense. There will be ORDERS OF MAGNITUDE greater differences in input time if you take
> rotational latency and seek time into consideration (in fact, opening the file will have
> orders of magnitude more variance than the cost of a UTF-8 to UTF-16 or even UTF-32
> conversion, because of the directory lookup time variance).

Do yourself a favor, Peter, and believe him!
A hard disk takes, as a rough guess (seek + half a rotation @ 7,200 rpm),
~12-14 ms to reach a sector for I/O, and that is only the raw hardware
delay. On networks you will have round-trip times of maybe 60 ms or
more (strongly depending on your Internet connection and server location). So
any computational effort for your string conversion doesn't matter,
especially as your script language files won't be in the gigabyte range.
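As a rough back-of-the-envelope check (the throughput figure is a guess, not a
measurement): converting a 100 KB source file from UTF-8 to UTF-16 at even a modest
100 MB/s costs about 1 ms, an order of magnitude below a single 12-14 ms seek and a
small fraction of one 60 ms round trip.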

Best regards,

Oliver
From: Peter Olcott on
On 5/17/2010 1:35 AM, Mihai N. wrote:
>
>> I studied the derivation of the above regular expression in considerable
>> depth. I understand UTF-8 encoding quite well. So far I have found no
>> error.
>
>> It is published on w3c.org.
>
> Stop repeating this nonsense.
>
> The URL is http://www.w3.org/2005/03/23-lex-U and the post states:
> "It is not endorsed by the W3C members, team, or any working group."
>
> It is a hack implemented by someone and it happens to be on the w3c server.
> This is not enough to make it right. If I post something on the free blogging
> space offered by Microsoft, will you take it as law and say "it is published
> on microsoft.com"?
>
>
Do you know of any faster way to validate and divide a UTF-8 sequence
into its constituent code point parts than a regular expression
implemented as a finite state machine? (please don't cite a software
package, I am only interested in the underlying methodology).

To the very best of my knowledge (and I have a patent on a finite state
recognizer), a regular expression implemented as a finite state machine
is the fastest and simplest way that can possibly exist to validate a
UTF-8 sequence and divide it into its constituent parts.
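
For concreteness, the nine alternatives of the regular expression quoted earlier in the
thread translate almost line for line into a byte-range recognizer along the following
lines. This is only a minimal sketch: the names and layout are illustrative, not taken
from the W3C page or from any library, and a production recognizer would compile the same
ranges into a DFA transition table instead of an if-chain, but the accepted language is
the same.

#include <string>
#include <vector>

// Appends the byte offset of every code point in 'bytes' to 'starts'.
// Returns false at the first invalid or truncated sequence.
bool SplitUtf8(const std::string& bytes, std::vector<size_t>& starts)
{
    size_t i = 0;
    const size_t n = bytes.size();
    while (i < n)
    {
        const unsigned char b0 = (unsigned char)bytes[i];
        size_t len;
        unsigned char lo = 0x80, hi = 0xBF;        // default continuation range

        if      (b0 <= 0x7F)                 len = 1;               // rule 1
        else if (b0 >= 0xC2 && b0 <= 0xDF)   len = 2;               // rule 2
        else if (b0 == 0xE0)               { len = 3; lo = 0xA0; }  // rule 3 (no overlongs)
        else if (b0 >= 0xE1 && b0 <= 0xEC)   len = 3;               // rule 4
        else if (b0 == 0xED)               { len = 3; hi = 0x9F; }  // rule 5 (no surrogates)
        else if (b0 >= 0xEE && b0 <= 0xEF)   len = 3;               // rule 6
        else if (b0 == 0xF0)               { len = 4; lo = 0x90; }  // rule 7 (no overlongs)
        else if (b0 >= 0xF1 && b0 <= 0xF3)   len = 4;               // rule 8
        else if (b0 == 0xF4)               { len = 4; hi = 0x8F; }  // rule 9 (<= U+10FFFF)
        else return false;          // 0x80-0xC1 and 0xF5-0xFF can never start a sequence

        if (i + len > n) return false;             // truncated sequence at end of input

        // The narrowed lo/hi bounds apply only to the second byte; any
        // remaining continuation bytes must be in 0x80..0xBF.
        for (size_t k = 1; k < len; ++k)
        {
            const unsigned char b = (unsigned char)bytes[i + k];
            const unsigned char kLo = (k == 1) ? lo : 0x80;
            const unsigned char kHi = (k == 1) ? hi : 0xBF;
            if (b < kLo || b > kHi) return false;
        }

        starts.push_back(i);
        i += len;
    }
    return true;
}

Fed "m\xC3\xA4dchen" it fills starts with 0, 1, 3, 4, 5, 6, 7 and returns true; fed a
lone 0xC0 byte or a sequence cut off mid-character it returns false. It validates and
splits in one left-to-right pass, which is the property being claimed above; whether
anything built on top of it handles combining marks or bidirectional text is a separate
question, as pointed out elsewhere in the thread.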