From: Joseph M. Newcomer on
See below...
On Sun, 16 May 2010 09:37:23 -0500, Peter Olcott <NoSpam(a)OCR4Screen.com> wrote:

>On 5/16/2010 8:21 AM, Leigh Johnston wrote:
>> "Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote in message
>> news:hsSdnSJTmcrZe3LWnZ2dnUVZ_q6dnZ2d(a)giganews.com...
>>> Since the reason for using other encodings than UTF-8 is speed and
>>> ease of use, a string that is as fast and easy to use (as the strings
>>> of other encodings) that often takes less space would be superior to
>>> alternative strings.
>>>
>>> I have derived a design for a utf8string that implements the most
>>> useful subset of std::string. I match the std::string interface to
>>> keep the learning curve to an absolute minimum.
>>>
>>> I just figured out a way to make most of utf8string operations take
>>> about the same amount of time and space as std::string operations. All
>>> of the other utf8string operations take a minimum amount of time and
>>> space over std::string. These operations involve
>>> construction/validation and converting to and from Unicode CodePoints.
>>>
>>> class utf8string {
>>> unsigned int BytePerCodepoint;
>>> std::vector<unsigned char> Data;
>>> std::vector<unsigned int> Index;
>>> };
>>>
>>> I use this regular expression found on this link:
>>> http://www.w3.org:80/2005/03/23-lex-U
>>>
>>> 1 ['\u0000'-'\u007F']
>>> 2 | (['\u00C2'-'\u00DF'] ['\u0080'-'\u00BF'])
>>> 3 | ( '\u00E0' ['\u00A0'-'\u00BF'] ['\u0080'-'\u00BF'])
>>> 4 | (['\u00E1'-'\u00EC'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
>>> 5 | ( '\u00ED' ['\u0080'-'\u009F'] ['\u0080'-'\u00BF'])
>>> 6 | (['\u00EE'-'\u00EF'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
>>> 7 | ( '\u00F0' ['\u0090'-'\u00BF'] ['\u0080'-'\u00BF']
>>> ['\u0080'-'\u00BF'])
>>> 8 | (['\u00F1'-'\u00F3'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF']
>>> ['\u0080'-'\u00BF'])
>>> 9 | ( '\u00F4' ['\u0080'-'\u008F'] ['\u0080'-'\u00BF']
>>> ['\u0080'-'\u00BF'])
>>>
>>> To build a finite state machine (DFA) recognizer for UTF-8 strings.
>>> There is no faster or simpler way to validate and divide a string of
>>> bytes into their corresponding Unicode code points than a finite state
>>> machine.
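As a minimal illustration (a sketch, not code from any of the posts), the nine alternatives above can be transcribed directly into a recognizer that returns the byte length of the code point starting at a given position, or 0 if the bytes are ill-formed; a generated DFA is just the table-driven form of the same checks:

#include <vector>

static bool InRange(unsigned char c, unsigned char lo, unsigned char hi)
{
    return c >= lo && c <= hi;
}

// Length in bytes of the UTF-8 sequence starting at s[i], or 0 if ill-formed.
unsigned SequenceLength(const std::vector<unsigned char>& s, size_t i)
{
    size_t n = s.size() - i;                                   // bytes remaining
    unsigned char b0 = s[i];
    if (b0 <= 0x7F)                                                           // 1
        return 1;
    if (n >= 2 && InRange(b0, 0xC2, 0xDF) && InRange(s[i+1], 0x80, 0xBF))     // 2
        return 2;
    if (n >= 3 && InRange(s[i+2], 0x80, 0xBF)) {
        if (b0 == 0xE0 && InRange(s[i+1], 0xA0, 0xBF)) return 3;              // 3
        if (InRange(b0, 0xE1, 0xEC) && InRange(s[i+1], 0x80, 0xBF)) return 3; // 4
        if (b0 == 0xED && InRange(s[i+1], 0x80, 0x9F)) return 3;              // 5
        if (InRange(b0, 0xEE, 0xEF) && InRange(s[i+1], 0x80, 0xBF)) return 3; // 6
    }
    if (n >= 4 && InRange(s[i+2], 0x80, 0xBF) && InRange(s[i+3], 0x80, 0xBF)) {
        if (b0 == 0xF0 && InRange(s[i+1], 0x90, 0xBF)) return 4;              // 7
        if (InRange(b0, 0xF1, 0xF3) && InRange(s[i+1], 0x80, 0xBF)) return 4; // 8
        if (b0 == 0xF4 && InRange(s[i+1], 0x80, 0x8F)) return 4;              // 9
    }
    return 0;                                       // not well-formed UTF-8 here
}

A constructor would walk the buffer with such a function, recording each code point's starting offset and rejecting the string on a 0 return.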
>>>
>>> Since most (if not all) character sets always have a consistent number
>>> of BytePerCodepoint, this value can be used to quickly get to any
>>> specific CodePoint in the UTF-8 encoded data. For ASCII strings this
>>> will have a value of One.
>>>
>>> In those rare cases where a single utf8string has differing length
>>> byte sequences representing CodePoints, then the
>>> std::vector<unsigned int> Index;
>>> is derived. This std::vector stores the subscript within Data where
>>> each CodePoint begins. It is derived once during construction which is
>>> when validation occurs. A flag value of Zero is assigned to
>>> BytePerCodepoint to indicate that the Index is needed.
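A rough sketch of the lookup this scheme buys (member names follow the class fragment above; the method name and everything else are assumptions):

// Byte offset within Data at which code point number cp begins.
size_t utf8string::ByteOffset(size_t cp) const
{
    if (BytePerCodepoint != 0)        // uniform width: simple multiplication
        return cp * BytePerCodepoint;
    return Index[cp];                 // mixed widths: table built during
                                      // construction/validation
}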
>>>
>>> For the ASCII character set the use of utf8string is just as fast and
>>> uses hardly any more space than std::string. For other character sets
>>> utf8string is most often just as fast as std::string, and only uses a
>>> minimal increment of additional space only when needed. Even Chinese
>>> most often only takes three bytes.
>>
>> Why do you insist on flogging this dead horse?
>
>I just came up with this improved design this morning.
>
> > I suspect most of us are
>> happy storing UTF-8 in an ordinary std::string and converting (to
>> std::wstring for example) as and when required, I certainly am. Your
>> solution has little general utility: working with UTF-16 (std::wstring)
>> can be more efficient than constantly decoding individual code points
>> from UTF-8 like you suggest.
>>
>> /Leigh
>
>Neither std::string nor std::wstring know anything at all about Unicode.
>All Unicode based operations require very substantial manual
>intervention to work correctly with std::string or std::wstring.
>utf8string makes all of this transparent to the user.
****
How is it that std::string knows about ANSI encoding? Actually, as far as I can tell, it
doesn't, and it doesn't know anything about UTF-8 either, requiring very substantial
manual intervention. But gee, I've been using Unicode for only about 14 years now, so
what do I know about it?
****
>
>There are very few instances where the utf8string need be converted to
>individual code points. In almost all cases there is no need for this.
>If you are mixing character sets with differing byte length encodings
>(such as Chinese and English) in the same utf8string, then this would be
>needed. I can't imagine any other reasons to need to translate from
>UTF-8 to code points.
****
Other than it simplifies your recognizer, makes your coding simpler and more efficient,
and is less error-prone.
****
>
>UTF-8 is the standard Unicode data interchange format. This aspect is
>crucial to internet based applications. Unlike other encodings UTF-8
>works the same way on every machine architecture not requiring any
>accounting or adaptation for things such as Little or Big Endian.
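(A concrete example of the point: the code point U+00E9, é, is the single UTF-16 code unit 0x00E9, stored as the bytes 00 E9 on a big-endian machine but E9 00 on a little-endian one, which is why UTF-16 data needs a BOM or a declared byte order; its UTF-8 form is the byte sequence C3 A9 on every machine.)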
****
But why should you care in the slightest? Endianness is handled in the UTF-8 to Unicode
conversion? You are making a good argument for why UTF-8 is a good EXTERNAL
representation (which I would not argue with) but then you are extending this to the idea
that it necessarily makes a good INTERNAL representation as well, which is NOT true.
****
>
>utf8string handles all of the conversions needed transparently. Most
>often no conversion is needed. Because of this it is easier to use than
>the methods that you propose. It always works for any character set with
>maximum speed, and less space.
*****
No, it does not work with "maximum speed", and the space argument is childish.
****
>
>If the use is focused on Asian character sets, then a UTF-16 string
>would take less space. If an application must handle every character
>set, then the space savings for ASCII will likely outweigh the
>additional space cost of UTF-16. The reason for this is that studies
>have shown that the United States consumes about one half of the world's
>supply of software. In any case conversions can be provided between
>utf8string and utf16string. utf16string would have an identical design.
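For a concrete sense of the sizes being traded off here, a small self-contained check (illustrative only; it needs a C++11/14 compiler for the u8/u/U literals):

#include <cstdio>
#include <string>

int main()
{
    // Encoded size of one Han character, U+4E2D, under each encoding.
    std::printf("UTF-8 : %u bytes\n", (unsigned)std::string(u8"\u4E2D").size());          // 3
    std::printf("UTF-16: %u bytes\n", (unsigned)(std::u16string(u"\u4E2D").size() * 2));  // 2
    std::printf("UTF-32: %u bytes\n", (unsigned)(std::u32string(U"\u4E2D").size() * 4));  // 4
    return 0;   // an ASCII letter would be 1, 2 and 4 bytes respectively
}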
*****
Actually, UTF-8 isn't a very good encoding for some languages, such as Chinese; UTF-32 is
much better, but Microsoft refuses to support it, even though it has been the standard
for many years.

But "flogging a dead horse"? Peter is an expert at this, and, sadly, the horse always
remains dead.
joe
****
Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
From: Joseph M. Newcomer on
See below...
On Sun, 16 May 2010 15:53:48 +0100, "Leigh Johnston" <leigh(a)i42.co.uk> wrote:

>"Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote in message
>news:oJWdnaEbIuq5nm3WnZ2dnUVZ_tmdnZ2d(a)giganews.com...
>>
>> Neither std::string nor std::wstring know anything at all about Unicode.
>> All Unicode based operations require very substantial manual intervention
>> to work correctly with std::string or std::wstring. utf8string makes all
>> of this transparent to the user.
>
>std::string can be used to hold UTF-8, and std::wstring can be used to hold
>UTF-16. I see no need for a class which tries to do both and does so
>suboptimally.
****
You have obviously never been in a debate with Peter; the point here is not to come up
with a GOOD solution, but a solution that justifies his bad design decisions.
****
>
>> There are very few instances where the utf8string need be converted to
>> individual code points. In almost all cases there is no need for this.
>> If you are mixing character sets with differing byte length encodings
>> (such as Chinese and English) in the same utf8string, then this would be
>> needed. I can't imagine any other reasons to need to translate from UTF-8
>> to code points.
>
>The host operating system's native Unicode encoding is unlikely to be UTF-8,
>it is more likely to be UTF-16 (this is the case for Windows) as UTF-16 is
>more space efficient than UTF-8 when support for non-Latin character sets is
>required. Manipulating UTF-16 will always be more efficient than
>manipulating UTF-8 as UTF-16 is a fixed length encoding (surrogate pairs
>aside) whereas UTF-8 is a variable length encoding.
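(The surrogate-pair exception in concrete terms: code points above U+FFFF take two UTF-16 code units; for example U+1D11E, the musical G clef, is encoded as the pair D834 DD1E.)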
****
Never try to confuse the issue by pointing out FACTS!
****
>
>> UTF-8 is the standard Unicode data interchange format. This aspect is
>> crucial to internet based applications. Unlike other encodings UTF-8 works
>> the same way on every machine architecture not requiring any accounting or
>> adaptation for things such as Little or Big Endian.
>
>Nobody said otherwise.
>
>>
>> utf8string handles all of the conversions needed transparently. Most often
>> no conversion is needed. Because of this it is easier to use than the
>> methods that you propose. It always works for any character set with
>> maximum speed, and less space.
>
>Like I said your solution is suboptimal, working with std::string and
>std::wstring and providing free functions to convert between different
>encoding formats is not suboptimal especially when the host operating
>system's native Unicode encoding is not UTF-8.
****
Now, now, you are trying to be RATIONAL! This never works.
joe
****
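As an illustration of what such free conversion functions might look like on Windows, a minimal sketch (the names Utf8ToWide/WideToUtf8 are made up here; error handling and empty-string checks are omitted):

#include <windows.h>
#include <string>

std::wstring Utf8ToWide(const std::string& s)
{
    int n = MultiByteToWideChar(CP_UTF8, 0, s.data(), (int)s.size(), NULL, 0);
    std::wstring w(n, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, s.data(), (int)s.size(), &w[0], n);
    return w;
}

std::string WideToUtf8(const std::wstring& w)
{
    int n = WideCharToMultiByte(CP_UTF8, 0, w.data(), (int)w.size(), NULL, 0, NULL, NULL);
    std::string s(n, '\0');
    WideCharToMultiByte(CP_UTF8, 0, w.data(), (int)w.size(), &s[0], n, NULL, NULL);
    return s;
}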
>
>>
>> If the use is focused on Asian character sets, then a UTF-16 string would
>> take less space. If an application must handle every character set, then
>> the space savings for ASCII will likely outweigh the additional space
>> cost of UTF-16. The reason for this is that studies have shown that the
>> United States consumes about one half of the world's supply of software.
>> In any case conversions can be provided between utf8string and
>> utf16string. utf16string would have an identical design.
>
>This is not a good argument; I am not convinced of the usefulness of these
>"studies" you cite. We live in an internationalized world, not in a
>USA-centric world.
*****
I am curious what a Chinese "letter" is according to the regexp. Since I know no Chinese,
I can't give any counterexamples, but I once worked for a firm that did do Chinese-based
products, and I remember some of the conversations about how complex recognition actually
was. But that was more than a decade ago, and the counterexamples about recognition were
always supplied by the native Chinese programmers, so they were lost on me even then.
joe
****
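(A concrete data point on that question: the common Han character U+4E2D encodes as the bytes E4 B8 AD, which fall under alternative 4 of the regexp quoted earlier, a lead byte in [E1-EC] followed by two continuation bytes in [80-BF]. The regexp only checks that byte sequences are well-formed UTF-8; it makes no letter/non-letter distinction at all.)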
>
>/Leigh
Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
From: Joseph M. Newcomer on
See below...
On Sun, 16 May 2010 09:46:16 -0500, Peter Olcott <NoSpam(a)OCR4Screen.com> wrote:

>On 5/16/2010 8:51 AM, Öö Tiib wrote:
>> On 16 mai, 15:34, "Peter Olcott"<NoS...(a)OCR4Screen.com> wrote:
>>> Since the reason for using other encodings than UTF-8 is
>>> speed and ease of use, a string that is as fast and easy to
>>> use (as the strings of other encodings) that often takes
>>> less space would be superior to alternative strings.
>>
>> If you care so much ... perhaps throw together your utf8string and let
>> us see it. Perhaps test & profile it first to compare with
>> Glib::ustring. http://library.gnome.org/devel/glibmm/2.23/classGlib_1_1ustring.html
>>
>> I suspect UTF-8 will gradually fade into history, for reasons similar to
>> why 256-color video modes and raster graphic formats went. GUIs are
>> already often made with Java or C# (for lack of C++ devs) and these
>> use UTF-16 internally. Notice that modern processor architectures are
>> already optimized in such a way that byte-level operations are often
>> slower.
>
>UTF-8 is the best Unicode data-interchange format because it works
>exactly the same way across every machine architecture without the need
>for separate adaptations. It also stores the entire ASCII character set
>in a single byte per code point.
****
How do we make the leap from "best data interchange format" to "best internal
representation"? I fail to see the correlation here. Or why a parser for a C-like
language needs to "save space" by foolish examples of optimization. This issue does not
become important until the files start approaching the gigabyte range.
****
>
>I will put it together because it will become one of my standard tools.
>The design is now essentially complete. Coding this updated design will
>go very quickly. I will put it on my website and provide a free license
>for any use as long as the copyright notice remains in the source code.
>
*****
And if you do not measure its performance, and express that in terms of time and space,
and demonstrate that it runs no slower and consumes enough less space to make a
difference, it is all a colossal waste of time. Most of us know that it will be slower
and either consume insignificantly less space or, given the need of the index vector,
vastly more space (thus also making it slower, because all accesses must be mediated by
the index vector), or end up being measured just in the trivial subcase of 8-bit character
input (not an important measure). Either way, the whole design seems just flat-out wrong.
joe
****
Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
From: Joseph M. Newcomer on
On Sun, 16 May 2010 21:40:58 -0500, Peter Olcott <NoSpam(a)OCR4Screen.com> wrote:

>On 5/16/2010 3:12 PM, I V wrote:
>> On Sun, 16 May 2010 07:34:11 -0500, Peter Olcott wrote:
>>> Since most (if not all) character sets always have a consistent number
>>> of BytePerCodepoint, this value can be used to quickly get to any
>>
>> I'm not sure that's true. Certainly, the Latin character set(s) that one
>> would use for most Western languages mixes one- and two-byte characters
>> (think again of James Kanze's example of naïve). Non-Latin character sets
>> would also frequently use ASCII punctuation (e.g. writing in Greek and
>> Cyrillic).
>
>Yes those would be exceptions.
>
>>
>>> In those rare cases where a single utf8string has differing length byte
>>> sequences representing CodePoints, then the
>>> std::vector<unsigned int> Index;
>>> is derived. This std::vector stores the subscript within Data where each
>>> CodePoint begins. It is derived once during construction which is when
>>> validation occurs. A flag value of Zero is assigned to BytePerCodepoint
>>> to indicate that the Index is needed.
>>
>> Note that in this case you are storing an integer per character; this is
>> likely to be four bytes, plus at least one byte for the character itself,
>> that is, one more byte than if you had just translated into UTF-32.
>
>Yes, but it depends on how often this is needed. Even if it is needed all
>the time, we still have the advantage of speed. Because almost
>everything (especially including I/O) requires no conversion the
>utf8string may be faster to the extent that conversions are eliminated.
>Most every operation takes about the same time as std::string.
****
Peter, stop focussing on such silly concepts as conversion time mattering in the
slightest, and start giving RATIONAL reasons for your design decisions. This one makes no
sense. There will be ORDERS OF MAGNITUDE greater differences in input time if you take
rotational latency and seek time into consideration (in fact, opening the file will have
orders of magnitude more variance than the cost of a UTF-8 to UTF-16 or even UTF-32
conversion, because of the directory lookup time variance). So you are saying that you
will save some fraction of a tenth of a percent of overall performance by not converting
to UTF-16. At this point, anyone who has ever realistically done performance optimization
is rolling on the floor at the failure to understand where the real problems are. You
have to write and debug some complex class, and write all your code in terms of this
class, and implement complex and probably erroneous regexps to handle the cases, to save
an unmeasurably small amount of time at the input and output edges of your code? Get real!

Why is it you keep inventing complex and unnecessary solutions to simple problems, and
keep giving justifications that don't even make sense? "time" and "space" arguments are
not credible here, because they are optimizing parameters that are so minute as to be
within a fraction of a standard deviation of actual measured performance, or as those of
us who used to worry about these things a LOT used to say "it is lost in the noise".

We used to worry about optimizing things that actually MATTERED.
joe


>
>The primary reason that I am focusing on UTF-8 is that I want to
>internationalize I/O. The scripting language encoding will be UTF-8 so
>that everyone can write scripts in their native language character set.
>I must also provide internationalized output from my OCR4Screen
>character recognition engine.
Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
From: Peter Olcott on
On 5/16/2010 10:09 PM, Joseph M. Newcomer wrote:
> OMG! Another Peter "Anyone who tells me the answer is different than my preconceived
> answer is an idiot, and here's the proof!" post.
>
> Why am I not surprised?
>
> Of course, the "justification" is the usual "fast and easy" and, like most anti-Unicode
> answers, still thinks that string size actually matters in any but the most esoteric
> situations. I wonder how people actually manage to set priorities when they have no
> concept of costs. Space is generally a useless argument, and speed is certainly slower
> when you have to keep checking for MBCS encodings of any sort. Plus, you can't write code
> that passes character values around.
>
> More below...
>
> On Sun, 16 May 2010 07:34:11 -0500, "Peter Olcott"<NoSpam(a)OCR4Screen.com> wrote:
>
>> Since the reason for using other encodings than UTF-8 is
>> speed and ease of use, a string that is as fast and easy to
>> use (as the strings of other encodings) that often takes
>> less space would be superior to alternative strings.
>>
>> I have derived a design for a utf8string that implements the
>> most useful subset of std::string. I match the std::string
>> interface to keep the learning curve to an absolute minimum.
>>
>> I just figured out a way to make most of utf8string
>> operations take about the same amount of time and space as
>> std::string operations. All of the other utf8string
>> operations take a minimum amount of time and space over
>> std::string. These operations involve
>> construction/validation and converting to and from Unicode
>> CodePoints.
>>
>> class utf8string {
>> unsigned int BytePerCodepoint;
> ****
> What does this value represent, and why is it not declared UINT?
> ****
>> std::vector<unsigned char> Data;
>> std::vector<unsigned int> Index;
> ****
> What is an index array for? Positions in the string? Sounds to me like this is going to
> be FAR less efficient in space than a Unicode string. Since the bytes required for each
> code point are 1, 2, 3 or 4, and each code point determines how many bytes are required,
> it is not clear how a single integer can encode this.
>> };
>>
>> I use this regular expression found on this link:
>> http://www.w3.org:80/2005/03/23-lex-U
>>
>> 1 ['\u0000'-'\u007F']
>> 2 | (['\u00C2'-'\u00DF'] ['\u0080'-'\u00BF'])
>> 3 | ( '\u00E0' ['\u00A0'-'\u00BF']
>> ['\u0080'-'\u00BF'])
>> 4 | (['\u00E1'-'\u00EC'] ['\u0080'-'\u00BF']
>> ['\u0080'-'\u00BF'])
>> 5 | ( '\u00ED' ['\u0080'-'\u009F']
>> ['\u0080'-'\u00BF'])
>> 6 | (['\u00EE'-'\u00EF'] ['\u0080'-'\u00BF']
>> ['\u0080'-'\u00BF'])
>> 7 | ( '\u00F0' ['\u0090'-'\u00BF']
>> ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
>> 8 | (['\u00F1'-'\u00F3'] ['\u0080'-'\u00BF']
>> ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
>> 9 | ( '\u00F4' ['\u0080'-'\u008F']
>> ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
>>
>> To build a finite state machine (DFA) recognizer for UTF-8
>> strings. There is no faster or simpler way to validate and
>> divide a string of bytes into their corresponding Unicode
>> code points than a finite state machine.
>>
>> Since most (if not all) character sets always have a
>> consistent number of BytePerCodepoint, this value can be
>> used to quickly get to any specific CodePoint in the UTF-8
>> encoded data. For ASCII strings this will have a value of
>> One.
> ****
> I gues in Peter's Fantsy Character Set Encoding this must be true; it is certainly NOT
> true in UTF-8 encoding. But don't let Reality interfere with your design!

This from a guy that does not even take the time to spell "guess" or
"Fantasy" correctly.

If you are not a liar then show an error in the above regular
expression, I dare you.

I studied the derivation of the above regular expression in considerable
depth. I understand UTF-8 encoding quite well. So far I have found no
error. It is published on w3c.org.

You may be right about using UTF-8 as an internal representation. This
thread is my strongest "devil's advocate" case for using UTF-8 as an
internal representation.

> *****
>>
>> In those rare cases where a single utf8string has differing
>> length byte sequences representing CodePoints, then the
>> std::vector<unsigned int> Index;
>> is derived. This std::vector stores the subscript within
>> Data where each CodePoint begins. It is derived once during
>> construction which is when validation occurs. A flag value
>> of Zero is assigned to BytePerCodepoint to indicate that the
>> Index is needed.
> ****
> And this is "faster" exactly HOW? It uses "less space" exactly HOW?

ASCII only needs a single byte per code point.
utf8string::BytePerCodepoint = 1; // for ASCII
ASCII does not need to use the Index data member.

>
> Sounds to me like a horrible kludge that solves a problem that should not need to exist.
>
> And in looking at the encodings, I have found more exceptions to the encodings than the
> number of regular expressions given, so either some letters are not being included or some
> non-letters ARE being included. But hey, apparently this regexp was found "On the
> Internet" so it MUST be correct! I already pointed out in an earlier post why it is
> almost certainly deeply flawed.
>>
>> For the ASCII character set the use of utf8string is just as
>> fast and uses hardly any more space than std::string. For
>> other character sets utf8string is most often just as fast
>> as std::string, and only uses a minimal increment of
>> additional space only when needed. Even Chinese most often
>> only takes three bytes.
> ****
> And this handles these encodings HOW? by assuming they all take 3 bytes?
> joe
> ****
>>
> Joseph M. Newcomer [MVP]
> email: newcomer(a)flounder.com
> Web: http://www.flounder.com
> MVP Tips: http://www.flounder.com/mvp_tips.htm