From: Leigh Johnston on
"Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote in message
news:oJWdnaEbIuq5nm3WnZ2dnUVZ_tmdnZ2d(a)giganews.com...
>
> Neither std::string nor std::wstring know anything at all about Unicode.
> All Unicode based operations require very substantial manual intervention
> to work correctly with std::string or std::wstring. utf8string makes all
> of this transparent to the user.

std::string can be used to hold UTF-8, and std::wstring can be used to hold
UTF-16. I see no need for a class which tries to do both and does so
suboptimally.

> There are very few instances where the utf8string need be converted to
> individual code points. In almost all cases there is no need for this.
> If you are mixing character sets with differing byte length encodings
> (such as Chinese and English) in the same utf8string, then this would be
> needed. I can't imagine any other reasons to need to translate from UTF-8
> to code points.

The host operating system's native Unicode encoding is unlikely to be UTF-8;
it is more likely to be UTF-16 (as is the case on Windows), since UTF-16 is
more space-efficient than UTF-8 when support for non-Latin character sets is
required. Manipulating UTF-16 will always be more efficient than
manipulating UTF-8, as UTF-16 is a fixed-length encoding (surrogate pairs
aside) whereas UTF-8 is a variable-length encoding.

> UTF-8 is the standard Unicode data interchange format. This aspect is
> crucial to internet-based applications. Unlike other encodings, UTF-8 works
> the same way on every machine architecture, requiring no accounting or
> adaptation for byte order (little or big endian).

Nobody said otherwise.

>
> utf8string handles all of the conversions needed transparently. Most often
> no conversion is needed. Because of this it is easier to use than the
> methods that you propose. It always works for any character set with
> maximum speed and less space.

Like I said, your solution is suboptimal. Working with std::string and
std::wstring and providing free functions to convert between the different
encodings is not suboptimal, especially when the host operating system's
native Unicode encoding is not UTF-8.
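
By way of illustration, one such free function might look like this (a
sketch only, using the Win32 conversion API; the function name is mine and
error handling is omitted):

    #include <windows.h>
    #include <string>

    // UTF-8 held in a std::string -> UTF-16 held in a std::wstring.
    // The reverse direction would use WideCharToMultiByte in the same way.
    std::wstring utf8_to_utf16(const std::string& utf8)
    {
        if (utf8.empty()) return std::wstring();
        int len = ::MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                                        static_cast<int>(utf8.size()), NULL, 0);
        std::wstring utf16(len, L'\0');
        ::MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                              static_cast<int>(utf8.size()), &utf16[0], len);
        return utf16;
    }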

>
> If the use is focused on Asian character sets, then a UTF-16 string would
> take less space. If an application must handle every character set, then
> the space savings for ASCII will likely outweigh the additional space
> cost over UTF-16. The reason for this is that studies have shown that the
> United States consumes about one half of the world's supply of software.
> In any case, conversions can be provided between utf8string and
> utf16string. utf16string would have an identical design.

This is not a good argument; I am not convinced of the usefulness of these
"studies" you cite. We live in an internationalized world, not in a
USA-centric world.

/Leigh

From: Öö Tiib on
On 16 May, 17:46, Peter Olcott <NoS...(a)OCR4Screen.com> wrote:
>
> UTF-8 is the best Unicode data-interchange format because it works
> exactly the same way across every machine architecture without the need
> for separate adaptations. It also stores the entire ASCII character set
> in a single byte per code point.

Similarly, Portable Network Graphics is a good format for interchanging
raster graphics. GIMP, Photoshop, etc., however, do not use such a packed
format internally for manipulating graphics. They use their own internal
formats to achieve manipulation speed and flexibility. You insist on using
an interchange format for manipulation. Whether that is a good or bad idea
depends on context.

> I will put it together because it will become one of my standard tools.
> The design is now essentially complete. Coding this updated design will
> go very quickly. I will put it on my website and provide a free license
> for any use as long as the copyright notice remains in the source code.

Great.
From: I V on
On Sun, 16 May 2010 07:34:11 -0500, Peter Olcott wrote:
> Since most (if not all) character sets always have a consistent number
> of BytePerCodepoint, this value can be used to quickly get to any

I'm not sure that's true. Certainly, the Latin character set(s) that one
would use for most Western languages mixes one- and two-byte characters
(think again of James Kanze's example of naïve). Non-Latin character sets
would also frequently use ASCII punctuation (e.g. writing in Greek and
Cyrillic).
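For example, "naïve" is five code points but six bytes in UTF-8: the 'ï'
(U+00EF) encodes as the two bytes 0xC3 0xAF, while the other four letters
take one byte each.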

> In those rare cases where a single utf8string has byte sequences of
> differing lengths representing CodePoints, the
> std::vector<unsigned int> Index;
> is derived. This std::vector stores the subscript within Data where each
> CodePoint begins. It is derived once during construction, which is when
> validation occurs. A flag value of Zero assigned to BytePerCodepoint
> indicates that the Index is needed.

Note that in this case you are storing an integer per character; this is
likely to be four bytes, plus at least one byte for the character itself,
that is, one more byte than if you had just translated into UTF-32.
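To put rough figures on it (assuming a four-byte unsigned int): a
1,000-code-point string that needs the index occupies at least 1,000 bytes
of UTF-8 data plus 4,000 bytes of index, around 5,000 bytes in total, versus
exactly 4,000 bytes stored as UTF-32.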
From: Peter Olcott on
On 5/16/2010 3:12 PM, I V wrote:
> On Sun, 16 May 2010 07:34:11 -0500, Peter Olcott wrote:
>> Since most (if not all) character sets always have a consistent number
>> of BytePerCodepoint, this value can be used to quickly get to any
>
> I'm not sure that's true. Certainly, the Latin character set(s) that one
> would use for most Western languages mixes one- and two-byte characters
> (think again of James Kanze's example of naïve). Non-Latin character sets
> would also frequently use ASCII punctuation (e.g. writing in Greek and
> Cyrillic).

Yes, those would be exceptions.

>
>> In those rare cases where a single utf8string has byte sequences of
>> differing lengths representing CodePoints, the
>> std::vector<unsigned int> Index;
>> is derived. This std::vector stores the subscript within Data where each
>> CodePoint begins. It is derived once during construction, which is when
>> validation occurs. A flag value of Zero assigned to BytePerCodepoint
>> indicates that the Index is needed.
>
> Note that in this case you are storing an integer per character; this is
> likely to be four bytes, plus at least one byte for the character itself,
> that is, one more byte than if you had just translated into UTF-32.

Yes, but it depends on how often this is needed. Even if it is needed all
the time, we still have the advantage of speed. Because almost everything
(especially including I/O) requires no conversion, the utf8string may be
faster to the extent that conversions are eliminated. Almost every operation
takes about the same time as std::string.
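
As a rough sketch of the access path I have in mind (member names as above;
the helper shown is illustrative, not final code):

    // Byte offset in Data where the N'th CodePoint begins.  When every
    // CodePoint in the string uses the same number of bytes
    // (BytePerCodepoint > 0) this is a simple multiplication; otherwise
    // (BytePerCodepoint == 0) the Index built during validation is used.
    unsigned int utf8string::offset_of(unsigned int n) const
    {
        if (BytePerCodepoint != 0)
            return n * BytePerCodepoint;
        return Index[n];
    }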

The primary reason that I am focusing on UTF-8 is that I want to
internationalize I/O. The scripting language encoding will be UTF-8 so
that everyone can write scripts in their native language character set.
I must also provide internationalized output from my OCR4Screen
character recognition engine.
From: Joseph M. Newcomer on
OMG! Another Peter "Anyone who tells me the answer is different than my preconceived
answer is an idiot, and here's the proof!" post.

Why am I not surprised?

Of course, the "justification" is the usual "fast and easy" and like most anti-Unicode
answers still thinks that string size actually matters in any but the most esoteric
situations. I wonder how people actually manage to set priorities when they have no
concept of costs. Space is generally a useless argument, and speed is certainly slower
when you have to keep checking for MBCS encodings of any sort. Plus, you can't write code
that passes character values around.

More below...

On Sun, 16 May 2010 07:34:11 -0500, "Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote:

>Since the reason for using encodings other than UTF-8 is
>speed and ease of use, a string that is as fast and easy to
>use as the strings of other encodings, and that often takes
>less space, would be superior to those alternative strings.
>
>I have derived a design for a utf8string that implements the
>most useful subset of std::string. I match the std::string
>interface to keep the learning curve to an absolute minimum.
>
>I just figured out a way to make most utf8string
>operations take about the same amount of time and space as
>std::string operations. All of the other utf8string
>operations take only a minimal amount of additional time
>and space over std::string. These operations involve
>construction/validation and converting to and from Unicode
>CodePoints.
>
>class utf8string {
> unsigned int BytePerCodepoint;
****
What does this value represent, and why is it not declared UINT?
****
> std::vector<unsigned char> Data;
> std::vector<unsigned int> Index;
****
What is an index array for? Positions in the string? Sounds to me like this is going to
be FAR less efficient in space than a Unicode string. Since the bytes required for each
code point are 1, 2, 3 or 4, and each code point determines how many bytes are required,
it is not clear how a single integer can encode this.
>};
>
>I use this regular expression, found at this link:
> http://www.w3.org:80/2005/03/23-lex-U
>
>1 ['\u0000'-'\u007F']
>2 | (['\u00C2'-'\u00DF'] ['\u0080'-'\u00BF'])
>3 | ( '\u00E0' ['\u00A0'-'\u00BF'] ['\u0080'-'\u00BF'])
>4 | (['\u00E1'-'\u00EC'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
>5 | ( '\u00ED' ['\u0080'-'\u009F'] ['\u0080'-'\u00BF'])
>6 | (['\u00EE'-'\u00EF'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
>7 | ( '\u00F0' ['\u0090'-'\u00BF'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
>8 | (['\u00F1'-'\u00F3'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
>9 | ( '\u00F4' ['\u0080'-'\u008F'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
>
>to build a finite state machine (DFA) recognizer for UTF-8
>strings. There is no faster or simpler way to validate and
>divide a string of bytes into its corresponding Unicode
>code points than a finite state machine.
>
>Since most (if not all) character sets always have a
>consistent number of BytePerCodepoint, this value can be
>used to quickly get to any specific CodePoint in the UTF-8
>encoded data. For ASCII strings this will have a value of
>One.
****
I guess in Peter's Fantasy Character Set Encoding this must be true; it is
certainly NOT true in UTF-8 encoding. But don't let Reality interfere with
your design!
*****
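****
For reference, in UTF-8 the byte count is a property of each individual lead
byte, not of the string as a whole; a minimal sketch of the usual test looks
like this (the helper name is illustrative only):

    // Bytes in the UTF-8 sequence that starts with lead byte 'b', or
    // 0 if 'b' cannot start a sequence (continuation or invalid byte).
    inline int utf8_sequence_length(unsigned char b)
    {
        if (b < 0x80) return 1;   // 0xxxxxxx  ASCII
        if (b < 0xC2) return 0;   // continuation byte or overlong lead
        if (b < 0xE0) return 2;   // 110xxxxx
        if (b < 0xF0) return 3;   // 1110xxxx
        if (b < 0xF5) return 4;   // 11110xxx, up to U+10FFFF
        return 0;                 // invalid lead byte
    }
****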
>
>In those rare cases where a single utf8string has byte
>sequences of differing lengths representing CodePoints, the
> std::vector<unsigned int> Index;
>is derived. This std::vector stores the subscript within
>Data where each CodePoint begins. It is derived once during
>construction, which is when validation occurs. A flag value
>of Zero assigned to BytePerCodepoint indicates that the
>Index is needed.
****
And this is "faster" exactly HOW? It uses "less space" exactly HOW?

Sounds to me like a horrible kludge that solves a problem that should not need to exist.

And in looking at the encodings, I have found more exceptions to the
encodings than the number of regular expressions given, so either some
letters are not being included or some non-letters ARE being included. But
hey, apparently this regexp was found "On the Internet" so it MUST be
correct! I already pointed out in an earlier post why it is almost certainly
deeply flawed.
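
For completeness, here is a minimal sketch of the construction step being
described: one validation pass that also records where each code point
starts. The function name is illustrative, and the range checks simply
transcribe the table quoted above (which rejects overlong forms, surrogates
and values above U+10FFFF):

    #include <vector>

    // Validate 'data' as UTF-8 and record the byte offset of each code
    // point in 'index'.  Returns false at the first invalid sequence.
    bool validate_and_index(const std::vector<unsigned char>& data,
                            std::vector<unsigned int>& index)
    {
        index.clear();
        unsigned int i = 0;
        while (i < data.size()) {
            unsigned char b = data[i];
            unsigned int len;
            unsigned char lo = 0x80, hi = 0xBF;  // usual continuation range
            if (b <= 0x7F)                   len = 1;
            else if (b >= 0xC2 && b <= 0xDF) len = 2;
            else if (b == 0xE0)              { len = 3; lo = 0xA0; }
            else if (b >= 0xE1 && b <= 0xEC) len = 3;
            else if (b == 0xED)              { len = 3; hi = 0x9F; }
            else if (b >= 0xEE && b <= 0xEF) len = 3;
            else if (b == 0xF0)              { len = 4; lo = 0x90; }
            else if (b >= 0xF1 && b <= 0xF3) len = 4;
            else if (b == 0xF4)              { len = 4; hi = 0x8F; }
            else return false;               // continuation or invalid lead
            if (i + len > data.size()) return false;
            for (unsigned int k = 1; k < len; ++k) {
                // The restricted range applies to the first continuation
                // byte only; the rest must be 0x80..0xBF.
                unsigned char lo_k = (k == 1) ? lo : 0x80;
                unsigned char hi_k = (k == 1) ? hi : 0xBF;
                if (data[i + k] < lo_k || data[i + k] > hi_k) return false;
            }
            index.push_back(i);
            i += len;
        }
        return true;
    }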
>
>For the ASCII character set the use of utf8string is just as
>fast and uses hardly any more space than std::string. For
>other character sets utf8string is most often just as fast
>as std::string, and uses a minimal increment of additional
>space only when needed. Even Chinese most often takes only
>three bytes per code point.
****
And this handles these encodings HOW? By assuming they all take 3 bytes?
joe
****
>
Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm