New utf8string design may make UTF-8 the superior encoding [MFC]

From: Peter Olcott on 17 May 2010 09:20

On 5/16/2010 10:29 PM, Joseph M. Newcomer wrote:
> See below...
> On Sun, 16 May 2010 09:46:16 -0500, Peter Olcott<NoSpam(a)OCR4Screen.com> wrote:
>
>> On 5/16/2010 8:51 AM, Öö Tiib wrote:
>>> On 16 mai, 15:34, "Peter Olcott"<NoS...(a)OCR4Screen.com> wrote:
>>>> Since the reason for using other encodings than UTF-8 is
>>>> speed and ease of use, a string that is as fast and easy to
>>>> use (as the strings of other encodings) that often takes
>>>> less space would be superior to alternative strings.
>>>
>>> If you care so much ... perhaps throw together your utf8string and let
>>> us see it. Perhaps test & profile it first to compare with
>>> Glib::ustring. http://library.gnome.org/devel/glibmm/2.23/classGlib_1_1ustring.html
>>>
>>> I suspect UTF-8 will gradually fade into history, for reasons similar to
>>> why 256-color video modes and raster-graphic formats went away. GUIs are
>>> already often made with Java or C# (for lack of C++ devs) and these
>>> use UTF-16 internally. Notice that modern processor architectures are
>>> already optimized in such a way that byte-level operations are often
>>> slower.
>>
>> UTF-8 is the best Unicode data-interchange format because it works
>> exactly the same way across every machine architecture without the need
>> for separate adaptations. It also stores the entire ASCII character set
>> in a single byte per code point.
> ****
> How do we make the leap from "best data interchange format" to "best internal
> representation"? I fail to see the correlation here. Or why a parser for a C-like
> language needs to "save space" by foolish examples of optimization. This issue does not
> become important until the files start approaching the gigabyte range.
> ****

My proposed solution would be much more efficient when doing a string
search on a large text file encoded as UTF-8.

>>
>> I will put it together because it will become one of my standard tools.
>> The design is now essentially complete. Coding this updated design will
>> go very quickly. I will put it on my website and provide a free license
>> for any use as long as the copyright notice remains in the source code.
>>
> *****
> And if you do not measure its performance, and express that in terms of time and space,
> and demonstrate that it runs no slower and consumes enough less space to make a
> difference, it is all a colossal waste of time. Most of us know that it will be slower
> and either consume insignificantly less space or, given the need of the index vector,
> vastly more space (thus also making it slower, because all accesses must be mediated by
> the index vector), or end up being measured just in the trivial subcase of 8-bit character
> input (not an important measure), the whole design seems just flat-out wrong.
> joe

I just provided one concrete example above that proves the superiority
of this design at least for the example that I provided. For text string
search where the data is encoded as UTF-8 my proposed solution would be
much faster because no conversion to and from UTF-8 is required.
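Editorial sketch of the search case described above, assuming both the text and the pattern are already valid UTF-8. Because UTF-8 is self-synchronizing, a plain byte-level substring search can only match on code point boundaries, so no decoding step is needed. The helper names `find_utf8` and `code_point_index` are hypothetical and not part of any posted utf8string design, and the sketch says nothing about whether the saved conversion is measurable next to I/O cost.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: byte offset of the first match, or std::string::npos.
// Assumes haystack and needle are both valid UTF-8.
std::size_t find_utf8(const std::string& haystack, const std::string& needle) {
    return haystack.find(needle);
}

// Hypothetical helper: convert a byte offset into a code point index by
// counting the bytes before it that are not continuation bytes (10xxxxxx).
std::size_t code_point_index(const std::string& utf8, std::size_t byte_offset) {
    assert(byte_offset <= utf8.size());
    std::size_t count = 0;
    for (std::size_t i = 0; i < byte_offset; ++i) {
        const unsigned char b = static_cast<unsigned char>(utf8[i]);
        if ((b & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

Searching "naïve café" for "café" this way reports a byte offset, which `code_point_index` can turn into a character position only when a caller actually needs one.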

> ****
> joe
> *****
> Joseph M. Newcomer [MVP]
> email: newcomer(a)flounder.com
> Web: http://www.flounder.com
> MVP Tips: http://www.flounder.com/mvp_tips.htm

From: Peter Olcott on 17 May 2010 09:27

On 5/16/2010 10:40 PM, Joseph M. Newcomer wrote:
> On Sun, 16 May 2010 21:40:58 -0500, Peter Olcott<NoSpam(a)OCR4Screen.com> wrote:
>
>> On 5/16/2010 3:12 PM, I V wrote:
>>> On Sun, 16 May 2010 07:34:11 -0500, Peter Olcott wrote:
>>>> Since most (if not all) character sets always have a consistent number
>>>> of BytePerCodepoint, this value can be used to quickly get to any
>>>
>>> I'm not sure that's true. Certainly, the Latin character set(s) that one
>>> would use for most Western languages mixes one- and two-byte characters
>>> (think again of James Kanze's example of naïve). Non-Latin character sets
>>> would also frequently use ASCII punctuation (e.g. writing in Greek and
>>> Cyrillic).
>>
>> Yes those would be exceptions.
>>
>>>
>>>> In those rare cases where a single utf8string has byte sequences of
>>>> differing lengths representing CodePoints, then the
>>>> std::vector<unsigned int> Index;
>>>> is derived. This std::vector stores the subscript within Data where each
>>>> CodePoint begins. It is derived once during construction which is when
>>>> validation occurs. A flag value of zero assigned to BytePerCodepoint
>>>> indicates that the Index is needed.
>>>
>>> Note that in this case you are storing an integer per character; this is
>>> likely to be four bytes, plus at least one byte for the character itself,
>>> that is, one more byte than if you had just translated into UTF-32.
>>
>> Yes but it depends on how often this is needed. Even if it is needed all
>> the time, we still have the advantage of speed. Because almost
>> everything (especially including I/O) requires no conversion the
>> utf8string may be faster to the extent that conversions are eliminated.
>> Most every operation takes about the same time as std::string.
> ****
> Peter, stop focussing on such silly concepts as conversion time mattering in the
> slightest, and start giving RATIONAL reasons for your design decisions. This one makes no
> sense. There will be ORDERS OF MAGNITUDE greater differences in input time if you take
> rotational latency and seek time into consideration (in fact, opening the file will have
> orders of magnitude more variance than the cost of a UTF-8 to UTF-16 or even UTF-32
> conversion, because of the directory lookup time variance). So you are saying that you

This statement seems absurd to me, can you explain your reasoning?

> will save some fraction of a tenth of a percent of overall performance by not converting
> to UTF-16. At this point, anyone who has ever realistically done performance optimization
> is rolling on the floor at the failure to understand where the real problems are. You
> have to write and debug some complex class, and write all your code in terms of this
> class, and implement complex and probably erroneous regexps to handle the cases, to save
> an unmeasurably small amount of time at the input and output edges of your code? Get real!

Since it is my understanding that the fastest and simplest possible way
to validate and divide any UTF-8 sequence into its constituent code
point parts is a regular expression implemented as a finite state
machine, this statement would seem to be erroneous. A valid
counter-example would invalidate my statement.
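For comparison, here is an editorial sketch of validating a UTF-8 byte string and dividing it into code points with ordinary range checks instead of a generated regular expression. The byte ranges follow the well-formed sequence table in RFC 3629 (no overlong forms, no surrogates, nothing above U+10FFFF). The name `split_utf8` is hypothetical, and the sketch makes no claim about being faster or slower than a table-driven finite state machine; that would have to be measured.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: returns true if `s` is well-formed UTF-8 and fills
// `starts` with the byte offset where each code point begins. On failure
// it returns false and `starts` holds only the offsets accepted so far.
bool split_utf8(const std::string& s, std::vector<std::size_t>& starts) {
    starts.clear();
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        std::size_t len = 0;
        unsigned char lo = 0x80;  // allowed range of the second byte
        unsigned char hi = 0xBF;
        if (b0 <= 0x7F) {
            len = 1;
        } else if (b0 >= 0xC2 && b0 <= 0xDF) {
            len = 2;
        } else if (b0 >= 0xE0 && b0 <= 0xEF) {
            len = 3;
            if (b0 == 0xE0) lo = 0xA0;  // reject overlong 3-byte forms
            if (b0 == 0xED) hi = 0x9F;  // reject UTF-16 surrogates
        } else if (b0 >= 0xF0 && b0 <= 0xF4) {
            len = 4;
            if (b0 == 0xF0) lo = 0x90;  // reject overlong 4-byte forms
            if (b0 == 0xF4) hi = 0x8F;  // reject values above U+10FFFF
        } else {
            return false;  // 0x80..0xC1 and 0xF5..0xFF never start a sequence
        }
        if (n - i < len) {
            return false;  // truncated sequence at end of input
        }
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            const unsigned char min = (k == 1) ? lo : 0x80;
            const unsigned char max = (k == 1) ? hi : 0xBF;
            if (b < min || b > max) {
                return false;
            }
        }
        starts.push_back(i);
        i += len;
    }
    return true;
}
```

The `starts` vector plays the same role as the Index vector discussed in this thread: one stored offset per code point.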

>
> Why is it you keep inventing complex and unnecessary solutions to simple problems, and
> keep giving justifications that don't even make sense? "time" and "space" arguments are
> not credible here, because they are optimizing parameters that are so minute as to be
> within a fraction of a standard deviation of actual measured performance, or as those of
> us who used to worry about these things a LOT used to say "it is lost in the noise".

This is probably not the case for text search where text is encoded as
UTF-8. In this case the conversion time would likely be significant.

>
> We used to worry about optimizing things that actually MATTERED.
> joe
>
>
>>
>> The primary reason that I am focusing on UTF-8 is that I want to
>> internationalize I/O. The scripting language encoding will be UTF-8 so
>> that everyone can write scripts in their native language character set.
>> I must also provide internationalized output from my OCR4Screen
>> character recognition engine.
> Joseph M. Newcomer [MVP]
> email: newcomer(a)flounder.com
> Web: http://www.flounder.com
> MVP Tips: http://www.flounder.com/mvp_tips.htm

From: Leigh Johnston on 17 May 2010 09:30

"Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote in message
news:_redna1LAMZaomzWnZ2dnUVZ_tOdnZ2d(a)giganews.com...
> On 5/17/2010 1:35 AM, Mihai N. wrote:
>>
>>> I studied the derivation of the above regular expression in considerable
>>> depth. I understand UTF-8 encoding quite well. So far I have found no
>>> error.
>>
>>> It is published on w3c.org.
>>
>> Stop repeating this nonsense.
>>
>> The URL is http://www.w3.org/2005/03/23-lex-U and the post states:
>> "It is not endorsed by the W3C members, team, or any working group."
>>
>> It is a hack implemented by someone and it happens to be on the w3c
>> server.
>> This is not enough to make it right. If I post something on the free
>> blogging space offered by Microsoft, will you take it as law and say
>> "it is published on microsoft.com"?
>>
>>
> Do you know of any faster way to validate and divide a UTF-8 sequence into
> its constituent code point parts than a regular expression implemented as
> a finite state machine? (please don't cite a software package, I am only
> interested in the underlying methodology).
>
> To the very best of my knowledge (and I have a patent on a finite state
> recognizer) a regular expression implemented as a finite state machine is
> the fastest and simplest possible way of every way that can possibly exist
> to validate a UTF-8 sequence and divide it into its constituent parts.

My utf8_to_wide free function is not a finite state machine and it is pretty
fast. It takes a std::string as input and returns a std::wstring as output.
KISS.

/Leigh
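The utf8_to_wide function itself was not posted, so the following is only an editorial sketch of what a simple, non-table-driven conversion function can look like. It is not Leigh's code. To stay independent of the platform width of wchar_t, it assumes C++11 and writes UTF-32 into a std::u32string; the name `utf8_to_utf32` and the bool-plus-output-parameter interface are assumptions made for illustration.

```cpp
#include <cstddef>
#include <string>

// Hypothetical free function: decodes UTF-8 into UTF-32 code points.
// Returns false on malformed input; `out` then holds what was decoded so far.
bool utf8_to_utf32(const std::string& in, std::u32string& out) {
    out.clear();
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        char32_t min = 0;      // smallest value this length may encode
        std::size_t len = 0;
        if (b0 < 0x80)                { cp = b0;        len = 1; min = 0x0; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; len = 2; min = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; len = 3; min = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; len = 4; min = 0x10000; }
        else return false;     // stray continuation byte or invalid lead byte
        if (in.size() - i < len) {
            return false;      // truncated sequence
        }
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80) {
                return false;  // not a continuation byte
            }
            cp = (cp << 6) | (b & 0x3F);
        }
        // Reject overlong forms, surrogates, and values beyond U+10FFFF.
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            return false;
        }
        out.push_back(cp);
        i += len;
    }
    return true;
}
```

On a platform with a 16-bit wchar_t, a std::wstring version would additionally have to emit surrogate pairs for code points above U+FFFF.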

From: Peter Olcott on 17 May 2010 10:29

On 5/17/2010 1:44 AM, Joseph M. Newcomer wrote:
>> If you are not a liar then show an error in the above regular
>> expression, I dare you.
> ***
> I have already pointed out that it is insufficient for lexically recognizing accent marks
> or invalid combinations of accent marks. So the requirement of demonstrating an error is
> trivially met.
>
> In addition, the regexp values do not account for directional changes in the parse, which
> is essential, for reasons I explained in another response.

I have always defined correct to mean valid UTF-8 sequences (according
to the UTF-8 specification) and now you are presenting the red-herring
that it does not validate code point sequences. It is not supposed to
validate code point sequences.

The reason that I ALWAYS ask you to explain your reasoning is that this
most often provides the invalid assumptions that you are making.

>
> It would be easier if you had expressed it as Unicode codepoints; then it would be easy to
> show the numerous failures. I'm sorry, I thought you had already applied exhaustive
> categorical reasoning to this, which would have demonstrated the errors.

This level of detail is not relevant to the specific problem that I am
solving. The problem is providing a minimal cost way to permit people to
write GUI scripts in their native language. Anything that goes beyond
the scope of this problem is explicitly out-of-scope.

From: Peter Olcott on 17 May 2010 10:36

On 5/17/2010 3:12 AM, Oliver Regenfelder wrote:
> Hello,
>
> Joseph M. Newcomer wrote:
>>>> utf8string handles all of the conversions needed transparently. Most
>>>> often no conversion is needed. Because of this it is easier to use
>>>> than the methods that you propose. It always works for any character
>>>> set with maximum speed, and less space.
>>> Like I said your solution is suboptimal, working with std::string and
>>> std::wstring and providing free functions to convert between
>>> different encoding formats is not suboptimal especially when the host
>>> operating system's native Unicode encoding is not UTF-8.
>> ****
>> Now, now, you are trying to be RATIONAL! This never works.
>> joe
>
> I think you tried that with Peter yourself for a very long time.
>
> Best regards,
>
> Oliver

In any case, the counterexample showing that code points of differing
byte lengths quite often occur within one string, such as ASCII
punctuation mixed with non-ASCII characters, shows that using UTF-8 as
the internal data representation is not the best way.

And Joe's pointing out that the conversion costs between UTF-8 and
UTF-32 are trivial compared to the I/O costs further confirmed the
original decision to use UTF-32 as the internal representation.

This thread was to play the Devil's advocate to gain a deep
understanding of the reasoning behind the use of the differing encodings.
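The space argument from earlier in the thread (an integer per character for the index, on top of the UTF-8 bytes themselves) can be made concrete with a rough editorial estimate. This sketch assumes valid UTF-8 input, a 4-byte index entry per code point, and an index that is built only when code points differ in byte length, as in the proposed design. It ignores container overhead, and the names `Footprint` and `estimate_footprint` are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical result type: payload sizes in bytes, container overhead ignored.
struct Footprint {
    std::size_t code_points;
    std::size_t utf8_with_index;  // UTF-8 bytes plus index, if one is needed
    std::size_t utf32;            // 4 bytes per code point
};

// Hypothetical helper; assumes `utf8` is already valid UTF-8.
Footprint estimate_footprint(const std::string& utf8) {
    std::size_t cps = 0;
    std::size_t first_len = 0;
    bool uniform = true;  // true while every code point has the same length
    for (std::size_t i = 0; i < utf8.size();) {
        const unsigned char b = static_cast<unsigned char>(utf8[i]);
        const std::size_t len = (b < 0x80) ? 1 : (b < 0xE0) ? 2 : (b < 0xF0) ? 3 : 4;
        if (cps == 0) {
            first_len = len;
        } else if (len != first_len) {
            uniform = false;
        }
        ++cps;
        i += len;
    }
    // The index is only required for mixed-width strings.
    const std::size_t index_bytes = uniform ? 0 : cps * sizeof(std::uint32_t);
    return {cps, utf8.size() + index_bytes, cps * 4};
}
```

Under these assumptions, pure ASCII or pure two-byte text costs less than UTF-32, but a Greek string containing a single ASCII comma already needs the index and ends up larger than its UTF-32 form.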
