From: Joseph M. Newcomer on
See below...
On Mon, 17 May 2010 08:20:39 -0500, Peter Olcott <NoSpam(a)OCR4Screen.com> wrote:

>On 5/16/2010 10:29 PM, Joseph M. Newcomer wrote:
>> See below...
>> On Sun, 16 May 2010 09:46:16 -0500, Peter Olcott<NoSpam(a)OCR4Screen.com> wrote:
>>
>>> On 5/16/2010 8:51 AM, �� Tiib wrote:
>>>> On May 16, 15:34, "Peter Olcott"<NoS...(a)OCR4Screen.com> wrote:
>>>>> Since the reason for using encodings other than UTF-8 is
>>>>> speed and ease of use, a string that is as fast and easy to
>>>>> use as the strings of other encodings, and that often takes
>>>>> less space, would be superior to those alternative strings.
>>>>
>>>> If you care so much ... perhaps throw together your utf8string and let
>>>> us see it. Perhaps test & profile it first to compare with
>>>> Glib::ustring. http://library.gnome.org/devel/glibmm/2.23/classGlib_1_1ustring.html
>>>>
>>>> I suspect UTF-8 will gradually fade into history. The reasons are similar
>>>> to why 256-color video modes and raster graphics formats went away. GUIs
>>>> are already often written in Java or C# (for lack of C++ devs), and these
>>>> use UTF-16 internally. Notice that modern processor architectures are
>>>> already optimized in ways that often make byte-level operations
>>>> slower.
>>>
>>> UTF-8 is the best Unicode data-interchange format because it works
>>> exactly the same way across every machine architecture without the need
>>> for separate adaptations. It also stores the entire ASCII character set
>>> in a single byte per code point.
>> ****
>> How do we make the leap from "best data interchange format" to "best internal
>> representation"? I fail to see the correlation here. Or why a parser for a C-like
>> language needs to "save space" by foolish examples of optimiztion. THis issue does not
>> become important until the files start approaching the gigabyte range.
>> ****
>
>My proposed solution would be much more efficient when doing a string
>search on a large text file encoded as UTF-8.
****
No, it doesn't even work if you have UTF-32. For example, suppose I want to search for
"ä" (LATIN SMALL LETTER A WITH DIAERESIS).

I can search for U+000000E4, or for the canonically equivalent pair U+00000061 U+00000308
(the letter "a" followed by COMBINING DIAERESIS).

So I fail to see how you can want or care about doing string search in a text file in
UTF-8 when it has to be complex even for a "flat" UTF-32 file. And why in the world
would you want to search the UTF-8 file? That is not the job of a compiler.
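
A minimal C++ sketch of that equivalence problem, using "Händel" as an
illustrative haystack (the precomposed and decomposed spellings are
canonically equivalent, but a byte-level find() only matches the form it
was handed):

    // Two canonically equivalent UTF-8 spellings of the same character.
    #include <iostream>
    #include <string>

    int main()
    {
        std::string precomposed = "\xC3\xA4";       // U+00E4, a-with-diaeresis
        std::string decomposed  = "a\xCC\x88";      // U+0061 followed by U+0308
        std::string haystack    = "Ha\xCC\x88ndel"; // "Händel", decomposed form

        std::cout << std::boolalpha
                  << (haystack.find(precomposed) != std::string::npos) << "\n"  // false
                  << (haystack.find(decomposed)  != std::string::npos) << "\n"; // true
    }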
****
>
>>>
>>> I will put it together because it will become one of my standard tools.
>>> The design is now essentially complete. Coding this updated design will
>>> go very quickly. I will put it on my website and provide a free license
>>> for any use as long as the copyright notice remains in the source code.
>>>
>> *****
>> And if you do not measure its performance, and express that in terms of time and space,
>> and demonstrate that it runs no slower and consumes enough less space to make a
>> difference, it is all a colossal waste of time. Most of us know that it will be slower
>> and either consume insignificantly less space or, given the need of the index vector,
>> vastly more space (thus also making it slower, because all accesses must be mediated by
>> the index vector), or end up being measured just in the trivial subcase of 8-bit character
>> input (not an important measure); either way, the whole design seems just flat-out wrong.
>> joe
>
>I just provided one concrete example above that proves the superiority
>of this design, at least for that example. For text string
>search where the data is encoded as UTF-8, my proposed solution would be
>much faster because no conversion to and from UTF-8 is required.
>
>> ****
>> joe
>> *****
>> Joseph M. Newcomer [MVP]
>> email: newcomer(a)flounder.com
>> Web: http://www.flounder.com
>> MVP Tips: http://www.flounder.com/mvp_tips.htm
Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
From: Jonathan Lee on
On May 16, 8:34 am, "Peter Olcott" <NoS...(a)OCR4Screen.com> wrote:
> In those rare cases where a single utf8string has differing
> length byte sequences representing CodePoints, then the
>   std::vector<unsigned int> Index;
> is derived. This std::vector stores the subscript within
> Data where each CodePoint begins.

Wouldn't this require 4 bytes per character (reasonably
assuming sizeof(unsigned int) == 4)? Or you'd have to use
an unsigned short or something, bringing your max string
length down accordingly.

You may as well store the string in UTF-32, and provide
an optimization for the case where all Unicode characters
are ASCII.
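
A rough sketch of the space arithmetic, assuming sizeof(unsigned int) == 4
and one index entry per code point (BuildIndex here is a hypothetical
helper, not part of any existing library):

    //   UTF-8 data    : 1..4 bytes per code point
    //   index vector  : 4 bytes per code point
    //   total         : 5..8 bytes per code point
    //   UTF-32 data   : always 4 bytes per code point
    #include <string>
    #include <vector>

    // Byte offset at which each code point begins.
    std::vector<unsigned int> BuildIndex(const std::string &utf8)
    {
        std::vector<unsigned int> index;
        for (unsigned int i = 0; i < utf8.size(); ++i)
            if ((static_cast<unsigned char>(utf8[i]) & 0xC0) != 0x80) // skip continuation bytes
                index.push_back(i);
        return index;
    }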

--Jonathan
From: Peter Olcott on
On 5/17/2010 10:48 AM, Joseph M. Newcomer wrote:
> The underlying technology is discussed in the Unicode documentation and on
> www.unicode.org. There is a set of APIs that deliver character information, including
> class information, as part of the Unicode support in Windows. But the point is,
> thinking of Unicode code points by writing a regexp for UTF-8 is not a reasonable
> approach.
>
> Or to put it bluntly, the regexp set you show is wrong, I have shown it is wrong, and you
> have to start thinking correctly about the problem.
> joe

No, you did not show that it was wrong for its intended purpose of
validating byte sequences as valid UTF-8 and dividing those sequences
into their corresponding code points.

You merely provided examples of things that it was not intended to do,
which does not show that it is incorrect when measured against its
intended purpose.
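
For reference, the well-formed UTF-8 byte sequences that any such validator
has to accept are enumerated in Table 3-7 of the Unicode Standard:

    Code points            1st byte   2nd byte   3rd byte   4th byte
    U+0000   .. U+007F     00..7F
    U+0080   .. U+07FF     C2..DF     80..BF
    U+0800   .. U+0FFF     E0         A0..BF     80..BF
    U+1000   .. U+CFFF     E1..EC     80..BF     80..BF
    U+D000   .. U+D7FF     ED         80..9F     80..BF
    U+E000   .. U+FFFF     EE..EF     80..BF     80..BF
    U+10000  .. U+3FFFF    F0         90..BF     80..BF     80..BF
    U+40000  .. U+FFFFF    F1..F3     80..BF     80..BF     80..BF
    U+100000 .. U+10FFFF   F4         80..8F     80..BF     80..BF

Anything outside these ranges (for example a C0 or C1 lead byte, or an E0
lead byte followed by 80..9F) is ill-formed, which is what makes the
recognizer more than a simple check of the top two bits of each byte.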
From: I V on
On Mon, 17 May 2010 08:08:22 -0500, Peter Olcott wrote:
> Do you know of any faster way to validate and divide a UTF-8 sequence
> into its constituent code point parts than a regular expression
> implemented as a finite state machine? (please don't cite a software
> package, I am only interested in the underlying methodology).

A finite state machine sounds like a good plan, but I'd be a bit
surprised if a regular expression was faster than a state machine
specifically written to parse UTF-8. Aside from the unnecessary
generality of regular expressions (I don't really know if that would
actually make them slower in this case), I would guess a regular
expression engine wouldn't take advantage of the way that UTF-8 encodes
the meaning of each byte (single-byte codepoint, first byte of multi-byte
code-point, or continuation of a multi-byte codepoint) in the most-
significant two bits of the byte.
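
A sketch of what such a hand-written scanner might look like, classifying
each byte by its high bits (structural checks only; rejecting overlong
forms and surrogate code points would need additional range checks):

    #include <cstddef>
    #include <string>

    // Length in bytes of the code point starting at s[i], or 0 if the
    // sequence is structurally malformed at that position.
    std::size_t CodePointLength(const std::string &s, std::size_t i)
    {
        unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t len;
        if      (lead < 0x80) len = 1;   // 0xxxxxxx: single-byte code point
        else if (lead < 0xC0) return 0;  // 10xxxxxx: stray continuation byte
        else if (lead < 0xE0) len = 2;   // 110xxxxx: two-byte lead
        else if (lead < 0xF0) len = 3;   // 1110xxxx: three-byte lead
        else if (lead < 0xF8) len = 4;   // 11110xxx: four-byte lead
        else return 0;                   // 11111xxx: never valid in UTF-8

        if (i + len > s.size()) return 0;
        for (std::size_t k = 1; k < len; ++k)
            if ((static_cast<unsigned char>(s[i + k]) & 0xC0) != 0x80)
                return 0;
        return len;
    }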
From: Peter Olcott on
On 5/17/2010 2:25 PM, I V wrote:
> On Mon, 17 May 2010 08:08:22 -0500, Peter Olcott wrote:
>> Do you know of any faster way to validate and divide a UTF-8 sequence
>> into its constituent code point parts than a regular expression
>> implemented as a finite state machine? (please don't cite a software
>> package, I am only interested in the underlying methodology).
>
> A finite state machine sounds like a good plan, but I'd be a bit
> surprised if a regular expression was faster than a state machine
> specifically written to parse UTF-8. Aside from the unnecessary
> generality of regular expressions (I don't really know if that would
> actually make them slower in this case), I would guess a regular
> expression engine wouldn't take advantage of the way that UTF-8 encodes
> the meaning of each byte (single-byte codepoint, first byte of multi-byte
> code-point, or continuation of a multi-byte codepoint) in the most-
> significant two bits of the byte.

I was originally thinking that I would only need 256 * 4 bytes to encode
a complete DFA recognizer using the simplest possible design. Now it
looks like I need 256 ^ 4 bytes for the simplest possible design.
Apparently Lex knows how to make this much more concise, so that is not
the simplest possible design.
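
For what it is worth, a table-driven byte DFA needs one 256-entry row per
state, so its size is (number of states) * 256 entries rather than 256 ^ 4
bytes. A minimal sketch, assuming structural validation only (fully
rejecting overlong forms and surrogates takes a few extra states for the
E0, ED, F0 and F4 lead bytes):

    #include <cstdint>
    #include <string>

    enum State : std::uint8_t {
        START = 0,   // expecting a lead byte
        CONT1,       // one continuation byte still expected
        CONT2,       // two continuation bytes still expected
        CONT3,       // three continuation bytes still expected
        REJECT,
        NUM_STATES
    };

    // NUM_STATES * 256 entries = 1280 bytes for the whole transition table.
    static std::uint8_t transition[NUM_STATES][256];

    void InitTable()   // call once before validating
    {
        for (int s = 0; s < NUM_STATES; ++s)
            for (int b = 0; b < 256; ++b)
                transition[s][b] = REJECT;
        for (int b = 0x00; b <= 0x7F; ++b) transition[START][b] = START;  // ASCII
        for (int b = 0xC2; b <= 0xDF; ++b) transition[START][b] = CONT1;  // 2-byte lead
        for (int b = 0xE0; b <= 0xEF; ++b) transition[START][b] = CONT2;  // 3-byte lead
        for (int b = 0xF0; b <= 0xF4; ++b) transition[START][b] = CONT3;  // 4-byte lead
        for (int b = 0x80; b <= 0xBF; ++b) {                              // continuations
            transition[CONT1][b] = START;
            transition[CONT2][b] = CONT1;
            transition[CONT3][b] = CONT2;
        }
    }

    bool IsStructurallyValidUtf8(const std::string &s)
    {
        std::uint8_t state = START;
        for (unsigned char c : s)
            state = transition[state][c];
        return state == START;   // must not end in the middle of a sequence
    }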