New utf8string design may make UTF-8 the superior encoding [MFC]

Prev: UTF-8 string in MBCS project
Next: Love Potion for Miss Blandish

From: Peter Olcott on 17 May 2010 23:35

On 5/17/2010 4:58 PM, Joshua Maurice wrote:
> On May 17, 12:25 pm, I V<ivle...(a)gmail.com> wrote:
>> On Mon, 17 May 2010 08:08:22 -0500, Peter Olcott wrote:
>>> Do you know of any faster way to validate and divide a UTF-8 sequence
>>> into its constituent code point parts than a regular expression
>>> implemented as a finite state machine? (please don't cite a software
>>> package, I am only interested in the underlying methodology).
>>
>> A finite state machine sounds like a good plan, but I'd be a bit
>> surprised if a regular expression was faster than a state machine
>> specifically written to parse UTF-8. Aside from the unnecessary
>> generality of regular expressions (I don't really know if that would
>> actually make them slower in this case), I would guess a regular
>> expression engine wouldn't take advantage of the way that UTF-8 encodes
>> the meaning of each byte (single-byte codepoint, first byte of multi-byte
>> code-point, or continuation of a multi-byte codepoint) in the most-
>> significant two bits of the byte.
>
> This sounds a little overkill to me, all of this talk of regular
> expressions, finite state machines, etc.
>
> Can't you just do something like the following? I understand that it
> is a finite state machine in fact, but it uses no frameworks, no
> regular expressions, etc. I'd expect that this is pretty good in terms
> of speed and readability. It would be quite simple to add some code
> using bit operations to convert from the utf8 array to Unicode code
> points.

The finite state machine's detailed design is now completed. Its state
transition matrix only takes 2048 bytes. It will be faster than any
other possible method.

The finite state machine is semantically identical to the regular
expression. I am somewhat enamored with DFA recognizers. I love them. I
will post the source code when it is completed.

A few years ago I beat both Microsoft and Borland's std::string by as
much as more than 100-fold. I posted the code this this group. It was
called FastString if anyone cares to look this up.

>
> //COMPLETELY UNTESTED
> bool validate_utf8(unsigned char * utf8str_start, unsigned char *
> utf8str_end)
> {
> for (unsigned char * x = utf8str_start; x != utf8str_end; )
> {
> if ((*x& 0x80) == 0)
> {
> ++x;
> }
> else if ((*x& (0x80 + 0x40 + 0x20)) == (0x80 + 0x40))
> {
> if (++x == utf8str_end || (*x& (0x80 + 0x40)) != (0x80))
> return false;
> ++x;
> }
> else if ((*x& (0x80 + 0x40 + 0x20 + 0x10)) == (0x80 + 0x40 +
> 0x20))
> {
> if (++x == utf8str_end || (*x& (0x80 + 0x40)) != (0x80))
> return false;
> if (++x == utf8str_end || (*x& (0x80 + 0x40)) != (0x80))
> return false;
> ++x;
> }
> else if ((*x& (0x80 + 0x40 + 0x20 + 0x10 + 0x08)) == (0x80 + 0x40
> + 0x20 + 0x10))
> {
> if (++x == utf8str_end || (*x& (0x80 + 0x40)) != (0x80))
> return false;
> if (++x == utf8str_end || (*x& (0x80 + 0x40)) != (0x80))
> return false;
> if (++x == utf8str_end || (*x& (0x80 + 0x40)) != (0x80))
> return false;
> ++x;
> } else
> return false;
> }
> return true;
> }

From: Hector Santos on 18 May 2010 01:31

Peter Olcott wrote:

> A few years ago I beat both Microsoft and Borland's std::string by as
> much as more than 100-fold. I posted the code this this group. It was
> called FastString if anyone cares to look this up.

Oh Stop it. That thread only ended up like all your messages -
proving you wrong.

--
HLS

From: Joshua Maurice on 18 May 2010 01:40

On May 17, 8:35 pm, Peter Olcott <NoS...(a)OCR4Screen.com> wrote:
> The finite state machine's detailed design is now completed. Its state
> transition matrix only takes 2048 bytes. It will be faster than any
> other possible method.
>
> The finite state machine is semantically identical to the regular
> expression. I am somewhat enamored with DFA recognizers. I love them. I
> will post the source code when it is completed.

I'd love to see it (and test it). I'd be a little surprised that
anything using something called "regular expressions" could beat the
code I just posted in terms of speed or executable size. (Unless
perhaps the regex library itself outputted c code, which was then
compiled down. I could probably buy it then.)

> A few years ago I beat both Microsoft and Borland's std::string by as
> much as more than 100-fold. I posted the code this this group. It was
> called FastString if anyone cares to look this up.

Beat them in what? What operations? I hope you aren't comparing copy-
on-write string to a copy-on-copy implementation. (Is there a better
name for not "copy on write"?)

From: Mihai N. on 18 May 2010 03:36

> It will be faster than any other possible method.

Not only "the fastest method today" but faster than any possible method.
Really?
Never make such absolute statements.
Looks like a sure sign of delusion.
No matter how fast the method, someone, at some point, will have something
faster. It's just how things work.

> I will post the source code when it is completed.
You do that.
That way it is easyer to measure and compare with other methods.
Estimates mean nothing.

--
Mihai Nita [Microsoft MVP, Visual C++]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email

From: Mihai N. on 18 May 2010 03:38

> //COMPLETELY UNTESTED
Then most likely wrong :-)

--
Mihai Nita [Microsoft MVP, Visual C++]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email

First | Prev | Next | Last
Pages: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
Prev: UTF-8 string in MBCS project
Next: Love Potion for Miss Blandish