From: James Kanze on
On 17 May, 14:08, Peter Olcott <NoS...(a)OCR4Screen.com> wrote:
> On 5/17/2010 1:35 AM, Mihai N. wrote:
> Do you know of any faster way to validate and divide a UTF-8 sequence
> into its constituent code point parts than a regular expression
> implemented as a finite state machine? (please don't cite a software
> package, I am only interested in the underlying methodology).

If the goal is only to get to the start of the next character,
something like:
p += byteCount(*p);
is a lot faster; you don't even have to look at the intermediate
characters. If you don't know the validity of the UTF-8, and
want to check that at the same time, then I'd still start with
something like that. For checking for non-canonical
representations and such, regular expressions might be useful,
but they're likely to be a bit more complicated than what you
present.
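A sketch of such a byteCount helper (the name matches the snippet above; the mapping is just the standard UTF-8 lead-byte ranges, and could equally well be burned into a 256-byte table):

```cpp
#include <cstddef>

// Number of bytes in the UTF-8 sequence introduced by lead byte b;
// 0 flags bytes that cannot start a sequence (continuation bytes,
// the always-overlong leads 0xC0/0xC1, and 0xF5..0xFF).
inline std::size_t byteCount(unsigned char b)
{
    if (b < 0x80) return 1;     // ASCII
    if (b < 0xC2) return 0;     // continuation byte or overlong lead
    if (b < 0xE0) return 2;
    if (b < 0xF0) return 3;
    if (b < 0xF5) return 4;
    return 0;                   // would encode beyond U+10FFFF
}
```

With it, `p += byteCount(*p);` skips a whole character without ever touching its continuation bytes.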

> To the very best of my knowledge (and I have a patent on
> a finite state recognizer)

You have a patent on finite state machines? I suspect that
you'll find that prior art will invalidate it.

> a regular expression implemented as a finite state machine is
> the fastest and simplest possible way of every way that can
> possibly exist to validate a UTF-8 sequence and divide it into
> its constituent parts.

It all depends on the formal specification; one of the
characteristics of UTF-8 is that you don't have to look at every
character to find the length of a sequence. And a regular
expression generally will have to look at every character.

--
James Kanze
From: James Kanze on
On 17 May, 14:30, "Leigh Johnston" <le...(a)i42.co.uk> wrote:
> "Peter Olcott" <NoS...(a)OCR4Screen.com> wrote in message

> news:_redna1LAMZaomzWnZ2dnUVZ_tOdnZ2d(a)giganews.com...
> > On 5/17/2010 1:35 AM, Mihai N. wrote:

> My utf8_to_wide free function is not a finite state machine
> and it is pretty fast. It takes a std::string as input and
> returns a std::wstring as output. KISS.

Every program is a finite state machine:-).

Seriously, I'm with you on this one. UTF-8 was designed to be
very simple to convert to and from other Unicode formats.
A regular expression is just extra complications, for no real
gain. (My own UTF-8 handling doesn't use regular expressions
either, although I do have a RegularExpression class which works
on UTF-8 input, directly.)
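Leigh's function isn't shown, but a minimal decode loop in that spirit might look like the following sketch. It assumes a platform where wchar_t can hold any code point; invalid or truncated bytes become U+FFFD, and overlong forms and encoded surrogates are not rejected, to keep it short:

```cpp
#include <cstddef>
#include <string>

// Sketch only: decode UTF-8 to wide characters with no regular
// expressions, just the lead byte's own length information.
std::wstring utf8_to_wide(const std::string& s)
{
    std::wstring out;
    for (std::string::size_type i = 0; i < s.size(); ) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        std::size_t n;   // bytes in this sequence
        wchar_t cp;      // code point being accumulated
        if      (b < 0x80) { n = 1; cp = b; }
        else if (b < 0xC0) { n = 1; cp = 0xFFFD; }  // stray continuation
        else if (b < 0xE0) { n = 2; cp = b & 0x1F; }
        else if (b < 0xF0) { n = 3; cp = b & 0x0F; }
        else               { n = 4; cp = b & 0x07; }
        std::size_t j = 1;
        for ( ; j < n && i + j < s.size()
                && (static_cast<unsigned char>(s[i + j]) & 0xC0) == 0x80; ++j )
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + j]) & 0x3F);
        if (j != n)
            cp = 0xFFFD;                            // truncated sequence
        out += cp;
        i += j;                                     // resynchronize
    }
    return out;
}
```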

--
James Kanze
From: James Kanze on
On 18 May, 06:40, Joshua Maurice <joshuamaur...(a)gmail.com> wrote:
> On May 17, 8:35 pm, Peter Olcott <NoS...(a)OCR4Screen.com> wrote:
> > The finite state machine is semantically identical to the
> > regular expression. I am somewhat enamored with DFA
> > recognizers. I love them. I will post the source code when
> > it is completed.

> I'd love to see it (and test it). I'd be a little surprised
> that anything using something called "regular expressions"
> could beat the code I just posted in terms of speed or
> executable size. (Unless perhaps the regex library itself
> outputted c code, which was then compiled down. I could
> probably buy it then.)

A strict regular expression, without extensions (like capture
and such) defines a DFA. You can compile it to one fairly
simply, then output the tables as C++ code. (My own regular
expression classes support this.) It will still normally
require indexing into two tables (one with two dimensions) per
byte. That's two or three memory accesses, in addition to
accessing the character. As you doubtless know, UTF-8 was
designed expressly so that this is not necessary; the byte
itself carries all the necessary information, so no further
accesses are needed. (I find it faster to use a table
lookup than to calculate the sequence length manually, but
that's one 256-byte table, indexed by the first byte.)

And as we know, memory bandwidth is usually the limiting factor
for CPU-based code.
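The two lookups per byte can be seen in a toy table-driven DFA. This is not a UTF-8 machine, just an illustration of the pattern with a trivial language (even number of 'a's); all names are made up for the sketch:

```cpp
#include <cstddef>

namespace {
    const int reject = 2;              // dead state
    unsigned char charClass[256];      // lookup 1: byte -> equivalence class
    int transition[3][3];              // lookup 2: [state][class] -> state

    // Fill the tables once at start-up; a regex compiler would
    // emit these as static initialized data instead.
    struct Init {
        Init() {
            for (int i = 0; i < 256; ++i) charClass[i] = 2;  // "other"
            charClass['a'] = 0;
            charClass['b'] = 1;
            // state 0: even count of 'a' (accepting); state 1: odd
            transition[0][0] = 1; transition[0][1] = 0; transition[0][2] = reject;
            transition[1][0] = 0; transition[1][1] = 1; transition[1][2] = reject;
            transition[2][0] = reject; transition[2][1] = reject; transition[2][2] = reject;
        }
    } init;
}

// Accepts strings over {a,b} containing an even number of 'a's.
// Note the two memory accesses per input byte, in addition to
// reading the byte itself -- exactly the cost described above.
bool evenAs(const char* p)
{
    int state = 0;
    for ( ; *p != '\0'; ++p)
        state = transition[state][charClass[static_cast<unsigned char>(*p)]];
    return state == 0;
}
```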

> > A few years ago I beat both Microsoft and Borland's
> std::string by more than 100-fold. I posted the
> code to this group. It was called FastString if anyone
> > cares to look this up.

> Beat them in what? What operations? I hope you aren't
> comparing copy-on-write string to a copy-on-copy
> implementation. (Is there a better name for not "copy on
> write"?)

I think it's usually called deep copy.

Note that if you accept dropping some of the functionality (like
being able to modify the string through an iterator or an index
operator), you can make your string class a lot faster.

--
James Kanze
From: James Kanze on
On 17 May, 22:58, Joshua Maurice <joshuamaur...(a)gmail.com> wrote:
> On May 17, 12:25 pm, I V <ivle...(a)gmail.com> wrote:

[...]
> Can't you just do something like the following? I understand that it is
> a finite state machine in fact, but it uses no frameworks, no regular
> expressions, etc. I'd expect that this is pretty good in terms of speed and
> readability. It would be quite simple to add some code using bit operations
> to convert from the utf8 array to Unicode code points.

> //COMPLETELY UNTESTED
> bool validate_utf8(unsigned char * utf8str_start,
>                    unsigned char * utf8str_end)
> {
>     for (unsigned char * x = utf8str_start; x != utf8str_end; )
>     {
>         if ((*x & 0x80) == 0)
>         {
>             ++x;
>         }
>         else if ((*x & (0x80 + 0x40 + 0x20)) == (0x80 + 0x40))
>         {
>             if (++x == utf8str_end || (*x & (0x80 + 0x40)) != (0x80))
>                 return false;
>             ++x;
>         }
>         else if ((*x & (0x80 + 0x40 + 0x20 + 0x10)) == (0x80 + 0x40 + 0x20))
>         {
>             if (++x == utf8str_end || (*x & (0x80 + 0x40)) != (0x80))
>                 return false;
>             if (++x == utf8str_end || (*x & (0x80 + 0x40)) != (0x80))
>                 return false;
>             ++x;
>         }
>         else if ((*x & (0x80 + 0x40 + 0x20 + 0x10 + 0x08))
>                  == (0x80 + 0x40 + 0x20 + 0x10))
>         {
>             if (++x == utf8str_end || (*x & (0x80 + 0x40)) != (0x80))
>                 return false;
>             if (++x == utf8str_end || (*x & (0x80 + 0x40)) != (0x80))
>                 return false;
>             if (++x == utf8str_end || (*x & (0x80 + 0x40)) != (0x80))
>                 return false;
>             ++x;
>         }
>         else
>             return false;
>     }
>     return true;
> }

First, this doesn't actually "validate" UTF-8, since it accepts
non-minimal sequences, encodings for surrogates, etc. I'm not
sure that this is really an issue, however, since you'd normally
only do such validation on input (when IO speed dominates). For
the rest, I'd still use a loop:

    int byteCount = byteCountTable[*p];
    if (byteCount == 0) {
        error...
    } else {
        ++ p;
        -- byteCount;
        while (byteCount != 0) {
            if ((*p & 0xC0) != 0x80) {
                error...
            }
            ++ p;
            -- byteCount;
        }
    }

I don't know if it's faster or slower than yours (although both
are almost certainly faster than a DFA), but you can't get much
simpler. (Note that you can easily modify it to generate UTF-32
on the fly, almost as quickly, thus killing two birds with one
stone.)

My complete conversion routine is:

//!@cond implementation
struct EncodingInfos
{
    CodePoint limitValue ;
    Byte firstBytePrefix ;
    Byte firstByteMask ;
} ;
extern EncodingInfos const infoTable[ 7 ] ;
extern size_t const byteCountTable[] ;

Byte const nextBytePrefix = 0x80U ;
Byte const nextByteDataMask = 0x3FU ;
Byte const nextBytePrefixMask = 0xC0U ;
int const nextByteShift = 6 ;
//!@endcond

template< typename InputIterator >
CodePoint
codePoint(
    InputIterator begin,
    InputIterator end )
{
    size_t byteCount = begin != end
                       ? size( *begin )
                       : 0 ;
    EncodingInfos const* const info = infoTable + byteCount ;
    CodePoint result = byteCount > 0
                       ? *begin ++ & info->firstByteMask
                       : error ;

    while ( result != error && -- byteCount > 0 ) {
        if ( begin == end
                || (*begin & nextBytePrefixMask) != nextBytePrefix ) {
            result = error ;
        } else {
            result = (result << nextByteShift)
                     | (*begin ++ & nextByteDataMask) ;
        }
    }
    if ( result != error ) {
        if ( result < (info - 1)->limitValue
                || result >= info->limitValue ) {
            result = error ;
        }
    }
    return result ;
}

Note that this still lets encodings for surrogates through, but
it catches most of the rest. (And testing for surrogates is
simple once you have the UTF-32.)
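That final test amounts to a pair of range comparisons on the decoded value (a sketch; `isScalarValue` is an illustrative name):

```cpp
// Once the code point is available as UTF-32, rejecting encoded
// surrogates (U+D800..U+DFFF) and out-of-range values is trivial.
// Unicode calls the values that pass "scalar values".
inline bool isScalarValue(unsigned long cp)
{
    return cp <= 0x10FFFFUL
        && !(cp >= 0xD800UL && cp <= 0xDFFFUL);
}
```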

--
James Kanze
From: Joseph M. Newcomer on
I worked with lex many years ago. It is possible to compile the DFA into executable code
that is blindingly fast, but I think even someone like Nigel Horspool, who wrote the
world's fastest parser close to 20 years ago (a million lines a minute on a Mac II
(68030)) might have been hard-pressed to implement that complex regexp in 20 clock cycles,
that is, no more than 40 instructions per complete match. Nigel (University of Victoria,
BC) is possibly the world's expert on fast parsing. His code was generated automatically
from lex and yacc output.
joe

On Tue, 18 May 2010 16:26:27 +0200, Oliver Regenfelder <oliver.regenfelder(a)gmx.at> wrote:

>Hello,
>
>Peter Olcott wrote:
>> I completed the detailed design on the DFA that would validate and
>> translate any valid UTF-8 byte sequence into UTF-32. It can not be done
>> faster or simpler. The state transition matrix only takes exactly 2 KB.
>
>Who cares about DFAs and state transition matrix sizes when all you want
>to do is convert UTF-8 to UTF-32? That's just some if/else and switch
>statements in your programming language of choice, plus error handling.
>
>Best regards,
>
>Oliver
Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm