Is this Regular Expression for UTF-8 Correct?? [MFC]

From: Mihai N. on 15 May 2010 06:21

> Do you know anywhere where I can get a table that maps all
> of the code points to their category?

ftp://ftp.unicode.org/Public/5.2.0/ucd

UnicodeData.txt
The main guide for that is ftp://ftp.unicode.org/Public/5.1.0/ucd/UCD.html
(if you don't want to go through the standard, which is the advisable thing)

And when you bump your head, remember that Joe and I warned you about UTF-8.
It was not designed for this kind of usage.

--
Mihai Nita [Microsoft MVP, Visual C++]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email

From: Peter Olcott on 15 May 2010 10:12

"Mihai N." <nmihai_year_2000(a)yahoo.com> wrote in message
news:Xns9D7922352F422MihaiN(a)207.46.248.16...
>
>> Do you know anywhere where I can get a table that maps
>> all
>> of the code points to their category?
>
> ftp://ftp.unicode.org/Public/5.2.0/ucd
>
> UnicodeData.txt
> The main guide for that is
> ftp://ftp.unicode.org/Public/5.1.0/ucd/UCD.html
> (if you don't want to go through the standard, which is the
> advisable thing)
>
> And when you bump your head, remember that Joe and I warned
> you about UTF-8.
> It was not designed for this kind of usage.
>
>
Joe also said that UTF-8 was designed for data interchange
which is how I will be using it. Joe also falsely assumed
that I would be using UTF-8 for my internal representation.
I will be using UTF-32 for my internal representation.

I will be using UTF-8 as the source-code encoding for my
language interpreter, which has the advantage of simply being
ASCII for the English language, and of working across every
platform without requiring adaptations for Little Endian and
Big Endian byte order. UTF-8 will also be the output of my
OCR4Screen DFA recognizer.
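
A minimal sketch of the arrangement described above (UTF-8 at the edges,
UTF-32 internally), assuming a strict decoder is wanted. The names
utf8_to_utf32 and utf8_demo are hypothetical and do not come from this
thread; the sketch rejects overlong forms, surrogates, and values above
U+10FFFF rather than repairing them.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decode a UTF-8 byte string into UTF-32 code points.
// Throws std::runtime_error on any malformed input.
std::u32string utf8_to_utf32(const std::string& in)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;      // code point being assembled
        std::size_t len = 0;  // total bytes in this sequence
        char32_t min = 0;     // smallest code point legal for this length
        if (b0 < 0x80)                { cp = b0;        len = 1; min = 0; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; len = 2; min = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; len = 3; min = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; len = 4; min = 0x10000; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (i + len > in.size())
            throw std::runtime_error("truncated UTF-8 sequence");

        // Each continuation byte must look like 10xxxxxx and adds six bits.
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (b & 0x3F);
        }

        // Reject overlong encodings, surrogates, and out-of-range values.
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::runtime_error("invalid code point in UTF-8 input");

        out.push_back(cp);
        i += len;
    }
    return out;
}

// Small usage check: ASCII passes through, and C3 A9 decodes to U+00E9.
void utf8_demo()
{
    assert(utf8_to_utf32("abc") == U"abc");
    assert(utf8_to_utf32("\xC3\xA9") == std::u32string(1, char32_t(0xE9)));
}
```

Byte order only becomes a concern again if the UTF-32 data is written out;
in memory each char32_t is simply a native integer.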

>
> --
> Mihai Nita [Microsoft MVP, Visual C++]
> http://www.mihai-nita.net
> ------------------------------------------
> Replace _year_ with _ to get the real email
>

From: Peter Olcott on 15 May 2010 11:08

"Mihai N." <nmihai_year_2000(a)yahoo.com> wrote in message
news:Xns9D7922352F422MihaiN(a)207.46.248.16...
>
>> Do you know anywhere where I can get a table that maps
>> all
>> of the code points to their category?
>
> ftp://ftp.unicode.org/Public/5.2.0/ucd
>

What I am looking for is a mapping from Unicode code points
(compressed into code point ranges when possible) to General
Category values as two-character abbreviations. I will look
through this first link to see if I can find this. Initially I
saw a lot of things that were not this.
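
A rough sketch of how such a range table could be built from
UnicodeData.txt, assuming its semicolon-separated layout (field 0 is the
code point in hex, field 1 the name, field 2 the General Category) and its
"<..., First>" / "<..., Last>" line pairs for large ranges. The names
CategoryRange, load_category_ranges, category_of, and category_demo are
hypothetical. Code points not listed in the file are reported as "Cn"
(unassigned), and the input is assumed to be sorted and well formed.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

struct CategoryRange {
    std::uint32_t first;
    std::uint32_t last;
    std::string category;  // two-letter General Category, e.g. "Lu"
};

// Read lines in UnicodeData.txt layout and merge adjacent code points that
// share a category into a single range.
std::vector<CategoryRange> load_category_ranges(std::istream& in)
{
    std::vector<CategoryRange> ranges;
    std::string line;
    bool open_range = false;  // true between a "First" line and its "Last" line
    while (std::getline(in, line)) {
        const std::size_t s1 = line.find(';');
        if (s1 == std::string::npos) continue;  // blank or unexpected line
        const std::size_t s2 = line.find(';', s1 + 1);
        if (s2 == std::string::npos) continue;
        std::size_t s3 = line.find(';', s2 + 1);
        if (s3 == std::string::npos) s3 = line.size();

        const std::uint32_t cp = static_cast<std::uint32_t>(
            std::stoul(line.substr(0, s1), nullptr, 16));
        const std::string name = line.substr(s1 + 1, s2 - s1 - 1);
        const std::string cat = line.substr(s2 + 1, s3 - s2 - 1);

        if (open_range) {  // this is the "Last" line of a First/Last pair
            ranges.back().last = cp;
            open_range = false;
            continue;
        }
        if (!ranges.empty() && ranges.back().category == cat &&
            ranges.back().last + 1 == cp) {
            ranges.back().last = cp;  // extend the current range
        } else {
            ranges.push_back({cp, cp, cat});
        }
        open_range = name.size() >= 8 &&
                     name.compare(name.size() - 8, 8, ", First>") == 0;
    }
    return ranges;
}

// Binary search over the sorted ranges; unlisted code points yield "Cn".
std::string category_of(const std::vector<CategoryRange>& ranges, std::uint32_t cp)
{
    auto it = std::upper_bound(
        ranges.begin(), ranges.end(), cp,
        [](std::uint32_t value, const CategoryRange& r) { return value < r.first; });
    if (it == ranges.begin()) return "Cn";
    --it;
    return cp <= it->last ? it->category : "Cn";
}

// Usage check on a short illustrative excerpt in the same layout.
void category_demo()
{
    std::istringstream data(
        "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;\n"
        "0042;LATIN CAPITAL LETTER B;Lu;0;L;;;;;N;;;;0062;\n"
        "4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;\n"
        "9FCB;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;\n");
    const std::vector<CategoryRange> r = load_category_ranges(data);
    assert(r.size() == 2);
    assert(category_of(r, 0x42) == "Lu");
    assert(category_of(r, 0x5000) == "Lo");
}
```

Merging by category keeps the table far smaller than one entry per code
point, and a lookup costs one binary search.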

> UnicodeData.txt
> The main guide for that is
> ftp://ftp.unicode.org/Public/5.1.0/ucd/UCD.html
> (if you don't want to go through the standard, which is the
> advisable thing)
>
> And when you bump your head, remember that Joe and I warned
> you about UTF-8.
> It was not designed for this kind of usage.
>
>
>
> --
> Mihai Nita [Microsoft MVP, Visual C++]
> http://www.mihai-nita.net
> ------------------------------------------
> Replace _year_ with _ to get the real email
>

From: Peter Olcott on 15 May 2010 11:48

From: Joseph M. Newcomer on 17 May 2010 00:03

How about: a non-answer is a substitute for "this is the most incredibly stupid idea I have
seen in decades, and I'm not going to waste my time pointing out the obvious silliness of
it"?

You are again spending massive effort to solve an artificial problem of your own creation,
caused by making poor initial design choices, and supported by nonsensical
rationalizations. A professional programmer knows certain patterns (that is our
strength!) and among these are the recognition that if you have to implement complex
solutions to simple problems, you have made a bad design choice and are best served by
re-examining the design choices and making design choices that eliminate the need for
complex solutions, particularly when the complexity simply goes away if a different set of
solutions is postulated.

Personally, if I had to do a complex parser design, I'd want to eliminate the need to deal
with UTF-16 surrogates, and I'd write my code in terms of UTF-32. Much simpler, and
isolates the complexity at the input and output edges, not making it uniformly
distributed throughout the code. And I'd know not to make childish decisions such as "it
costs too much to do the conversion" because I outgrew those kinds of arguments certainly
by 1980 (that's thirty years ago). My first instance of this was a typesetting program I
did around 1970 where I stored the text as 9-bit rather than 7-bit bytes because I could
encode font information more readily in the upper two bits. And I didn't even CONSIDER the
size and performance issues of 9-bit vs. 7-bit bytes because I knew they didn't matter in
the slightest. So I guess I learned this lesson 40 years ago. It greatly simplified the
internal coding.

But you are sounding like a first-semester programmer who was taught by some old PDP-11
programmer, and I don't buy either the size or the conversion performance arguments. You
don't even have NUMBERS to argue your position! Optimization decisions that are argued
without quantitative supporting measurements are almost always wrong. But we've had this
discussion before, and your view is "My mind is made up, don't require me to get FACTS to
support my decision!" In the Real World, before we can justify wasting lots of programmer
time to implement bad decisions, we require justification. But maybe that's just my
project management experience talking. Horrible, this dependence on reality that I have.

If someone came to me with such a design, and was as insistent as you will be, my first
requirement would be "Write a program that reads UTF-8 files of the expected size, then
writes them back out. Measure its performance reading several dozen different files, and
run each experiment 100 times, measuring the time-to-completion". Then "modify the
program to convert the data to UTF-16, convert it back to UTF-8, and run the same
experiment set. Demonstrate that the change in the mean time is statistically
significant". Hell, the variation of LOADING the PROGRAM is going to differ from
experiment to experiment by a variance several orders of magnitude greater than the
conversion cost! So don't try to make the case that the conversion cost matters; the
truth, based on actual performance measurements end-to-end, is that it does not. But, not
having actually done performance measurement, you don't understand that. Those of us who
devoted nontrivial parts of our lives to optimizing program performance KNOW what the
problems are, and know that the conversion cannot possibly matter.
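
A minimal sketch of a harness for the experiment described above, assuming
each variant (plain read/write, and read/convert/convert back/write) is
wrapped in a callable. The names Stats, summarize, time_runs, and welch_t
are hypothetical. Welch's t statistic is one common way to compare two
means; it is swapped in here for illustration and is not a method named in
this thread.

```cpp
#include <cassert>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <vector>

struct Stats {
    double mean;      // sample mean, in seconds
    double variance;  // unbiased sample variance
    std::size_t n;    // number of runs
};

// Mean and sample variance of a set of measurements (needs at least two).
Stats summarize(const std::vector<double>& xs)
{
    assert(xs.size() >= 2);
    double sum = 0.0;
    for (double x : xs) sum += x;
    const double mean = sum / xs.size();
    double sq = 0.0;
    for (double x : xs) sq += (x - mean) * (x - mean);
    return {mean, sq / (xs.size() - 1), xs.size()};
}

// Run `work` the requested number of times and time each run end to end.
template <typename Work>
Stats time_runs(Work&& work, std::size_t runs)
{
    std::vector<double> seconds;
    seconds.reserve(runs);
    for (std::size_t i = 0; i < runs; ++i) {
        const auto start = std::chrono::steady_clock::now();
        work();
        const auto stop = std::chrono::steady_clock::now();
        seconds.push_back(std::chrono::duration<double>(stop - start).count());
    }
    return summarize(seconds);
}

// Welch's t statistic for the difference in means (b minus a).
// The result is not finite if both variances are zero.
double welch_t(const Stats& a, const Stats& b)
{
    const double se = std::sqrt(a.variance / a.n + b.variance / b.n);
    return (b.mean - a.mean) / se;
}
```

In use, time_runs would be called once per variant with 100 runs each, and
a |t| near zero would indicate that the conversion cost is lost in the
run-to-run noise.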
joe
*****
On Fri, 14 May 2010 11:53:30 -0500, "Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote:

>
>"Pete Delgado" <Peter.Delgado(a)NoSpam.com> wrote in message
>news:uU4O0P48KHA.1892(a)TK2MSFTNGP05.phx.gbl...
>>
>> "Joseph M. Newcomer" <newcomer(a)flounder.com> wrote in
>> message news:jfvou5ll41i9818ub01a4mgbfvetg4giu1(a)4ax.com...
>>> Actually, what it does is give us another opportunity to
>>> point out how really bad this
>>> design choice is, and thus Peter can tell us all we are
>>> fools for not answering a question
>>> that should never have been asked, not because it is
>>> inappropriate for the group, but
>>> because it represents the worst-possible-design decision
>>> that could be made.
>>> joe
>>
>> Come on Joe, give Mr. Olcott some credit. I'm sure that he
>> could dream up an even worse design as he did with his OCR
>> project once he is given (and ignores) input from the
>> professionals whose input he claims to seek. ;)
>>
>>
>> -Pete
>>
>>
>
>Most often I am not looking for "input from professionals",
>I am looking for answers to specific questions.
>
>I now realize that every non-answer response tends to be a
>mask for the true answer of "I don't know".
>
Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
