From: David Schwartz on
On May 13, 3:04 pm, "Peter Olcott" <NoS...(a)OCR4Screen.com> wrote:

> What does this question have to do with the C++ language?
>
> At least my question is indirectly related to C++ by making
> a utf8string for the C++ language from the regular
> expression.
>
> Your question is not even indirectly related to the C++
> language.

Unfortunately, no better way is known to keep conversations on topic.
If you know a better way, we'd all love to hear it. If you don't
respond immediately in the forum and point out that something is off
topic, other people browsing the forum will think the question was on
topic. Other ways have been tried in the past (such as private mails
where possible, monthly posts about topicality rather than replying to
each off-topic post, and so on). None have been shown to be effective.

Painful experience has shown that the most effective technique is to
verbally berate and ridicule people who post off topic. Thus others
will see the negative response by the group and not want their posts
to be met with a similar response.

Again, this wasn't anyone's first choice, and if you know a better
way, please tell us. (In the appropriate forum, of course!)

DS

From: Peter Olcott on
On 5/16/2010 11:03 PM, Joseph M. Newcomer wrote:
> How about a non-answer is a substitute for "this is the most incredibly stupid idea I have
> seen in decades, and I'm not going to waste my time pointing out the obvious silliness of
> it"?
>
> You are again spending massive effort to solve an artificial problem of your own creation,
> caused by making poor initial design choices, and supported by nonsensical
> rationalizations. A professional programmer knows certain patterns (that is our
> strength!) and among these are the recognition that if you have to implement complex
> solutions to simple problems, you have made a bad design choice and are best served by
> re-examining the design choices and making design choices that eliminate the need for
> complex solutions, particularly when the complexity simply goes away if a different set of
> solutions is postulated.
>
> Personally, if I had to do a complex parser design, I'd want to eliminate the need to deal
> with UTF-16 surrogates, and I'd write my code in terms of UTF-32. Much simpler, and
> isolates the complexity at the input and output edges, not making it uniformly
> distributed throughout the code. And I'd know not to make childish decisions such as "it
> costs too much to do the conversion" because I outgrew those kinds of arguments certainly
> by 1980 (that's thirty years ago). My first instance of this was a typesetting program I
> did around 1970 where I stored the text as 9-bit rather than 7-bit bytes because I could
> encode font information more readily in the upper two bits. And I didn't even CONSIDER the
> size and performance issues of 9-bit vs. 7-bit bytes because I knew they didn't matter in
> the slightest. So I guess I learned this lesson 40 years ago. It greatly simplified the
> internal coding.
>
> But you are sounding like a first-semester programmer who was taught by some old PDP-11
> programmer, and I don't buy either the size or the conversion performance arguments. You
> don't even have NUMBERS to argue your position! Optimization decisions that are argued
> without quantitative supporting measurements are almost always wrong. But we've had this
> discussion before, and your view is "My mind is made up, don't require me to get FACTS to
> support my decision!" In the Real World, before we can justify wasting lots of programmer
> time to implement bad decisions, we require justification. But maybe that's just my
> project management experience talking. Horrible, this dependence on reality that I have.
>
> If someone came to me with such a design, and was as insistent as you will be, my first
> requirement would be "Write a program that reads UTF-8 files of the expected size, then
> writes them back out. Measure its performance reading several dozen different files, and
> run each experiment 100 times, measuring the time-to-completion". Then "modify the
> program to convert the data to UTF-16, convert it back to UTF-8, and run the same
> experiment set. Demonstrate that the change in the mean time is statistically
> significant". Hell, the variation of LOADING the PROGRAM is going to differ from
> experiment to experiment by a variance several orders of magnitude greater than the
> conversion cost! So don't try to make the case that the conversion cost matters; the
> truth, based on actual performance measurements end-to-end, is that it does not. But, not
> having actually done performance measurement, you don't understand that. Those of us who
> devoted nontrivial parts of our lives to optimizing program performance KNOW what the
> problems are, and know that the conversion cannot possibly matter.
> joe
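
The measurement proposed above can be sketched roughly as follows. This is only a minimal illustration, not code from anyone in the thread: the names utf8_to_utf16, utf16_to_utf8, Stats and time_runs are hypothetical, the converters assume well-formed input, and the sketch covers only the in-memory round trip. File reading and writing would be timed with the same helper and compared against it.

```cpp
#include <cassert>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: decode UTF-8 into UTF-16 code units.
// Assumes well-formed input; it performs no validation.
std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size();) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t len;
        if (lead < 0x80)      { cp = lead;        len = 1; }
        else if (lead < 0xE0) { cp = lead & 0x1F; len = 2; }
        else if (lead < 0xF0) { cp = lead & 0x0F; len = 3; }
        else                  { cp = lead & 0x07; len = 4; }
        for (std::size_t k = 1; k < len && i + k < in.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
        i += len;
        if (cp >= 0x10000) {  // needs a surrogate pair
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        } else {
            out.push_back(static_cast<char16_t>(cp));
        }
    }
    return out;
}

// Hypothetical helper: encode UTF-16 back into UTF-8 (same assumption).
std::string utf16_to_utf8(const std::u16string& in) {
    std::string out;
    out.reserve(in.size() * 3);
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size())
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[++i] - 0xDC00);
        if (cp < 0x80) {
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}

struct Stats { double mean; double stddev; };

// Run `work` `runs` times and report mean and sample standard deviation
// of the wall-clock time per run, in seconds.
template <class F>
Stats time_runs(F&& work, int runs) {
    std::vector<double> t;
    for (int r = 0; r < runs; ++r) {
        const auto start = std::chrono::steady_clock::now();
        work();
        const std::chrono::duration<double> d =
            std::chrono::steady_clock::now() - start;
        t.push_back(d.count());
    }
    double mean = 0.0;
    for (double x : t) mean += x;
    if (!t.empty()) mean /= static_cast<double>(t.size());
    double var = 0.0;
    for (double x : t) var += (x - mean) * (x - mean);
    if (t.size() > 1) var /= static_cast<double>(t.size() - 1);
    return Stats{mean, std::sqrt(var)};
}
```

Calling time_runs once with a lambda that only loads the file and once with a lambda that also performs the round trip, 100 runs each, gives two means and two standard deviations to compare. If the difference in means is small next to the spread of the load-only runs, the conversion cost is lost in the noise, which is the outcome predicted above.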

You probably have a point here. My "devil's advocate" counter argument
is showing up all of the nuances of the alternative design decisions.

Where am I going to be able to talk to you when Microsoft shuts down the
microsoft.public.* hierarchy?

> *****
> joe
> ****
> On Fri, 14 May 2010 11:53:30 -0500, "Peter Olcott"<NoSpam(a)OCR4Screen.com> wrote:
>
>>
>> "Pete Delgado"<Peter.Delgado(a)NoSpam.com> wrote in message
>> news:uU4O0P48KHA.1892(a)TK2MSFTNGP05.phx.gbl...
>>>
>>> "Joseph M. Newcomer"<newcomer(a)flounder.com> wrote in
>>> message news:jfvou5ll41i9818ub01a4mgbfvetg4giu1(a)4ax.com...
>>>> Actually, what it does is give us another opportunity to
>>>> point out how really bad this
>>>> design choice is, and thus Peter can tell us all we are
>>>> fools for not answering a question
>>>> that should never have been asked, not because it is
>>>> inappropriate for the group, but
>>>> because it represents the worst-possible-design decision
>>>> that could be made.
>>>> joe
>>>
>>> Come on Joe, give Mr. Olcott some credit. I'm sure that he
>>> could dream up an even worse design as he did with his OCR
>>> project once he is given (and ignores) input from the
>>> professionals whose input he claims to seek. ;)
>>>
>>>
>>> -Pete
>>>
>>>
>>
>> Most often I am not looking for "input from professionals",
>> I am looking for answers to specific questions.
>>
>> I now realize that every non-answer response tends to be a
>> mask for the true answer of "I don't know".
>>
> Joseph M. Newcomer [MVP]
> email: newcomer(a)flounder.com
> Web: http://www.flounder.com
> MVP Tips: http://www.flounder.com/mvp_tips.htm

From: Peter Olcott on
On 5/16/2010 11:28 PM, Joseph M. Newcomer wrote:
> See below...
> On Fri, 14 May 2010 13:44:56 -0500, "Peter Olcott"<NoSpam(a)OCR4Screen.com> wrote:
>
>>
>> "Pete Delgado"<Peter.Delgado(a)NoSpam.com> wrote in message
>> news:O8vhKE58KHA.980(a)TK2MSFTNGP04.phx.gbl...
>>>
>>>> Most often I am not looking for "input from
>>>> professionals", I am looking for answers to specific
>>>> questions.
>>>
>>> Which is one reason why your projects consistently fail.
>>> If you have a few
>>
>> None of my projects have ever failed. Some of my projects
>> inherently take an enormous amount of time to complete.
> ****
> Not something you should brag about. Going back to my original comments, you are creating
> an artificially complex solution to what should be a simple problem, by making bad design
> choices and then warping reality to support them, when the correct answer is "Don't do it
> that way". If you simplify the problem, you get to make decisions which can be
> implemented more readily, thus decreasing the amount of time required to complete them.
> ****
>>
>>> days, take a look at the book "Programming Pearls" by Jon
>>> Bentley -specifically the first chapter. Sometimes making
>>> sure you are asking the *right* question is more important
>>> than getting an answer to a question. You seem to have a
>>> problem with that particular concept.
>>
>> Yes especially on those cases where I have already thought
>> the problem through completely using categorically
>> exhaustively complete reasoning.
> *****
> There is no such thing in the world we live in. You have made a number of false
> assumptions (for example, that conversion time is statistically significant relative to
> other performance issues) and used that set of false assumptions to drive a set of design
> decisions which make no sense if you take reality into consideration. For example, there
> is no possible way the UTF-8-UTF-16 conversion could possibly take longer to handle than a
> single page fault, but you are optimizing it out of existence without realizing that
> simply loading the program will have orders of magnitude greater variance than this cost.
> This is because you are working with the assumptions that (a) loading a program takes
> either zero time or a fixed time each time it is loaded and (b) opening the file you are
> reading takes either zero time or a fixed time each time it is opened. Sadly, neither of
> these assumptions is valid, and consequently if you run 100 experiments of loading and
> executing the program, these two parameters will dominate the total performance by orders
> of magnitude more than the cost of the conversion! So you are trying to optimize
> something that is statistically insignificant!
> ****
>>
>> In those rare instances anything at all besides a direct
>> answer to a direct question can only be a waste of time for
>> me.
> *****
> You want a direct answer: the design to use UTF-8 internally is a Really Stupid Idea!
> DON'T WASTE YOUR TIME TRYING TO DO IT! That's the DIRECT answer. Everything else is
> wasting our time trying to tell you in simple words that even you might understand just
> WHY it is a Really Stupid Idea.
>
> There is no point in trying to analyze the regexp because I cannot see why any
> intelligent programmer would WANT to use such a bad design! Therefore, it was a bad
> question and does not deserve getting an answer;

Ultimately all UTF-8 validators must be regular expressions implemented
as finite state machines. I can't imagine a better way.
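
For readers following along, here is a minimal sketch of what such a state machine could look like, assuming the well-formed byte ranges given in the Unicode Standard (Table 3-7) and RFC 3629. The names State, step and is_valid_utf8 are hypothetical, and this is one possible encoding of the ranges, not the regular expression discussed in the thread.

```cpp
#include <cassert>
#include <string>

// One state per "what may come next" situation in a UTF-8 sequence.
enum class State {
    Start,    // between characters
    Tail1,    // one generic continuation byte (80..BF) still expected
    Tail2,    // two generic continuation bytes still expected
    Tail3,    // three generic continuation bytes still expected
    AfterE0,  // next byte must be A0..BF (rejects overlong 3-byte forms)
    AfterED,  // next byte must be 80..9F (rejects UTF-16 surrogates)
    AfterF0,  // next byte must be 90..BF (rejects overlong 4-byte forms)
    AfterF4,  // next byte must be 80..8F (rejects values above U+10FFFF)
    Reject
};

// Transition function of the automaton: current state and input byte
// determine the next state.
State step(State s, unsigned char b) {
    auto in = [b](unsigned char lo, unsigned char hi) {
        return b >= lo && b <= hi;
    };
    switch (s) {
    case State::Start:
        if (b <= 0x7F) return State::Start;
        if (in(0xC2, 0xDF)) return State::Tail1;
        if (b == 0xE0) return State::AfterE0;
        if (in(0xE1, 0xEC) || in(0xEE, 0xEF)) return State::Tail2;
        if (b == 0xED) return State::AfterED;
        if (b == 0xF0) return State::AfterF0;
        if (in(0xF1, 0xF3)) return State::Tail3;
        if (b == 0xF4) return State::AfterF4;
        return State::Reject;  // 80..C1 and F5..FF can never start a character
    case State::Tail1:   return in(0x80, 0xBF) ? State::Start : State::Reject;
    case State::Tail2:   return in(0x80, 0xBF) ? State::Tail1 : State::Reject;
    case State::Tail3:   return in(0x80, 0xBF) ? State::Tail2 : State::Reject;
    case State::AfterE0: return in(0xA0, 0xBF) ? State::Tail1 : State::Reject;
    case State::AfterED: return in(0x80, 0x9F) ? State::Tail1 : State::Reject;
    case State::AfterF0: return in(0x90, 0xBF) ? State::Tail2 : State::Reject;
    case State::AfterF4: return in(0x80, 0x8F) ? State::Tail2 : State::Reject;
    case State::Reject:  break;
    }
    return State::Reject;
}

// A string is well-formed if the machine never rejects and ends
// between characters (no truncated sequence at the end).
bool is_valid_utf8(const std::string& s) {
    State st = State::Start;
    for (unsigned char b : s) {
        st = step(st, b);
        if (st == State::Reject) return false;
    }
    return st == State::Start;
}
```

Each state corresponds to one alternative in the equivalent regular expression, so a Lex pattern and this hand-written machine should accept the same byte strings if both follow the same table.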

> the correct answer is to do the job
> right. You have this fixation that if you pose what is clearly a bad design, we experts
> are supposed to sit back and encourage bad design decisions? That is not what we do.
>
> We feel a little bit like Calvin's dad from the old "Calvin and Hobbes" cartoons. Calvin
> comes over to his father and says "Dad, can I have a chain saw" and his father says "no".
> Calvin goes away feeling unhappy, and in the last of the four panels says "but now how am
> I going to learn how to juggle?"
>
> If you want to juggle chain saws, we aren't going to answer your questions on how to do
> it. We will try to advise you that juggling running chain saws is probably a Really
> Stupid Idea. If you were an experienced knife juggler, and could juggle flaming torches,
> we might suggest that there are approaches to this, but your idea that you can apply
> categorical reasoning to the problem of chain-saw juggling when you have clearly
> demonstrated by your question that you have never once juggled anything, makes us leery of
> encouraging you to continue this practice.
>
> Note that "categorical reasoning" does not turn into a deep understanding of fundamentally
> stochastic processes. Las Vegas casinos would love you, because you would try to apply
> this technique to, say, roulette wheels and dice, and guess who wins?
>
> Prove, by exhaustive categorical reasoning, that loading a program takes a fixed amount of
> time. Then I'll credit its power.
> joe
> ****
>>
>>

From: Peter Olcott on
On 5/16/2010 11:33 PM, Joseph M. Newcomer wrote:
> See below...
> On Fri, 14 May 2010 08:27:45 -0500, "Peter Olcott"<NoSpam(a)OCR4Screen.com> wrote:
>
>>
>> "Joseph M. Newcomer"<newcomer(a)flounder.com> wrote in
>> message news:68gpu599cjcsm3rjh1ptc6e9qu977smdph(a)4ax.com...
>>> No, an extremely verbose "You are going about this
>>> completely wrong".
>>> joe
>>
>> Which still avoids rather than answers my question. This was
>> at one time a very effective ruse to hide the fact that you
>> don't know the answer. I can see through this ruse now, so
>> there is no sense in my attempting to justify my design
>> decision to you. That would simply be a waste of time.
> ****
> I think I answered part of it. The part that matters. The part that says "this is
> wrong". I did this by pointing out some counterexamples.
>
> I know the answer: Don't Do It That Way. You are asking for a specific answer that will
> allow you to pursue a Really Bad Design Decision. I'm not going to answer a bad question;
> I'm going to tell you what the correct solution is. I'm avoiding the question because it
> is a really bad question, because you should be able to answer it yourself, and because
> giving an answer simply justifies a poor design. I don't justify poor designs, I try to
> kill them.
>
> Only you could make a bad design decision and feel you have to justify it. Particularly
> when the experts have already all told you it is a bad design decision, and you should not
> go that way.
> joe

If a decision is truly bad, then there must be dysfunctional results
that make the decision a bad one. If dysfunctional results can not be
provided, then the statement that it is a bad decision lacks sufficient
support. My original intention was to use UTF-32 as my internal
representation. I have not yet decided to alter this original decision.

The fact that someone provided an example where UTF-8 strings would
often substantially vary in length provides the best counter example
showing that your view is likely correct about internal representation.

In fact I will simply state that I am now convinced that UTF-32 is the
best way to go.

I still MUST have a correct UTF-8 RegEx because my interpreter is 75%
completed using Lex and Yacc. Besides this I need a good way to parse
UTF-8 to convert it to UTF-32.

From: Peter Olcott on
On 5/16/2010 11:39 PM, Joseph M. Newcomer wrote:
>>
>> That is how I intend to use it. To internationalize my GUI
>> scripting language the interpreter will accept UTF-8 input
>> as its source code files. It is substantially implemented
>> using Lex and Yacc specifications for "C" that have been
>> adapted to implement a subset of C++.
> *****
> So why does the question matter? Accepting UTF-8 input makes perfect sense, but the first
> thing you should do with it is convert it to UTF-16, or better still UTF-32.
> ****
>>
>> It was far easier (and far less error prone) to add the C++
>> that I needed to the "C" specification than it would have
>> been to remove what I do not need from the C++
>> specification.
> ***
> Huh? What's this got to do with the encoding?
(1) Lex requires a RegEx

(2) I still must convert from UTF-8 to UTF-32, and I don't think that a
faster or simpler way to do this besides a regular expression
implemented as a finite state machine can possibly exist.
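
As a rough illustration of point (2), a validating UTF-8 to UTF-32 converter can be written as a small loop that tracks how many continuation bytes are still expected and which range the next one must fall in. This is a sketch under assumptions, not code from the interpreter being discussed: the function name utf8_to_utf32 and its signature are hypothetical, and the byte ranges are those of the Unicode Standard's table of well-formed UTF-8 sequences.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Decode `in` as UTF-8 into UTF-32 code points stored in `out`.
// Returns false on the first ill-formed sequence (overlong form,
// surrogate, value above U+10FFFF, stray or missing continuation byte).
bool utf8_to_utf32(const std::string& in, std::u32string& out) {
    out.clear();
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i++]);
        if (lead <= 0x7F) {          // single-byte (ASCII) character
            out.push_back(lead);
            continue;
        }
        char32_t cp = 0;
        int tail = 0;                          // continuation bytes expected
        unsigned char lo = 0x80, hi = 0xBF;    // allowed range of the NEXT byte
        if (lead >= 0xC2 && lead <= 0xDF) {
            cp = lead & 0x1F; tail = 1;
        } else if (lead >= 0xE0 && lead <= 0xEF) {
            cp = lead & 0x0F; tail = 2;
            if (lead == 0xE0) lo = 0xA0;       // no overlong 3-byte forms
            if (lead == 0xED) hi = 0x9F;       // no surrogates D800..DFFF
        } else if (lead >= 0xF0 && lead <= 0xF4) {
            cp = lead & 0x07; tail = 3;
            if (lead == 0xF0) lo = 0x90;       // no overlong 4-byte forms
            if (lead == 0xF4) hi = 0x8F;       // nothing above U+10FFFF
        } else {
            return false;                      // 80..C1 or F5..FF as lead byte
        }
        for (int k = 0; k < tail; ++k) {
            if (i == in.size()) return false;  // truncated sequence
            const unsigned char b = static_cast<unsigned char>(in[i++]);
            if (b < lo || b > hi) return false;
            cp = (cp << 6) | (b & 0x3F);
            lo = 0x80; hi = 0xBF;              // only the first tail byte is restricted
        }
        out.push_back(cp);
    }
    return true;
}
```

The special cases on the first continuation byte are the same restrictions a correct UTF-8 regular expression has to encode, so the two approaches end up describing the same automaton.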


>> The actual language itself will store its strings as 32-bit
>> codepoints. The SymbolTable will not bother to convert its
>> strings from UTF-8. It turns out that UTF-8 byte sort order
>> is identical to Unicode code point sort order.
> ****
> Strange. I thought sort order was locale-specific and independent of code points. But
> then, maybe I just understand what is going on.

The SymbolTable only needs to be able to find its symbols in a std::map.
Accounting for locale specific sort order is a waste of time in this case.
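
The ordering claim can be spot-checked with a short sketch. It assumes C++11 or later, where std::string comparison treats each char as an unsigned byte, and it uses hypothetical names (encode_utf8, same_order, SymbolTable) that are not taken from the interpreter itself.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical helper: encode one code point as UTF-8.
// Assumes cp is a valid scalar value (not a surrogate, at most U+10FFFF).
std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
    return out;
}

// True when comparing the UTF-8 encodings byte by byte gives the same
// answer as comparing the code points numerically.
bool same_order(char32_t a, char32_t b) {
    return (a < b) == (encode_utf8(a) < encode_utf8(b));
}

// Hypothetical symbol table keyed directly on UTF-8 identifier bytes.
// The default std::less<std::string> is a plain byte-wise comparison.
using SymbolTable = std::map<std::string, int>;
```

Checking same_order at the encoding-length boundaries (U+007F vs U+0080, U+07FF vs U+0800, U+FFFF vs U+10000) is a quick sanity test. The same property does not hold for UTF-16 code units, where surrogate pairs sort below U+E000..U+FFFF. None of this produces a linguistically meaningful order, but for lookups in a std::map any consistent strict weak ordering is enough.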