From: Peter Olcott on
On 5/20/2010 1:54 AM, Hector Santos wrote:
> Good points P. Delgado. Let's also make a few major notes:
>
> - He has no products,
None that have been completed yet. A working prototype has existed for
many years.

> - He has no customers,
I will have one customer / reseller as soon as OCR4Screen is available.

> - He has no competition!
Wrong. I have four competitors, and none of them is even in the ballpark
of my speed or accuracy.

> - He has been at this for the past 9-10 years.
Since 1998.

> - He has the IQ of a Pre-Med student, hence he is smarter than us,
I have the IQ of a medical doctor, which means that I am about average
for that group. I am probably not as smart as Joe. I certainly have
much less knowledge than Joe.


> - His process once required 5 GB of resident PURE memory, then 3 GB,
> then 1.5 GB.
> - He wasn't familiar with memory virtualization, fragmentation
I have known about fragmentation since the beginning of my career. Joe
taught me a few nuances about virtual memory that I was unaware of.

Most of this entire issue was that I continued to talk about virtual
memory at a higher level of abstraction than the one that Joe was using.

>
> Remember that classic thread? When he was finally proven wrong he
> admitted that all his knowledge about an OS comes from a 25-year-old OS
> class book, and that he forgot to read the 2nd half because the final
> exam was cancelled, hence he didn't know about memory virtualization
> ideas. But he was going to catch up now. :)
>
> - He has no concept of threads even when provided thread based code,
This is not true.

> - He believes Multiple Queue/Multiple Servant FIFO is superior
MQMS will provide better performance in my case, depending upon how the
term is applied. I think that the whole issue here is that the term
MQMS was incorrectly applied to my case. You pointed this out once, and
I tend to agree. My point was that on a single-CPU machine with only
one core and no hyper-threading, adding threads will make each thread
run more slowly.

> - He invented the fastest string class in the world
And the developers of the Microsoft string class were able to match my
speed (from what I recall, not beat it) by studying my code and
implementing the same approach in their code. They also provided good
reasons why their original design was so much slower. At least one of
those reasons no longer applied at the time that I wrote FastString.

> - He wanted to make SQLITE3 behave like MYSQL, SEQUEL
I analyzed many alternatives for implementing fast fault tolerant
transactions including doing everything from scratch.

> - He wanted to use ISAM offset ideas for SQL records
I want to make transactions as fast as possible and thus referred back
to the fundamental architecture of database technology.

> - He wants to do all this in a single cpu computer.
From a business point of view this is a good idea. I am growing my own
capital from the ground up. Too many businesses sink far too much money
into unproven ventures. Many of these ventures might otherwise have become
successful if only they hadn't spent so much money so quickly.

> - He wants fault tolerants without DISK I/O.
I never said this; you misunderstood me here. The fundamental basis for
all fault tolerance is disk I/O.

>
> Did I miss anything? I'm pretty sure I did. Oh yeah..
>
> - He wants a secured computer at customer sites that no one can touch
> because they might steal his software.
>
> Did I mention he has no products? no customers? and no competitor?<g>
>
> Since 2006, his products would be available in the FALL and will be
> available as ActiveX, but oh yeah
>
> - He wants to use Linux with no GUI and in REAL TIME.
>
> But the Linux people don't seem to be too helpful, so he needs to come to
> the MFC forum because "this is where people answer his patent claim
> questions."

Joe is clearly brilliant and you were more helpful than anyone else
anywhere else pertaining to the design of the fundamental architecture
of my web application.

>
> Go Figure.
>
> --
> HLS
>
> On May 20, 12:13 am, "Pete Delgado"<Peter.Delg...(a)NoSpam.com> wrote:
>> "Peter Olcott"<NoS...(a)OCR4Screen.com> wrote in message
>>
>> news:5O2dnS2UptANt2nWnZ2dnUVZ_rqdnZ2d(a)giganews.com...
>>
>>> Here are the actual results from the working prototype of my original DFA
>>> based glyph recognition engine.
>>> http://www.ocr4screen.com/Unique.html
>>> The new algorithm is much better than this.
>>
>> The salient point that you fail to mention is that the alternative
>> solutions can perform OCR on *any* font, while your implementation requires
>> the customer to tell the OCR system which font (including all specifics such
>> as point size) is being used. In addition, the other systems can perform
>> when the font is not consistent in the document or when different font
>> weights are used; your implementation cannot, and will fail miserably.
>>
>> All in all, very misleading.
>>
>> PS: The information used in my critique of your OCR system was obtained by
>> looking at your prior posts as well as your patent and are not merely
>> conjecture.
>>
>> -Pete
>

From: Peter Olcott on
On 5/20/2010 4:49 AM, Oliver Regenfelder wrote:
> Hello,
>
> Peter Olcott wrote:
>> Yes and quite often with zero percent accuracy at screen resolutions.
>> The most accurate alternative system scored about 25% accuracy on the
>> sample image and was 872-fold slower.
>
> You're sure it wasn't 872.3-fold slower?
>
> Best regards,
>
> Oliver

Mine took 0.047 seconds using clock() **; theirs took 41 seconds +- 1
second using my wrist watch.

** I think that the resolution of clock() might be to the millisecond on
the Intel architecture. I do remember that it used to be 1/18 of a second
long ago.
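
For reference, a minimal sketch of this kind of clock() timing
(recognize() is a hypothetical stand-in for the call being measured):

#include <cstdio>
#include <ctime>

// Hypothetical stand-in for the recognition call being timed.
void recognize() { /* ... the work being measured ... */ }

int main()
{
    std::clock_t start = std::clock();
    recognize();
    std::clock_t stop = std::clock();

    // clock() returns processor time in ticks; CLOCKS_PER_SEC converts
    // to seconds. Tick granularity is implementation-defined: recent
    // Windows CRTs use CLOCKS_PER_SEC == 1000 (millisecond ticks), while
    // the old DOS timer ticked at 18.2 Hz, hence the 1/18-second figure.
    double seconds = double(stop - start) / CLOCKS_PER_SEC;
    std::printf("elapsed: %.3f s\n", seconds);
    return 0;
}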

From: Joseph M. Newcomer on
One of my standard presentation lines is

"What is the difference between a computer scientist, a newbie, and a software engineer?"

Sounds like a setup for a joke, but it isn't.

Some years ago, I had a project where I calculated the big-O complexity to be O(n^3)
(actually it was O(n^2) * O(m), but for most purposes m==n was the expected value).

Now, a computer scientist (me) looks at this and says "Wow! O(n^3). Bad! I need to
rethink this and design a new algorithm!" So I did, and it required having a pointer in
each node that allowed me to thread through the tree. Result: O(n). BIG improvement.

A newbie would not know about big-O performance.

An engineer (also me), said "Great. But that increases the size of each node, and the
pointer validity must be maintained under tree transformations which precede this semantic
check, and that's hard." We were running on a very small computer, right at the margins
of storage availability, and the project was a couple weeks behind schedule (I was also,
at that time, sysadmin for our site, and had just poured days down an administrative
rathole, with no end in sight). Maintaining the validity of the pointer (easy to
construct during the parse) during the tree transformations (hard) was going to add weeks
to the schedule. Not Acceptable. So I decided to see what the values of n and m were. It
took under fifteen minutes to add code to compute these values and display them. There
were hundreds of calls on this function for processing large grammars. The result:
n==m==1 for almost all cases; for a few cases, ==2, for a couple cases, ==3, and with our
largest formal grammar, ONE instance of ==4! So O(n^3) doesn't matter when n is very
small. That's engineering!
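
A minimal sketch of that kind of instrumentation (the names here are
hypothetical, not the actual project code): record the n seen on each
call, then dump a histogram at exit.

#include <cstdio>
#include <map>

static std::map<int, long> g_nHistogram;

void semanticCheck(int n /*, ... the real parameters ... */)
{
    ++g_nHistogram[n];           // the one line of instrumentation
    // ... the O(n^3) work itself ...
}

void dumpHistogram()             // call once, after processing a grammar
{
    for (std::map<int, long>::const_iterator it = g_nHistogram.begin();
         it != g_nHistogram.end(); ++it)
        std::printf("n == %d : %ld calls\n", it->first, it->second);
}

Fifteen minutes of this kind of counting can save weeks of rewriting.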

So design, even if big-O IS the same, does not prove anything until you know the
parameters of the O(f(n)) computation, as realized by ACTUAL DATA. One design where n >
100 can be quite different from another design where n==1, even though O(n) is identical
for both designs! Clearly, in the above example, I had two designs with quite different
f(n) for O(f(n)) but the results were essentially identical if n==1. By analogy, if I had
two designs with O(n^3) one could be excellent, and the other a total flaming disaster, if
they had values of n==1 and n > 100.

So while big-O is an important concept, it must be applied with judgment.

Also, remember that O(f(n)) means the real equation is k + C * f(n) + t where k is the
setup time for the computation, C is the constant of proportionality, and t is the
teardown time for the computation (often t==0 so we ignore it). In some cases, k and C
dominate performance, e.g., the string-compare example I've cited many times before, where
I found I was spending all my time in the equivalent of strcmp (C was HUGE), and when I
reduced C to 1 clock cycle, I got excellent performance of an f(n) = n log2 n algorithm.
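
One standard way to get a compare that cheap -- assuming an interning
scheme, which is a guess at the technique rather than the actual code --
is to store each distinct string exactly once, so that equality testing
collapses from strcmp (cost proportional to string length, a huge C) to
a single pointer comparison:

#include <cstring>
#include <set>
#include <string>

// Returns a canonical pointer for s; equal strings yield equal pointers.
const char* intern(const char* s)
{
    static std::set<std::string> table;   // not thread-safe; a sketch
    return table.insert(s).first->c_str();
}

bool sameSlow(const char* a, const char* b) { return std::strcmp(a, b) == 0; }
bool sameFast(const char* a, const char* b) { return a == b; } // interned only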

I have a multithreading example where k dominates for small values of n > 1 (the curve of
performance is interesting: it goes UP for a while as the number of threads increases,
then goes DOWN until the number of threads == the number of CPU cores, then starts slowly
going UP again as the number of threads increases beyond the number of cores).
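
A minimal sketch of that kind of experiment (the burn() workload is
hypothetical): split a fixed amount of work across k threads and time
the wall clock as k grows past the core count.

#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

// CPU-bound busywork; volatile keeps the loop from being optimized away.
void burn(long iterations)
{
    volatile long sink = 0;
    for (long i = 0; i < iterations; ++i) sink += i;
}

int main()
{
    const long total = 400000000L;       // fixed total work
    for (unsigned k = 1; k <= 16; ++k) {
        auto start = std::chrono::steady_clock::now();
        std::vector<std::thread> pool;
        for (unsigned t = 0; t < k; ++t)
            pool.emplace_back(burn, total / k);
        for (auto& th : pool)
            th.join();
        long long ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                           std::chrono::steady_clock::now() - start).count();
        std::printf("%2u threads: %lld ms\n", k, ms);
    }
    return 0;
}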

[Note that algorithm is not to be confused with "Think Green! Think Green! Think Green!",
which is the Al Gore Rhythm]
joe

On Thu, 20 May 2010 00:03:30 -0700, "Mihai N." <nmihai_year_2000(a)yahoo.com> wrote:

>
>> Design never proves anything with regards to speed, at least as
>> long as the big-O is the same.
>
>Not to mention that big-O tells you something only for relatively
>big values of 'n' (how big is 'big' depends on the algorithm).
Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
From: Joseph M. Newcomer on
See below...
On Wed, 19 May 2010 23:54:00 -0700, "Mihai N." <nmihai_year_2000(a)yahoo.com> wrote:

>
>> If your compiler defines wchar_t as 16 bits, then
>> it implies UTF-16 encoding
>
>Nope. wchar_t does not imply Unicode.
****
True. wchar_t is an implementation-defined wide-character type which can encode
according to arbitrary implementation-specified standards. I stand corrected.

The Microsoft C compiler, however, will interpret wchar_t literals as Unicode. And
since the discussion was about UTF encodings, wchar_t implies Unicode's UTF-16 in that
context. That's where the confusion was; the statement was made in the context of a
Unicode encoding discussion. But you're right; the formal definition of wchar_t claims
only that it has an implementation-defined width.
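
A two-line program makes the point concrete; the output is
implementation-defined by design:

#include <cstdio>
#include <cwchar>

int main()
{
    const wchar_t* s = L"caf\u00E9";   // a wide string literal
    // sizeof(wchar_t) is 2 with the Microsoft compiler (UTF-16 code
    // units) and typically 4 with gcc on Linux (UTF-32 code points).
    std::printf("sizeof(wchar_t) == %u, length == %u\n",
                static_cast<unsigned>(sizeof(wchar_t)),
                static_cast<unsigned>(std::wcslen(s)));
    return 0;
}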
****
>I think this is caused by the great reluctance of the C/C++ standards
>to refer to other standards. They try to be self-sufficient.
****
Which is probably a Good Thing.
****
>
>Happily enough, this seems to be changing lately (still too slow).
****
Note that they do refer to ISO 10646 (see the footnote on page 19 of the draft standard
(30) which has already been cited in this thread).
****
>
>
>> Well, the locale names are supposed to be the ISO standard
>> string designators
>
>From what I know, that is not specified anywhere in the C/C++ standard.
>A locale can be anything you want it to be.
>POSIX added something, but it is quite outdated.
****
Perhaps I'm remembering some other proposal; I just checked the C standard, and the only
locale actually supported by Standard C is "C". All other names are unspecified. So it
is up to the implementor to decide what is going on, thus impacting portability.
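
A minimal illustration (everything beyond "C" and "" is
implementation-defined, so the second call may or may not succeed):

#include <clocale>
#include <cstdio>

int main()
{
    // "C" is the only locale name Standard C guarantees; "" selects the
    // implementation's native environment locale, whatever that is.
    std::printf("\"C\"    -> %s\n", std::setlocale(LC_ALL, "C"));
    const char* native = std::setlocale(LC_ALL, "");
    std::printf("native -> %s\n", native ? native : "(failed)");
    return 0;
}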
****
>
>UTS-35 (Unicode Technical Standard #35, http://unicode.org/reports/tr35/)
>is the best thing right now. And you can use it with ICU (again, the best
>platform-independent solution for locale-aware support, though ICU has its
>own problems).
****
The key document is referenced as http://www.rfc-editor.org/rfc/bcp/bcp47.txt which is
actually RFC5646. This is a lengthy document but worth reading.

This is a case where the C language should reference other standards.
joe
****
Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
From: James Kanze on
On May 19, 6:45 pm, Peter Olcott <NoS...(a)OCR4Screen.com> wrote:
> On 5/19/2010 12:23 PM, James Kanze wrote:

[...]
> > And how do you act on an ActionCode? Switch statements and
> > indirect jumps are very, very slow on some machines (but not on
> > others).

> I could not imagine why they would ever be very slow. I know
> that they are much slower than an ordinary jump because of the
> infrastructure overhead. This is only about one order of
> magnitude or less.

On some machines (HP PA architecture, for example), any indirect
jump (at the assembler level) will purge the pipeline, resulting
in a considerable slowdown. And the classical implementation of
a dense switch uses a jump table, i.e. an indirect jump. (The
alternative involves a number of jumps, so may not be that fast
either.)

This is not universal, of course---I've not noticed it on
a Sparc, for example.
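
A sketch of the two classical code shapes a compiler chooses between
(the action codes here are hypothetical):

// Dense cases like 0..3 are classically compiled into a jump table:
// the selector indexes an array of code addresses and the CPU performs
// an indirect jump, which on some pipelines (e.g. HP PA) forces a flush.
int dispatch(int actionCode, int x)
{
    switch (actionCode) {
    case 0: return x + 1;
    case 1: return x - 1;
    case 2: return x * 2;
    case 3: return x / 2;
    default: return x;
    }
}

// For sparse cases the compiler may instead emit a chain (or binary
// tree) of compare-and-branch instructions: direct, predictable jumps,
// but several of them per dispatch.
int dispatchChained(int actionCode, int x)
{
    if (actionCode == 0) return x + 1;
    if (actionCode == 1) return x - 1;
    if (actionCode == 2) return x * 2;
    if (actionCode == 3) return x / 2;
    return x;
}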

--
James Kanze