From: Peter Olcott on
On 5/20/2010 12:52 PM, James Kanze wrote:
> On May 19, 6:45 pm, Peter Olcott<NoS...(a)OCR4Screen.com> wrote:
>> On 5/19/2010 12:23 PM, James Kanze wrote:
>
> [...]
>>> And how do you act on an ActionCode? Switch statements and
>>> indirect jumps are very, very slow on some machines (but not on
>>> others).
>
>> I could not imagine why they would ever be very slow. I know
>> that they are much slower than an ordinary jump because of the
>> infrastructure overhead. This is only about one order of
>> magnitude or less.
>
> On some machines (HP PA architecture, for example), any indirect
> jump (at the assembler level) will purge the pipeline, resulting
> in a considerable slowdown. And the classical implementation of
> a dense switch uses a jump table, i.e. an indirect jump. (The
> alternative involves a number of jumps, so may not be that fast
> either.)
>
> This is not universal, of course---I've not noticed it on
> a Sparc, for example.
>
> --
> James Kanze

Ah, now it makes much more sense. Back to cache locality of reference again.
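
To make the dispatch being discussed concrete, here is a minimal C++ sketch;
the ActionCode values and handlers are hypothetical, not taken from the actual
recognizer. A dense switch is typically lowered to a bounds check plus an
indirect jump through a compiler-generated table, while an explicit table of
function pointers makes the same indirect transfer visible in the source; on
chips that stall on indirect branches (the HP PA case above), both forms pay a
similar penalty.

#include <cstdint>

// Hypothetical action codes -- illustrative only.
enum ActionCode { kEmit, kAppend, kReset, kError };

int HandleEmit(int s)   { return s + 1; }
int HandleAppend(int s) { return s + 2; }
int HandleReset(int s)  { return 0; }
int HandleError(int s)  { return -1; }

// Dense switch: most compilers lower this to a bounds check plus
// an indirect jump through a compiler-generated jump table.
int DispatchSwitch(ActionCode code, int state)
{
    switch (code) {
    case kEmit:   return HandleEmit(state);
    case kAppend: return HandleAppend(state);
    case kReset:  return HandleReset(state);
    case kError:  return HandleError(state);
    }
    return state;
}

// Explicit function-pointer table: the indirect transfer is visible
// in the source rather than hidden inside the switch lowering.
int (*const kHandlers[])(int) = { HandleEmit, HandleAppend, HandleReset, HandleError };

int DispatchTable(ActionCode code, int state)
{
    return kHandlers[code](state);
}
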
From: Joseph M. Newcomer on
Actually, cache locality is only ONE of the parameters. Instruction pipe depth, and as
pointed out, pipe flushing; speculative execution (such as the x86s and many other
architectures do very well), dynamic register renaming, L2 cache vs. L1 cache, operand
prefetching, the depth of the operand lookahead pipe, etc. all come into play. In the
case of the x86, these vary widely across families of chips; lower-power (e.g., laptop)
chips generally have fewer of these features than server-oriented chipsets (e.g., high-end
Xeon, and i9). All these features involve more transistors, and higher clock speeds, and
both of these translate into higher power requirements. Little factors like TLB
collisions and TLB flush rates can change performance by integer multipliers, not just
single-digit percentages. The effects of network traffic and other kernel activities,
which impact the pipelines, TLB, caches, etc. can be quite disruptive to pretty models of
behavior, even if you manage to model precisely what is going on in the abstract chip set.
I've seen my desktop report processing 1K interrupts/second, so 1K times per second my
idealized model of cache management gets scrambled by code I have no control over. This
is reality. This is why NOTHING matters except MEASURED performance. Not theoretical
performance, not performance under some "ideal conditions" model, but performance
predicted by counting instructions or guessing at memory delays, but ACTUAL, MEASURED
performance.

This is why the only measure of performance is actual execution, and your numbers are
valid ONLY on the machine and under the conditions you measure them with, and do not
necessarily predict good performance on a different CPU model or different motherboard
chipset.
joe
Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
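
In that spirit, a minimal C++11 timing harness looks something like the sketch
below. Process() is a hypothetical stand-in for whatever is being measured,
and the numbers it prints are valid only for the machine and load under which
it runs.

#include <chrono>
#include <cstdio>
#include <ratio>

// Hypothetical stand-in for the code under test.
static int Process(int x) { return x * 2 + 1; }

int main()
{
    using Clock = std::chrono::steady_clock;

    volatile int sink = 0;            // keeps the loop from being optimized away
    const int iterations = 10000000;

    const Clock::time_point start = Clock::now();
    for (int i = 0; i < iterations; ++i)
        sink = Process(sink);
    const Clock::time_point stop = Clock::now();

    const double total_ns =
        std::chrono::duration<double, std::nano>(stop - start).count();
    std::printf("%.1f ns per call (on this machine, under this load)\n",
                total_ns / iterations);
    return 0;
}
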
From: Peter Olcott on
On 5/20/2010 7:31 PM, Joseph M. Newcomer wrote:
> Actually, cache locality is only ONE of the parameters. Instruction pipe depth, and as
> pointed out, pipe flushing; speculative execution (such as the x86s and many other
> architectures do very well), dynamic register renaming, L2 cache vs. L1 cache, operand
> prefetching, the depth of the operand lookahead pipe, etc. all come into play. In the
> case of the x86, these vary widely across families of chips; lower-power (e.g., laptop)
> chips generally have fewer of these features than server-oriented chipsets (e.g., high-end
> Xeon, and i9). All these features involve more transistors, and higher clock speeds, and
> both of these translate into higher power requirements. Little factors like TLB
> collisions and TLB flush rates can change performance by integer multipliers, not just
> single-digit percentages. The effects of network traffic and other kernel activities,
> which impact the pipelines, TLB, caches, etc. can be quite disruptive to pretty models of
> behavior, even if you manage to model precisely what is going on in the abstract chip set.
> I've seen my desktop report processing 1K interrupts/second, so 1K times per second my
> idealized model of cache management gets scrambled by code I have no control over. This
> is reality. This is why NOTHING matters except MEASURED performance.

This is a gross exaggeration. I once had a non-techie boss who wrote a
program that read his data from disk fifty times because there were fifty
different kinds of data. It could easily be known in advance that there was
a much better way to do this.

A more accurate statement might be that unmeasured performance estimates are
most often very inaccurate. It is also probably true that faster methods can
often be distinguished without measurement from methods that are much slower
(at least an order of magnitude).

Because I am so fanatical about optimization, and because I have done some
further investigation, I am still confident that my UTF-8 recognizer has
the fastest possible design. I would agree with you that this claim doesn't
really count until it is proven with working code.

> Not theoretical
> performance, not performance under some "ideal conditions" model, but performance
> predicted by counting instructions or guessing at memory delays, but ACTUAL, MEASURED
> performance.
>
> This is why the only measure of performance is actual execution, and your numbers are
> valid ONLY on the machine and under the conditions you measure them with, and do not
> necessarily predict good performance on a different CPU model or different motherboard
> chipset.
> joe
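
For reference on the UTF-8 recognizer mentioned above: its actual design is
not shown in the thread, but a minimal, generic lead-byte/continuation-byte
check looks roughly like the sketch below. Note that it accepts some
ill-formed sequences (overlong forms, surrogates, values above U+10FFFF) that
a strictly conforming validator must reject.

#include <cstddef>

// Generic UTF-8 well-formedness check: lead byte plus the right number
// of continuation bytes.  It does NOT reject overlong encodings,
// surrogate code points, or values above U+10FFFF.
bool IsLikelyUtf8(const unsigned char* buf, std::size_t len)
{
    std::size_t i = 0;
    while (i < len) {
        const unsigned char lead = buf[i];
        std::size_t trail;
        if      (lead < 0x80)           trail = 0;  // ASCII
        else if ((lead & 0xE0) == 0xC0) trail = 1;  // 2-byte sequence
        else if ((lead & 0xF0) == 0xE0) trail = 2;  // 3-byte sequence
        else if ((lead & 0xF8) == 0xF0) trail = 3;  // 4-byte sequence
        else return false;              // stray continuation or invalid lead
        if (i + trail >= len) return false;         // truncated at end of buffer
        for (std::size_t j = 1; j <= trail; ++j)
            if ((buf[i + j] & 0xC0) != 0x80) return false;
        i += trail + 1;
    }
    return true;
}
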
From: Mihai N. on
> Note that they do refer to ISO 10646 (see the footnote on page 19 of
> the draft standard (30) which has already been cited in this thread).

This is what I was alluding to when I wrote "changing lately." (Although not
technically Unicode, ISO 10646 is a good subset, and it is kept in sync with
Unicode pretty well.)
(By "subset" I don't mean fewer characters encoded, but that some parts are
missing, like all the character properties and all the UTS-es.)


> The key document is referenced as
> http://www.rfc-editor.org/rfc/bcp/bcp47.txt which is
> actually RFC5646. This is a lengthy document but worth reading.

And that is a bad thing, because RFC5646 is a way to tag languages, not
locales.

In most cases there is no difference, but with UTS-35 you can say:
de-DE@collation=phonebook (German-Germany with phonebook sorting)
or ar@calendar=islamic (Arabic with Islamic calendar)
or even ja-JP@calendar=japanese;numbers=jpanfin (Japanese-Japan,
using the Japanese imperial calendar and Japanese financial numerals)

That's something you can't do with RFC5646. (In fact, the RFC says "For
systems and APIs, language tags form the basis for most implementations of
locale identifiers," and it points you to UTS-35 as an example.)



--
Mihai Nita [Microsoft MVP, Visual C++]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email
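
To make the locale-versus-language-tag point concrete, here is a minimal
sketch using ICU4C, which implements UTS-35 locale identifiers. The strings
and the phonebook/standard comparison are only illustrations, and the
observed ordering depends on the ICU data installed on the machine.

#include <unicode/coll.h>
#include <unicode/locid.h>
#include <unicode/unistr.h>
#include <cstdio>
#include <memory>

int main()
{
    UErrorCode status = U_ZERO_ERROR;

    // UTS-35 keywords ride along with language and region:
    // de_DE@collation=phonebook selects German phonebook sorting.
    icu::Locale phonebook("de", "DE", nullptr, "collation=phonebook");
    icu::Locale standard("de", "DE");

    std::unique_ptr<icu::Collator> pb(icu::Collator::createInstance(phonebook, status));
    std::unique_ptr<icu::Collator> st(icu::Collator::createInstance(standard, status));
    if (U_FAILURE(status)) return 1;

    // "\xC3\xB6" is the UTF-8 encoding of o-umlaut.
    icu::UnicodeString oumlaut_f = icu::UnicodeString::fromUTF8("\xC3\xB6" "f");
    icu::UnicodeString of        = icu::UnicodeString::fromUTF8("of");

    // Phonebook collation treats o-umlaut like "oe", so the two
    // collators can order these strings differently.
    std::printf("phonebook: o-umlaut-f %s of\n", pb->greater(oumlaut_f, of) ? ">" : "<=");
    std::printf("standard:  o-umlaut-f %s of\n", st->greater(oumlaut_f, of) ? ">" : "<=");
    return 0;
}
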

From: Mihai N. on
> "What is the difference between a computer scientist, a newbie, and a
> software engineer?"
>
> Sounds like a setup for a joke, but it isn't.

A little bit like:
"In theory, there is no difference between theory and practice.
But, in practice, there is."
Also sounds like a joke, but it isn't :-)


--
Mihai Nita [Microsoft MVP, Visual C++]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email