From: Robert Myers on
On Apr 17, 6:58 pm, Brett Davis <gg...(a)yahoo.com> wrote:

>
> No, nice guess.
> I don't know of any CPU that stalls on a read after write; instead
> they try to forward the data, and in the rare case when a violation
> occurs the CPU will throw an interrupt and restart the instructions.
> This is an important comp.arch point, so someone will correct me
> if I am wrong.
>
> > The buffer overlap aliasing considerations in C prevents the
> > compiler from automatically rearranging the original LD ST
> > order to be more pipeline friendly for a particular cpu,
> > but Fortran would allow it.
>
> This is close; the answer is a mundane but important point.
> CPUs today have a load-to-use delay of ~2 cycles from level-one
> cache (less for slower chips, more for faster chips).
> An OoO chip can try to find other instructions to execute, but
> even it is subject to this delay. (I believe.)
> This copy loop is so small I don't think there is much even an
> Intel/AMD chip can do. I was hoping you would benchmark it. ;)
>
> A lot of the dual-issue CPUs are partially OoO and support
> hit-under-miss for reads, but these will epic fail at running
> code like this any faster.
>
> The future of computing (this decade) is lots of simple in-order CPUs.
> Rules of die size, heat and efficiency kick in. Like ATI chips.
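
To make the quoted aliasing and load-to-use points concrete, here is a
minimal C sketch (a hypothetical illustration, not code from the
thread; it assumes a C99 compiler that honours restrict):

#include <stddef.h>

/* Without restrict the compiler must assume dst and src may overlap,
   so it cannot freely hoist later loads above earlier stores, and on
   a simple in-order core each load feeds its store only after the L1
   load-to-use delay mentioned above (~2 cycles). */
void copy_plain(int *dst, const int *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* With C99 restrict the buffers are promised not to alias, so the
   compiler is free to unroll and schedule several loads ahead of the
   stores - roughly the assumption a Fortran compiler makes by default
   for its array arguments. */
void copy_restrict(int *restrict dst, const int *restrict src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}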
>
Regardless of how natural and even gratifying it may be to you, the
outlook you espouse may be job security for you, but it is what I
would deem an unacceptable burden for ordinary mortals.

I believe that even L1 cache delays have varied between at least one
and two cycles, that L2 remains very important even for OoO chips and
that L2 delays vary even more, that L3 delay tradeoffs are something
that someone like, say, Andy would understand, but that most others
wouldn't, and that the circumstances that cause a stall are not always
clear, as evidenced by the discussion here.

If you can write code for the exact CPU and memory setup and test it
and have the time to do lots of tinkering, then super-slick hand-coded
optimizations might be worth talking about in something other than a
programming forum, and there not because the ideas have general
applicability, but because that's the kind of detail that so many
programmers seem keen on.

As it is, the computer architecture tradeoffs, like the tradeoffs in
cache delays, are probably obsessed over by computer architects, but I
can't see the relevance of a particular slick trick in C to any such
decision-making.

Robert.
From: nmm1 on
In article <734df17f-9536-4366-bd83-de1e552cbd1a(a)11g2000yqr.googlegroups.com>,
Robert Myers <rbmyersusa(a)gmail.com> wrote:
>>
>One of the more interesting posts you made on this subject was the
>amount of state that IA-64 carried and the complexity of the rules
>required to operate on that state.

That's not actually the aspect I was referring to.

>That all that cruft would lead to all kinds of problems seems hardly
>surprising, but it also seems hardly intrinsic to VLIW and/or putting
>more of the burden of scheduling on the compiler.

That is correct. It impacted badly on the interrupt issue, but much
less on the compiled code one.

>My assumption, backed by no evidence, is that HP/Intel kept adding
>"features" to get the architecture to perform as they had hoped until
>the architecture was sunk by its own features.

That might have happened in the very early days, but the feature set
was more-or-less fixed by 1994 - i.e. before the practical work
really got going. The main change to the architecture thereafter
was to DROP a feature - the asynchronous register loading/unloading.

>You think the problem is fundamental. I think the problem is
>fundamental only because of the way that code is written, in a
>language that leaves the compiler to do too much guessing for the idea
>to have even a hope of working at all.

Not at all. It also applies to other, much simpler, architectures.
The point is that the problem about 'fancy' optimisation is NOT in
the code generation, but in the program analysis. And profile-
based optimisation has precisely the restrictions that John Dallman
described, and has been known to for 25 years and more.

The reason that Fortran outperforms C and C++ on most extreme HPC
architectures isn't because more effort is put into the compiler
(actually, it's almost always less), but because decent Fortran
code is just SO much more analysable.

>The early work from IBM *didn't* just look at computation-heavy, very
>repetitive HPC-like codes. It examined implausible things like word
>processors and found a tremendous amount of predictability in behavior
>such as computation paths. Maybe most of that predictability has now
>been successfully absorbed by run-time branch predictors, making the
>possible gains in trying to exploit it at the compile stage moot.

That is correct, but there is one extra niggle. Most of the papers
I have seen from compscis on this area have had a fatal flaw - they
look at the post-hoc ILP and predictability. That is a classic piece
of stupidity (and, yes, it's stupid). The massive potential gains
never were real, because they required a perfect oracle - and such
oracles do not exist.

As a result, the predictability isn't good enough (in most programs)
to be worth guessing more than a few branches ahead - and designs
like the Itanic needed successful prediction of dozens.
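
As a hypothetical illustration (my own example, not from the thread):
in code like the tree walk below, which way each step goes is unknown
until that step's data arrives, so the long "predictable" runs that a
post-hoc trace reports were only ever visible to a perfect oracle.

#include <stddef.h>

/* Hypothetical example: a binary-search-tree walk.  Which child is
   followed at each step depends on data loaded in that step, so a
   schedule that needs dozens of correctly guessed branches in flight
   never gets them, however regular a trace of any one run looks
   after the fact. */
struct rec {
    long key;
    const struct rec *left, *right;
};

const struct rec *lookup(const struct rec *t, long key)
{
    while (t != NULL && t->key != key)
        t = (key < t->key) ? t->left : t->right;
    return t;
}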

>Since the world *does* write in languages that defy optimization, and
>most of the work on languages does not seem interested in how
>optimizable a language is, the net conclusion is the same: the idea
>will never work, but not for the almost-mathematical reasons you
>claim.

Eh? I was talking about the inherent optimisability of those very
languages, and always have been doing so. It's a semi-mathematical
constraint based on using those languages, in their current paradigms.
For heaven's sake, I started saying that we needed higher level
languages to tackle this problem back around 1971 - and I was by
no means the first person to do that!


Regards,
Nick Maclaren.
From: Robert Myers on
On Apr 18, 11:25 am, n...(a)cam.ac.uk wrote:
> In article <734df17f-9536-4366-bd83-de1e552cb...(a)11g2000yqr.googlegroups.com>,
> Robert Myers  <rbmyers...(a)gmail.com> wrote:
>
>
>
> >One of the more interesting posts you made on this subject was the
> >amount of state that IA-64 carried and the complexity of the rules
> >required to operate on that state.
>
> That's not actually the aspect I was referring to.
>
> >That all that cruft would lead to all kinds of problems seems hardly
> >surprising, but it also seems hardly intrinsic to VLIW and/or putting
> >more of the burden of scheduling on the compiler.
>
> That is correct.  It impacted badly on the interrupt issue, but much
> less on the compiled code one.
>
> >My assumption, backed by no evidence, is that HP/Intel kept adding
> >"features" to get the architecture to perform as they had hoped until
> >the architecture was sunk by its own features.
>
> That might have happened in the very early days, but the feature set
> was more-or-less fixed by 1994 - i.e. before the practical work
> really got going.  The main change to the architecture thereafter
> was to DROP a feature - the asynchronous register loading/unloading.
>
> >You think the problem is fundamental.  I think the problem is
> >fundamental only because of the way that code is written, in a
> >language that leaves the compiler to do too much guessing for the idea
> >to have even a hope of working at all.
>
> Not at all.  It also applies to other, much simpler, architectures.
> The point is that the problem about 'fancy' optimisation is NOT in
> the code generation, but in the program analysis.  And profile-
> based optimisation has precisely the restrictions that John Dallman
> described, and has been known to for 25 years and more.
>
> The reason that Fortran outperforms C and C++ on most extreme HPC
> architectures isn't because more effort is put into the compiler
> (actually, it's almost always less), but because decent Fortran
> code is just SO much more analysable.
>
I think we are in complete agreement on this point, and I'm not sure
why you think we disagree. The difficulty is not in the compiler, but
in the way the information is presented to the compiler in the first
place.

The immovable obstacle is not the number of instructions you can get
per clock or the degree of useful predictability in actual codes, but
an apparent complete lack of interest, on the part of those who create
programming languages, in the kind of information that needs to be
passed to the compiler to allow it to schedule successfully.

The advantage of human programmers is *not* that they know about L1
delays. Compilers can know that kind of stuff, too. The advantage of
human programmers is that they have all the information that they
throw away when writing in a language like C.
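
As a small, hypothetical C sketch of what I mean (the builtin is a
GCC/Clang extension, named here only as an example of the workaround,
not as standard C): the programmer knows facts about overlap, alignment
and trip counts that the language gives no way to state, so they have
to be smuggled back in by hand - or lost.

#include <stddef.h>

/* The author knows dst and src never overlap, are 64-byte aligned,
   and that n is always a multiple of 16 - information plain C throws
   away.  restrict and __builtin_assume_aligned (a GCC/Clang
   extension) hand some of it back; the trip-count fact has no
   portable home at all. */
void scale(float *restrict dst, const float *restrict src,
           size_t n, float k)
{
    float *d = __builtin_assume_aligned(dst, 64);
    const float *s = __builtin_assume_aligned(src, 64);

    for (size_t i = 0; i < n; i++)
        d[i] = k * s[i];
}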

Robert.
From: jgd on
In article
<734df17f-9536-4366-bd83-de1e552cbd1a(a)11g2000yqr.googlegroups.com>,
rbmyersusa(a)gmail.com (Robert Myers) wrote:

> My assumption, backed by no evidence, is that HP/Intel kept adding
> "features" to get the architecture to perform as they had hoped until
> the architecture was sunk by its own features.

My own view is backed by some anecdotal evidence: I was quite early in
the queue of people learning to do Itanium porting, coming into it in
1998. A couple of things that I was told about the initial design groups
seemed telling: that they were absolutely sure that it was going to be a
stunning success, "because they were Intel and HP", and that it
contained a large number of amazingly bright people, or at least ones
who were considered amazingly bright by their employers; "can't
communicate well with ordinary people" was a quote from someone who
claimed to have worked with them.

This leads me to theorise that too many people's pet ideas got
incorporated, without being properly integrated with each other. The way
that the Register Stack Engine works on integer registers, but not
floating-point ones, the way that the predicate registers are set up -
and some of the instructions that combine them, which I was never able
to understand, and which I never saw compilers use - and such like seem
to support this theory.

There were also a couple of assumptions that turned out to be wrong.
The first was that code size didn't matter. Memory to store code in is
indeed very cheap, but cache space is not, and the bulky instruction
encoding meant that the caches were less effective, especially if you
are not spending all your time running tight loops that fit in L1 but
are doing a fair amount of branching. That places lots of pressure on
bandwidth and latency between main memory and cache, which didn't seem
to be top-priority areas of the design.

The second was that the instruction set was the key idea. That's really
an idea from the eighties, at the latest. By the mid-nineties it should
have become clear, especially to people inside Intel with access to the
work being done on the Pentium Pro/II/III core design, that the limits
on processor power had a lot more to do with caches and the speed of
communications between processor and memory: instruction decode no
longer took up enough transistors to be worth pre-optimising at the

--
John Dallman, jgd(a)cix.co.uk, HTML mail is treated as probable spam.
From: "Andy "Krazy" Glew" on
On 4/18/2010 1:36 AM, nmm1(a)cam.ac.uk wrote:
> As I have
> posted before, I favour a heterogeneous design on-chip:
>
> Essentially uninterruptible, user-mode only, out-of-order CPUs
> for applications etc.
> Interruptible, system-mode capable, in-order CPUs for the kernel
> and its daemons.

This is almost the opposite of what I would expect.

Out-of-order tends to benefit OS code more than many user codes. In-order coherent threading mainly benefits fairly
stupid codes that run in user space, like multimedia.

I would guess that you are motivated by something like the following:

System code tends to have unpredictable branches, which hurt many OOO machines.

In system code you may want to be able to respond to interrupts easily. I am guessing that you believe that OOO has worse
interrupt latency. That is a misconception: OOO machines tend to have better interrupt latency, since they usually redirect
to the interrupt handler at retirement. However, they lose more work.

(Anecdote: in P6 I asked the Novell NetWare guys if they wanted better interrupt latency or minimal work lost. They
preferred the latter, even at the cost of longer interrupt latency. However, we gave them the former, because it was
easier.)

Also, OOO CPUs tend to have more state like TLBs and caches that is not directly related to interrupts, but which
affects interrupt latency.

Finally, it is true that tricks like alternate register sets for interrupt handlers tend to be more prevalent on in-order machines.

--

I think workloads may be divided into several categories:

a) Trivial, throughput-oriented - the sort of workload that benefits most from in-order, coherent-threaded, GPU-style
microarchitectures. Lots of parallelism. Simple instruction and memory coherency.

b) Classical out-of-order workloads: irregular parallelism, pointer chases but also sizeable work at each pointer miss.
Predictable branches.

c) Intermediate: unpredictable branches, but pointers with fan-out/MLP. Classic system code. For that matter, let's
throw interrupt latency into the mix.


OOO dataflow can help speed up system code in class c), but you may lose the benefit due to branch mispredictions.
Better to switch threads than to predict a flakey branch.

Now, code in class c) can also be executed on in-order thread-switching systems. OOO dataflow just improves the latency
of such code, which amounts to reducing the number of threads needed for a given performance level. Since, in my
experience, there are far fewer threads in class c) than in either of the other classes, reducing the number of system
threads required seems like a good tradeoff.
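
For what it's worth, a hypothetical C sketch of how I read class c) -
pointer fan-out that gives the hardware independent misses to overlap
(MLP), with data-dependent control flow that no predictor sees far
ahead (my own illustration, not code from the thread):

#include <stddef.h>

/* Class c) as I read it: each node fans out to four children, giving
   the hardware several independent misses to overlap (MLP), while
   which children exist - and therefore which calls are made - is
   data-dependent control flow. */
struct tnode {
    long key;
    const struct tnode *child[4];
};

long count_matches(const struct tnode *t, long key)
{
    if (t == NULL)
        return 0;

    long hits = (t->key == key);
    for (int i = 0; i < 4; i++)
        hits += count_matches(t->child[i], key);
    return hits;
}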



The taxonomy is not complete. These are just the combinations that I see as most important.