From: Niels Jørgen Kruse on
Andy "Krazy" Glew <ag-news(a)patten-glew.net> wrote:

> The ARM Cortex A9 CPU is out-of-order, and is becoming more and more
> widely used in things like cell phones and iPads.

Cortex A9 is not shipping in any product yet (I believe). Lots of
preannouncements though. The Apple A4 CPU is currently believed to be a
tweaked Cortex A8, perhaps related to the tweaked A8 that Intrinsity did
for Samsung before Intrinsity was acquired by Apple.

Someone with a jailbroken iPad (or who has paid the $99 fee) could run
benchmarks to probe the properties of the CPU.
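
A crude first probe might look like the sketch below (untested, and the
loop bodies and counts are only illustrative): time one long chain of
dependent adds against several independent chains of the same total
length. A strictly in-order single-issue core runs both at much the same
rate, while a wider and/or out-of-order core overlaps the independent
chains. On its own it is not decisive (a dual-issue in-order core like
the A8 also overlaps them), so a fuller probe would also test reordering
around cache misses.

/* ooo_probe.c - rough sketch only.  Build at a low optimisation level
   (e.g. -O1) and check the generated code so the loops survive intact. */
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <time.h>

#define ITERS 100000000UL

/* POSIX timer; substitute the platform's own high-resolution clock
   where clock_gettime() is missing. */
static double seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    unsigned long a = 1, b = 1, c = 1, d = 1, i;
    double t0, t_dep, t_indep;

    /* One dependent chain: every add needs the previous result. */
    t0 = seconds();
    for (i = 0; i < ITERS; i++)
        a += i;
    t_dep = seconds() - t0;

    /* Four independent chains, same total number of adds. */
    t0 = seconds();
    for (i = 0; i < ITERS; i += 4) {
        a += i; b += i; c += i; d += i;
    }
    t_indep = seconds() - t0;

    printf("dep %.3fs  indep %.3fs  ratio %.2f  (%lu)\n",
           t_dep, t_indep, t_dep / t_indep, a + b + c + d);
    return 0;
}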

--
Mvh./Regards, Niels Jørgen Kruse, Vanløse, Denmark
From: Robert Myers on
On Apr 17, 11:07 pm, "Andy \"Krazy\" Glew" <ag-n...(a)patten-glew.net>
wrote:

> I suspect that we will end up in a bifurcated market: out-of-order for the high performance general purpose computation
> in cell phones and other important portable computers, in-order in the SIMD/SIMT/CoherentThreading GPU style
> microarchitectures.
>
> The annoying thing about such bifurcation is that it leads to hybrid heterogeneous architectures - and you never know how
> much to invest in either half.  Whatever resource allocation you make to in-order SIMD vs. ooo scalar will be wrong for
> some workloads.
>
A significant part of the resources of the Cray 1 was nearly useless
to almost any customer who bought and/or used such machines.
Machines were bought by customers who had no use for the vector
registers and by customers for whom a whole class of scalar registers
was nearly beside the point.

However difficult those choices may have been (and I'm not sure they
weren't less important than the cost and cooling requirements of the
memory), the machines were built and people bought and used them.

I don't think the choices are nearly as hard now. Transistors are
nearly free, but active transistors consume watts, which aren't free.
There are design costs to absorb, but you'd rather spread those costs
over as many chips as possible, even if it means that most customers
have chips with capabilities they never use. So long as the useless
capabilities are idle and consume no watts, everyone is happy.

> I think that the most interesting thing going forward will be microarchitectures that are hybrids, but which are
> homogeneous: where ooo code can run reasonably efficiently on a microarchitecture that can run GPU-style threaded SIMD /
> Coherent threading as well.  Or vice versa.  Minimizing the amount of hardware that can only be used for one class of
> computation.

I thought that was one of the goals of pushing scheduling out to the
compiler. I still don't know whether the goal was never possible or
Itanium was just a hopelessly clumsy design.

Robert.
From: jgd on
In article
<7ca97a3f-11a0-47d4-b9fa-181024a9e9c8(a)z3g2000yqz.googlegroups.com>,
rbmyersusa(a)gmail.com (Robert Myers) wrote:

> I thought that was one of the goals of pushing scheduling out to the
> compiler. I still don't know whether the goal was never possible or
> Itanium was just a hopelessly clumsy design.

It seems to have been possible for a limited class of problems: ones
where you could use profile-guided optimisation on a relatively small
amount of critical code that consumed almost all the CPU time, with
example data that truly represented (almost) all of the data that was
likely to be put through that critical code, and which used a fairly
small selection of the possible code paths through the critical code.

This comes down, in practice, to "code that's much like the benchmarks
that were studied before designing the architecture". Which is rather
different from all the code that people want to run on high-performance
computers.

So while pushing scheduling out to the compiler was *possible*, it
doesn't seem to have been *practical*. I still bear many scars from
the Itanium, and it has had the effect of making many of the ideas
used in it unattractive for years to come.

--
John Dallman, jgd(a)cix.co.uk, HTML mail is treated as probable spam.
From: nmm1 on
In article <2dOdnQbD_v6TkFbWnZ2dnUVZ8lCdnZ2d(a)giganews.com>,
<jgd(a)cix.compulink.co.uk> wrote:
>In article
><7ca97a3f-11a0-47d4-b9fa-181024a9e9c8(a)z3g2000yqz.googlegroups.com>,
>rbmyersusa(a)gmail.com (Robert Myers) wrote:
>
>> I thought that was one of the goals of pushing scheduling out to the
>> compiler. I still don't know whether the goal was never possible or
>> Itanium was just a hopelessly clumsy design.
>
>It seems to have been possible for a limited class of problems: ones
>where you could use profile-guided optimisation on a relatively small
>amount of critical code that consumed almost all the CPU time, with
>example data that truly represented (almost) all of the data that was
>likely to be put through that critical code, and which used a fairly
>small selection of the possible code paths through the critical code.

Precisely. Exactly as every expert expected.

What seems to have happened is that a few commercial compscis[*]
demonstrated the approach working on some carefully selected programs,
and persuaded the decision makers that they could deliver it on most
of the important, performance-critical codes. The fact that it was
known to be infeasible, and had been for 25 years, was ignored.
I have no idea which people were responsible for that, though I
have heard that anyone who queried the party line was howled down
and moved to other work. But that's hearsay.

I said that the project would fail, and why, in detail, in 1995/6.
One of the two aspects they partially fixed up (the interrupt one),
with great difficulty and by dropping one of the most important
performance features. The other was a spectacular failure, for
precisely the reasons I gave. And I have never claimed to be a
world expert - all I was using was common knowledge to people who
had worked in or with those areas.

[*] NOT a flattering term.


Regards,
Nick Maclaren.
From: Robert Myers on
On Apr 18, 10:40 am, n...(a)cam.ac.uk wrote:
> Precisely.  Exactly as every expert expected.
>
> What seems to have happened is that a few commercial compscis[*]
> demonstrated the approach working on some carefully selected programs,
> and persuaded the decision makers that they could deliver it on most
> of the important, performance-critical codes.  The fact that it was
> known to be infeasible, and had been for 25 years, was ignored.
> I have no idea which people were responsible for that, though I
> have heard that anyone who queried the party line was howled down
> and moved to other work.  But that's hearsay.
>
> I said that the project would fail, and why, in detail, in 1995/6.
> One of the two aspects they partially fixed up (the interrupt one),
> with great difficulty and by dropping one of the most important
> performance features.  The other was a spectacular failure, for
> precisely the reasons I gave.  And I have never claimed to be a
> world expert - all I was using was common knowledge to people who
> had worked in or with those areas.
>
> [*] NOT a flattering term.
>
One of the more interesting posts you made on this subject concerned
the amount of state that IA-64 carried and the complexity of the rules
required to operate on that state.

That all that cruft would lead to all kinds of problems seems hardly
surprising, but it also seems hardly intrinsic to VLIW and/or putting
more of the burden of scheduling on the compiler.

My assumption, backed by no evidence, is that HP/Intel kept adding
"features" to get the architecture to perform as they had hoped until
the architecture was sunk by its own features.

You think the problem is fundamental. I think the problem is
fundamental only because of the way that code is written, in a
language that leaves the compiler to do too much guessing for the idea
to have even a hope of working at all.
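
One standard illustration of the guessing I mean (my own example, not
anyone's production code): in plain C the compiler must assume the two
pointers below may alias, so it has to reload b[0] on every iteration
and cannot freely reorder the stores; C99's restrict hands it the fact
it was otherwise forced to guess at.

/* alias.c - why C makes the compiler guess. */

/* Here a and b may alias, so any store to a[i] might change b[0];
   the compiler must reload b[0] each time round the loop. */
void scale(float *a, const float *b, int n)
{
    for (int i = 0; i < n; i++)
        a[i] = a[i] * b[0];
}

/* With restrict the programmer asserts no aliasing, so b[0] can be
   hoisted out of the loop and the loop scheduled or vectorised. */
void scale_restrict(float * restrict a, const float * restrict b, int n)
{
    for (int i = 0; i < n; i++)
        a[i] = a[i] * b[0];
}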

The early work from IBM *didn't* just look at computation-heavy, very
repetitive HPC-like codes. It examined implausible things like word
processors and found a tremendous amount of predictability in behavior
such as computation paths. Maybe most of that predictability has now
been successfully absorbed by run-time branch predictors, making the
possible gains from trying to exploit it at the compile stage moot.
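
For what it's worth, the run-time mechanism that soaked up much of that
predictability is cheap enough to sketch in a few lines (a textbook
two-bit saturating counter, not any particular vendor's predictor):

/* bp.c - toy 2-bit saturating-counter branch predictor. */
#include <stdio.h>

#define TABLE 1024
static unsigned char ctr[TABLE];   /* 0,1 predict not-taken; 2,3 taken */

static int predict(unsigned long pc)
{
    return ctr[pc % TABLE] >= 2;
}

static void update(unsigned long pc, int taken)
{
    unsigned char *c = &ctr[pc % TABLE];
    if (taken && *c < 3) (*c)++;
    if (!taken && *c > 0) (*c)--;
}

int main(void)
{
    /* A loop branch taken 99 times then not taken once, repeated ten
       times: the sort of highly repetitive behaviour the IBM studies
       found even in "word processor" code. */
    unsigned long pc = 0x4000;
    int hits = 0, total = 0;
    for (int rep = 0; rep < 10; rep++) {
        for (int i = 0; i < 100; i++) {
            int taken = (i != 99);
            hits += (predict(pc) == taken);
            total++;
            update(pc, taken);
        }
    }
    printf("accuracy: %d/%d\n", hits, total);
    return 0;
}

After a couple of warm-up misses it is wrong only at the loop exit,
i.e. roughly 99% accurate on this pattern, which is much of what the
compile-time schemes were hoping to capture statically.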

Since the world *does* write in languages that defy optimization, and
most of the work on languages does not seem interested in how
optimizable a language is, the net conclusion is the same: the idea
will never work, but not for the almost-mathematical reasons you
claim.

Robert.