From: Terje Mathisen on
On Oct 25, 3:32 am, Robert Myers <rbmyers...(a)gmail.com> wrote:
> On Oct 24, 9:59 pm, Robert Myers <rbmyers...(a)gmail.com> wrote:
>
>
>
> > On Oct 24, 9:40 pm, Andrew Reilly <andrew-newsp...(a)areilly.bpc-
>
> > users.org> wrote:
> > > On Sat, 24 Oct 2009 12:25:40 -0700, Robert Myers wrote:
> > > > I don't have any insight into what being architecture-naive on the other
> > > > architectures might be, but, for Itanium, you have to start with deep
> > > > insight into the code in order to get a payback on all the fancy bells
> > > > and whistles.  Itanium should be getting more instructions per clock,
> > > > not significantly fewer (that *was* the idea, wasn't it?).
>
> > > I've not used an Itanium, but it would seem to have quite a bit of
> > > practical similarity to the Texas Instruments TIC6000 series of VLIW DSP
> > > processors, in that it is essentially in-order VLIW with predicated
> > > instructions and some instruction encoding funkiness.  That whole idea is
> > > *predicated* on being able to software-pipeline loop bodies and do enough
> > > iterations to make them a worthwhile fraction of your execution time.  
> > > From memory, Anton's TeX benchmark is the exact opposite: strictly
> > > integer code of the twistiest non-loopy conditional nature.  I would not
> > > expect even a heroic compiler to get *any* significant parallel issues
> > > going, at which point it falls back to being an in-order RISC-like
> > > machine: not dramatically unlike a pre-Cortex ARM, or SPARC, as you said.
>
> > > Now, Texas' compilers for the C6000 *are* heroic, and I've seen them
> > > regularly schedule all eight possible instruction slots active per cycle,
> > > for appropriate DSP code.  The interesting thing is that this process is
> > > *extremely* fragile.  If the loop body contains too many instructions
> > > (for whatever reason), or some other limitation, then the compiler seems
> > > to throw up its hands and give you essentially single-instruction-per-
> > > cycle code, which is (comparatively) hopeless.  Smells like a box full of
> > > heuristics, rather than reliable proof.  The only way to proceed is to
> > > hack the source code into little pieces and try variations until the
> > > compiler behaves "well" again.
>
> > > At least the TI parts *do* get low power consumption out of the deal, and
> > > since they clock more slowly they don't have quite so many cycles to wait
> > > for a cache miss.  And no-one is trying to run TeX on them...
>
> > I get so tense here, trying to make sure I don't make a grotesque
> > mistake.
>
> > Your post made me chuckle.  Thanks.  I actually didn't even look at
> > the TeX numbers, only the ones I had first relied upon.  As a seventh-
> > grade teacher remarked, my laziness might one day be my undoing.
>
> > Thanks for calling attention to the TI compiler.  I've looked at the
> > TI DSP chips, but never gotten further.
>
> > You know just how heroic a heroic compiler really is.  I don't know
> > whether David Dinucci (did I get it right?) is still following.
>
> Forgive me for responding to my own post.  It was right here, in this
> forum, that Linus Torvalds, the one, the only, declared the stupidity
> of software pipelining because he was, well, you know, used to OoO
> processors.
>
> This is an amazing place.  Kudos to Terje who straightened me out.
From: Terje Mathisen on
On Oct 25, 3:32 am, Robert Myers <rbmyers...(a)gmail.com> wrote:
> Forgive me for responding to my own post.  It was right here, in this
> forum, that Linus Torvalds, the one, the only, declared the stupidity
> of software pipelining because he was, well, you know, used to OoO
> processors.
>
> This is an amazing place.  Kudos to Terje who straightened me out.
>
I don't mind Kudos from you Robert, but I don't think I deserve it
this time:

I didn't post anything about Linus' OoO ideas.

Terje

From: Robert Myers on
On Oct 25, 4:10 am, Terje Mathisen <terje.wiig.mathi...(a)gmail.com>
wrote:
> On Oct 25, 3:32 am, Robert Myers <rbmyers...(a)gmail.com> wrote:
> > Forgive me for responding to my own post.  It was right here, in this
> > forum, that Linus Torvalds, the one, the only, declared the stupidity
> > of software pipelining because he was, well, you know, used to OoO
> > processors.
>
> > This is an amazing place.  Kudos to Terje who straightened me out.
>
> I don't mind Kudos from you Robert, but I don't think I deserve it
> this time:
>
> I didn't post anything about Linus' OoO ideas.

Terje, you were kind enough to explain to me, in a private
correspondence, that, no matter how inept the coder or the compiler,
OoO hardware would eventually figure out and exploit a circumstance
where software pipelining might conceivably have been helpful. That
is to say (not your words, but mine) OoO hardware knows how to do
software pipelining, even if, in some rare awkward instances, it might
take a while. Something about large prime numbers.

Thus, the software pipelining capacity of Itanium is of no real
interest or use at all (words of the Lion of Finland, or wherever it
is, not mine) in the common circumstance, where large prime numbers
rarely come into play.

Please forgive me if I abuse our friendship, as I did not intend to do
so. Of course, if the hardware is in order (as it is in Itanium) the
advice of the Lion of Finland, whose entire universe consisted of x86,
would not obtain.

The further advice of this wizard is, as I have mentioned, available
elsewhere.

Robert.



From: nmm1 on
In article <4AE3A728.3040001(a)patten-glew.net>,
Andy \"Krazy\" Glew <ag-news(a)patten-glew.net> wrote:
>nmm1(a)cam.ac.uk wrote:
>>>
>>> I am not aware of an Itanium shipped or proposed that had an "x86 core
>>> on the side".
>>
>> I am. I can't say how far the proposal got - it may never have got
>> beyond the back of the envelope stage, and then got flattened as a
>> project by management.
>
>I am reasonably certain that you are misremembering or misunderstanding
>a presentation that may have oversimplified things.

I got that information from multiple different sources, including
presentations by Tier 1 vendors. As I said, it may have been merely
a draft idea that got the bum's rush as soon as it was shown to the
project management. But it DID get as far as being described as
something that was being considered, at least by some of the people
involved with liaison.

It wouldn't surprise me if the people doing the actual work never
took it seriously for a moment - there are a LOT of proposals that
get a very long way before those people ever hear of them :-(


Regards,
Nick Maclaren.
From: Mayan Moudgill on
Robert Myers wrote:
> > I didn't post anything about Linus' OoO ideas.
>
>
> Terje, you were kind enough to explain to me, in a private
> correspondence, that, no matter how inept the coder or the compiler,
> OoO hardware would eventually figure out and exploit a circumstance
> where software pipelinging might conceivably have been helpful. That
> is to say (not your words, but mine) OoO hardware knows how to do
> software pipelining,

Not really. There are quite a few situations in which an OoO processor
will not pick the optimal schedule. Think about it this way: OoO
processors, in some sense, completely unroll the loop and pick
instructions on an ASAP basis. This algorithm is NOT always optimal.

Consider the following (artificial) example:
A = op1 D
B = op2
C = op3 B
D = op4 A,C

Assume that op2 (produces B) takes 2 cycles and all others take 1 cycle. On
a single-issue OoO processor, the greedy schedule might be:
A,B,nop,C,D.
With compile-time scheduling, it would instead be:
B,A,C,D

So, greedy scheduling doesn't always give the optimal schedule.
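
To make that concrete, here is a toy single-issue model (entirely my own
sketch; the file and function names are made up, and it isn't anybody's
real scheduler) that replays the two issue orders above with the stated
latencies. It reports 5 cycles for the greedy, program-order issue and 4
cycles for the compile-time order:

/* toy_sched.c -- hypothetical illustration, not a real scheduler model.
 * Instructions: A=0, B=1, C=2, D=3.  op2 (which produces B) has a 2-cycle
 * latency, the rest take 1 cycle.  A's input D comes from the previous
 * iteration, so it is treated as already available. */
#include <stdio.h>

enum { NOPS = 4 };

static const int latency[NOPS] = { 1, 2, 1, 1 };
static const int deps[NOPS][2] = {
    { -1, -1 },   /* A = op1 D (D available from the previous iteration) */
    { -1, -1 },   /* B = op2                                             */
    {  1, -1 },   /* C = op3 B                                           */
    {  0,  2 },   /* D = op4 A,C                                         */
};

/* Issue the four instructions in the given order on a single-issue
 * machine; an instruction stalls until its operands are ready.  Return
 * the cycle in which the last result becomes available. */
static int makespan(const int *order)
{
    int ready[NOPS] = { 0 };
    int slot = 0;                          /* next free issue cycle */
    for (int k = 0; k < NOPS; k++) {
        int i = order[k];
        int start = slot;
        for (int d = 0; d < 2; d++)
            if (deps[i][d] >= 0 && ready[deps[i][d]] > start)
                start = ready[deps[i][d]];
        ready[i] = start + latency[i];
        slot = start + 1;
    }
    int last = 0;
    for (int i = 0; i < NOPS; i++)
        if (ready[i] > last)
            last = ready[i];
    return last;
}

int main(void)
{
    const int greedy[NOPS]    = { 0, 1, 2, 3 };   /* A, B, C, D */
    const int scheduled[NOPS] = { 1, 0, 2, 3 };   /* B, A, C, D */
    printf("greedy (program order): %d cycles\n", makespan(greedy));     /* 5 */
    printf("compile-time schedule : %d cycles\n", makespan(scheduled));  /* 4 */
    return 0;
}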

Continuing the complete-unroll analogy: in software pipelining, you
unroll the loop completely and apply the same scheduling actions to each
copy of the loop. So, if you move an operation one cycle earlier, you
move it earlier in every copy of the loop. The scheduling can be done
using algorithms that produce an optimal schedule for the loop (and I
mean _optimal_; people use integer linear programming for this problem).
Even with a heuristic approach, you generally get optimal schedules.
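
(For reference, my gloss rather than part of the original argument: in the
standard modulo-scheduling formulation, e.g. Rau's, "optimal" means the
kernel achieves the minimum initiation interval

    MII = max(ResMII, RecMII)

where ResMII = max over resources r of ceil(N_r / U_r), with N_r the
operations per iteration needing resource r and U_r the number of units of
r, and RecMII = max over dependence cycles c of ceil(L_c / D_c), with L_c
the total latency around the cycle and D_c its total iteration distance.)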

As you might guess, the fact that you have to do the same action on each
copy of the loop body (so to speak) can cause the schedule to be
sub-optimal compared to the one found by an OoO processor. Those
situations can (mostly) be resolved by unrolling the loop a certain
number of times prior to software pipelining.

Another deficiency of software pipelining compared with OoO is that SP
assumes particular latencies for operations, including loads. If those
latencies are not met (e.g. if SP assumed loads hit in L1 but they're
actually hitting in L2), then the dynamic schedule discovered by an OoO
processor may well out-perform SP.

> even if, in some rare awkward instances, it might
> take a while. Something about large prime numbers.
>

I'm kind of lost here; which large prime numbers are involved? If you're
talking about loops with a fractional II, then those might behave better
on an OoO processor. Terje?