From: Robert Myers on
On Oct 24, 6:31 pm, Bernd Paysan <bernd.pay...(a)gmx.de> wrote:

> One interesting property of quantum mechanics is that for irreversible
> logic, there's a minimum amount of energy that is necessary to make it
> happen.  Reversible logic does not have this drawback.  Therefore,
> people investigate reversible logic, even though the actual
> components to get that benefit are not in sight (not even carbon nanotube
> switches have these properties, even though they are much closer to the
> physical limits for irreversible logic).  Many people also forget that
> quantum mechanics does not properly take changes in the system into
> account, and that means that your reversible logic only works with the
> predicted low power when the inputs are not changing any more - and this
> is just the uninteresting case (the coherent one - changes in the system
> lead to decoherence, and thereby to classical physics).

Let's see. Quantum mechanics properly applied takes account of
everything in the whole universe, which is, so far as I know, quantum
mechanical and reversible in its entirety. If you could isolate
parts of the system, like your computing apparatus, then it would be
like a universe that is quantum mechanical and reversible in its
entirety. Such a device would have little use to us, because we could
neither give it new problems to work on nor read the results when it's
done.

In order to give the device a new problem, we must disturb it, but the
system can still retain enough coherence to function as a quantum
mechanical device. Only the entropy involved in the process of giving
the device input and reading the output has an irreducible cost in
energy that we must put on to the electric bill, as we will never get
it back, except as waste heat.
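For scale, that irreducible cost is Landauer's bound: erasing one bit at temperature T costs at least kT ln 2. A back-of-the-envelope check at room temperature:

```latex
% Landauer's bound: minimum energy dissipated per bit erased at temperature T
E_{\min} = k_B T \ln 2
         \approx (1.38\times 10^{-23}\,\mathrm{J/K}) \times (300\,\mathrm{K}) \times 0.693
         \approx 2.9\times 10^{-21}\,\mathrm{J\ per\ bit}
```

Tiny per bit, but it is a floor that only the irreversible read/write steps have to pay, not the reversible computation in between.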

Thus, even though you can't do operations with *no* net cost in
energy, we can still build and operate devices that act as quantum
mechanical computers to an arbitrarily good approximation. Writing to
them and reading from them is always an irreversible process that, if
repeated often enough, will eventually lead to the device having no
useful quantum mechanical coherence left for us to exploit, as we have
destroyed it all through our reading and writing. In the interim, we
can do an awful lot of computation. Otherwise, "quantum computers"
would not be possible.

I'm having a hard time reconciling how I understand the problem with
what you just said, which seems too sweeping and too black and white.
Can you help me out?

Robert.
From: "Andy "Krazy" Glew" on
nmm1(a)cam.ac.uk wrote:
> In article <4AE12FA9.1000706(a)patten-glew.net>,
> Andy \"Krazy\" Glew <ag-news(a)patten-glew.net> wrote:
>> Robert Myers wrote:
>>
>> I am not aware of an Itanium shipped or proposed that had an "x86 core
>> on the side".
>
> I am. I can't say how far the proposal got - it may never have got
> beyond the back of the envelope stage, and then got flattened as a
> project by management.

I am reasonably certain that you are misremembering or misunderstanding
a presentation that may have oversimplified things.
From: Andrew Reilly on
On Sat, 24 Oct 2009 12:25:40 -0700, Robert Myers wrote:

> I don't have any insight into what being architecture-naive on the other
> architectures might be, but, for Itanium, you have to start with deep
> insight into the code in order to get a payback on all the fancy bells
> and whistles. Itanium should be getting more instructions per clock,
> not significantly fewer (that *was* the idea, wasn't it?).

I've not used an Itanium, but it would seem to have quite a bit of
practical similarity to the Texas Instruments TIC6000 series of VLIW DSP
processors, in that it is essentially in-order VLIW with predicated
instructions and some instruction encoding funkiness. That whole idea is
*predicated* on being able to software-pipeline loop bodies and do enough
iterations to make them a worthwhile fraction of your execution time.
From memory, Anton's TeX benchmark is the exact opposite: strictly
integer code of the twistiest non-loopy conditional nature. I would not
expect even a heroic compiler to get *any* significant parallel issues
going, at which point it falls back to being an in-order RISC-like
machine: not dramatically unlike a pre-Cortex ARM, or SPARC, as you said.

Now, Texas' compilers for the C6000 *are* heroic, and I've seen them
regularly schedule all eight possible instruction slots active per cycle,
for appropriate DSP code. The interesting thing is that this process is
*extremely* fragile. If the loop body contains too many instructions
(for whatever reason), or some other limitation, then the compiler seems
to throw up its hands and give you essentially single-instruction-per-
cycle code, which is (comparatively) hopeless. Smells like a box full of
heuristics, rather than reliable proof. The only way to proceed is to
hack the source code into little pieces and try variations until the
compiler behaves "well" again.

At least the TI parts *do* get low power consumption out of the deal, and
since they clock more slowly they don't have quite so many cycles to wait
for a cache miss. And no-one is trying to run TeX on them...

Cheers,

--
Andrew
From: Robert Myers on
On Oct 24, 9:40 pm, Andrew Reilly <andrew-newsp...(a)areilly.bpc-
users.org> wrote:
> [...]
I get so tense here, trying to make sure I don't make a grotesque
mistake.

Your post made me chuckle. Thanks. I actually didn't even look at
the TeX numbers, only the ones I had first relied upon. As a seventh-
grade teacher remarked, my laziness might one day be my undoing.

Thanks for calling attention to the TI compiler. I've looked at the
TI DSP chips, but never gotten further.

You know just how heroic a heroic compiler really is. I don't know
whether David Dinucci (did I get it right?) is still following.

Robert.
From: Robert Myers on
On Oct 24, 9:59 pm, Robert Myers <rbmyers...(a)gmail.com> wrote:
> [...]

Forgive me for responding to my own post. It was right here, in this
forum, that Linus Torvalds, the one, the only, declared the stupidity
of software pipelining because he was, well, you know, used to OoO
processors.

This is an amazing place. Kudos to Terje who straightened me out.

You can find him on David Kanter's forum, if you're still interested.

Robert.