From: Robert Myers on
On Oct 24, 10:02 am, an...(a)mips.complang.tuwien.ac.at (Anton Ertl)
wrote:
> Robert Myers <rbmyers...(a)gmail.com> writes:
>
> [Speed of PA-RISC emulation on Itanium]
>
> >If I remember the numbers Anton provided, 50% per clock for untuned
> >code and a less than optimal compiler seems about right
>
> I don't know what you think you remember, but I have not presented
> PA-RISC results, simply because we have no PA-RISC box (for Gforth)
> and nobody has submitted PA-RISC results (for the latex benchmark).
>
> For those who wonder what this is all about, the message that he means
> is <2009Oct22.164...(a)mips.complang.tuwien.ac.at>, and the results
> referred to are
>
> http://www.complang.tuwien.ac.at/anton/euroforth/ef09/papers/ertl-sli...http://www.complang.tuwien.ac.at/franz/latex-bench
>3.
I couldn't get the link to work when I wrote the post. On your scale,
where ia32 is 1.0 performance per cycle, Itanium was between 0.35 and
0.40, barely better than ARM XScale. I took the ia32 to indicate a
compiler working with a processor that it was well-tuned to schedule
for and the Itanium results as indicative of how code that wasn't
analyzed or scheduled with much insight into ia64 would do. The
PA-RISC code would have been compiled in an environment that was
completely naive of Itanium, and I'm not surprised that it can't be
translated into code that does well on Itanium (any more than ia32
code can).

If the architecture depends heavily on the compiler and the code was
compiled and scheduled by a compiler that's naive of the architecture,
it's hardly surprising that it can't be translated into code that
performs well. That they got ia32 translation to work even acceptably
seems something of a miracle to me.

Robert.
From: Anton Ertl on
jgd(a)cix.compulink.co.uk writes:
>In article <2009Oct24.154356(a)mips.complang.tuwien.ac.at>,
>anton(a)mips.complang.tuwien.ac.at (Anton Ertl) wrote:
>
>> Judging from experience with Linux-Alpha, this probably means that the
>> kernel supports executing IA-32 executables, but needs a helper file
>> for that (on Linux-Alpha it was the emulator), and that file is
>> missing.
>
>What do you get when you run ldd on the IA-32 executable?

[ia64:~/gforth:25338] ldd ./gforth
not a dynamic executable
[ia64:~/gforth:25339] file ./gforth
./gforth: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.8, not stripped

>I'm
>wondering if it needs a different loader, since having one of those
>missing is one way of producing the error message you quote.

It's possible that it needs a different loader. AFAIK ldd then needs
that loader, too. Looking at the strace for ldd, I see:

stat("/lib/ld-linux.so.2", 0x60000fffffe4b480) = -1 ENOENT (No such file or directory)

With that, I found the package I needed to install on this Debian
system (ia32-libs), and now I can run IA32 programs on this IA64
machine. I just ran some simple Gforth benchmarks on it:

 sieve bubble matrix   fib
 0.764  1.000  0.560 1.188 IA64 code (gcc-4.1) on 900MHz Itanium II
 1.840  2.284  1.080 2.796 IA32 code (gcc-2.95) on 900MHz Itanium II
 0.261  0.299  0.156 0.375 IA32 code (gcc-2.95) on 2.26GHz Pentium 4

(These gcc versions give good performance for Gforth).
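
For readers who want the ratios behind this table, here is a minimal
Python sketch. It assumes the figures are run times (lower is better),
which the post does not state explicitly, and it takes the clock rates
from the row labels. The dictionary keys and function names are made up
for illustration. On these figures the IA32 binary comes out roughly
1.9x to 2.4x slower than native IA64 code on the same Itanium II.

```python
# Figures copied from the table in the post; ASSUMED to be run times
# (lower is better). Clock rates are the ones named in the row labels.
BENCHMARKS = ["sieve", "bubble", "matrix", "fib"]
RESULTS = {
    "ia64-on-itanium2": {"hz": 900e6, "times": [0.764, 1.000, 0.560, 1.188]},
    "ia32-on-itanium2": {"hz": 900e6, "times": [1.840, 2.284, 1.080, 2.796]},
    "ia32-on-pentium4": {"hz": 2.26e9, "times": [0.261, 0.299, 0.156, 0.375]},
}


def time_ratio(slow, fast):
    """Per-benchmark run-time ratio slow/fast (wall-clock view)."""
    return [s / f for s, f in zip(RESULTS[slow]["times"], RESULTS[fast]["times"])]


def cycle_ratio(slow, fast):
    """Per-benchmark ratio of clock cycles used (time multiplied by clock rate)."""
    a, b = RESULTS[slow], RESULTS[fast]
    return [(s * a["hz"]) / (f * b["hz"]) for s, f in zip(a["times"], b["times"])]


if __name__ == "__main__":
    comparisons = [
        ("IA32 vs IA64 code on Itanium II (time)",
         time_ratio("ia32-on-itanium2", "ia64-on-itanium2")),
        ("IA32 code, Itanium II vs Pentium 4 (cycles)",
         cycle_ratio("ia32-on-itanium2", "ia32-on-pentium4")),
    ]
    for label, ratios in comparisons:
        cells = ", ".join(f"{n}={r:.2f}" for n, r in zip(BENCHMARKS, ratios))
        print(f"{label}: {cells}")
```

The cycle-based ratio is only a rough per-clock comparison; it ignores
memory system and compiler-version differences between the two rows.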

Note that this Pentium 4 (released in May 2002 according to Wikipedia)
is contemporary with this Itanium II (released on 2002-07-08 according
to Wikipedia).
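
On the loader question earlier in this post: the path that the kernel
looks for (here /lib/ld-linux.so.2) is recorded in the executable's
PT_INTERP program header. The following is a minimal Python sketch
that reads it straight from the ELF file, assuming a well-formed ELF
image; the function name elf_interpreter is invented for this example.

```python
import struct

PT_INTERP = 3  # program header type that names the dynamic loader


def elf_interpreter(data):
    """Return the loader path requested by an ELF image, or None if static."""
    if data[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    is64 = data[4] == 2                     # EI_CLASS: 1 = 32-bit, 2 = 64-bit
    endian = "<" if data[5] == 1 else ">"   # EI_DATA: 1 = little-endian
    if is64:
        (e_phoff,) = struct.unpack_from(endian + "Q", data, 32)
        e_phentsize, e_phnum = struct.unpack_from(endian + "HH", data, 54)
    else:
        (e_phoff,) = struct.unpack_from(endian + "I", data, 28)
        e_phentsize, e_phnum = struct.unpack_from(endian + "HH", data, 42)
    for i in range(e_phnum):
        off = e_phoff + i * e_phentsize
        (p_type,) = struct.unpack_from(endian + "I", data, off)
        if p_type != PT_INTERP:
            continue
        if is64:
            (p_offset,) = struct.unpack_from(endian + "Q", data, off + 8)
            (p_filesz,) = struct.unpack_from(endian + "Q", data, off + 32)
        else:
            (p_offset,) = struct.unpack_from(endian + "I", data, off + 4)
            (p_filesz,) = struct.unpack_from(endian + "I", data, off + 16)
        # The segment holds a NUL-terminated path string.
        return data[p_offset:p_offset + p_filesz].rstrip(b"\0").decode()
    return None
```

Calling it as elf_interpreter(open("./gforth", "rb").read()) and then
checking whether the returned path exists shows directly whether the
loader is the missing piece.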

- anton
--
M. Anton Ertl Some things have to be seen to be believed
anton(a)mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html
From: Anton Ertl on
Robert Myers <rbmyersusa(a)gmail.com> writes:
>On Oct 24, 10:02 am, an...(a)mips.complang.tuwien.ac.at (Anton Ertl)
>wrote:
>> http://www.complang.tuwien.ac.at/anton/euroforth/ef09/papers/ertl-sli...http://www.complang.tuwien.ac.at/franz/latex-bench
>>3.
>I couldn't get the link to work when I wrote the post.

That's no wonder, because apparently your newsreader mutilates it.
Here it is again:

http://www.complang.tuwien.ac.at/anton/euroforth/ef09/papers/ertl-slides.pdf

>On your scale,
>where ia32 is 1.0 performance per cycle,

Different IA32 implementations have different performance per cycle in
the range of 0.55-1.0.

>Itanium was between 0.35 and
>0.40, barely better than ARM XScale.

~0.39, in the same ballpark as the other non-IA32/AMD64 CPUs (~0.34-0.53).

>I took the ia32 to indicate a
>compiler working with a processor that it was well-tuned to schedule
>for and the Itanium results as indicative of how code that wasn't
>analyzed or scheduled with much insight into ia64 would do.

So the PPC, Alpha and ARM results are also due to lack of insight into
the scheduling requirements of the CPU in your opinion?

My theory (which you can find in the text of that slide) is that the
better performance of the IA32 and AMD64 implementations on this
benchmark is because they perform indirect-branch prediction and most
of the others do not (hmm, the 21264B also has a kind of
indirect-branch predictor, but the performance is still not so great
at ~0.43; I have no theory for that).

Unless the PA-RISC implementation you are thinking of has an
indirect-branch predictor, I have no reason to expect it to perform
better than ~0.5.
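
To illustrate why indirect-branch prediction matters for a threaded-code
interpreter like Gforth, here is a toy Python model. It assumes the
simplest possible predictor, a branch target buffer that predicts the
last target seen for each branch, and it treats every dispatch on a CPU
without such a predictor as a miss. Both are simplifications, and the
function names are invented for this sketch.

```python
def threaded_trace(program, iterations):
    """Indirect branches taken by a threaded-code loop over `program`.

    Each VM instruction's routine ends in its own indirect jump to the
    routine of the next VM instruction, so the branch is identified by
    the current instruction and the target by the following one.
    """
    n = len(program)
    trace = []
    for _ in range(iterations):
        for i, op in enumerate(program):
            trace.append((op, program[(i + 1) % n]))
    return trace


def btb_mispredictions(trace):
    """Count misses of a last-target predictor (a toy BTB) over the trace."""
    btb = {}
    misses = 0
    for branch, target in trace:
        if btb.get(branch) != target:
            misses += 1
        btb[branch] = target  # remember the most recent target
    return misses


if __name__ == "__main__":
    trace = threaded_trace(["a", "b", "c", "d"], 100)
    print("dispatches:", len(trace))
    print("misses without a predictor (assumed):", len(trace))
    print("misses with a toy BTB:", btb_mispredictions(trace))
```

In this model a loop of four distinct VM instructions misses only while
the buffer is cold, whereas a program that reuses one VM instruction
with different successors keeps missing on that instruction's branch.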

- anton
--
M. Anton Ertl Some things have to be seen to be believed
anton(a)mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html
From: Robert Myers on
On Oct 24, 2:39 pm, an...(a)mips.complang.tuwien.ac.at (Anton Ertl)
wrote:

> Robert Myers wrote

> >On your scale,
> >where ia32 is 1.0 performance per cycle,
>
> Different IA32 implementations have different performance per cycle in
> the range of 0.55-1.0.
>
> >Itanium was between 0.35 and
> >0.40, barely better than ARM XScale.
>
> ~0.39, in the same ballpark as the other non-IA32/AMD64 CPUs (~0.34-0.53).
>
> >I took the ia32 to indicate a
> >compiler working with a processor that it was well-tuned to schedule
> >for and the Itanium results as indicative of how code that wasn't
> >analyzed or scheduled with much insight into ia64 would do.
>
> So the PPC, Alpha and ARM results are also due to lack of insight into
> the scheduling requirements of the CPU in your opinion?
>
> My theory (which you can find in the text of that slide) is that the
> better performance of the IA32 and AMD64 implementations on this
> benchmark is because they perform indirect-branch prediction and most
> of the others do not (hmm, the 21264B also has a kind of
> indirect-branch predictor, but the performance is still not so great
> at ~0.43; I have no theory for that).
>
> Unless the PA-RISC implementation you are thinking of has an
> indirect-branch predictor, I have no reason to expect it to perform
> better than ~0.5.
>
I don't have enough insight into the other architectures to comment.
I first looked at the chart and said, yup, just like I said, it's a
compiler built and tuned around x86.

I don't have any insight into what being architecture-naive on the
other architectures might be, but, for Itanium, you have to start with
deep insight into the code in order to get a payback on all the fancy
bells and whistles. Itanium should be getting more instructions per
clock, not significantly fewer (that *was* the idea, wasn't it?).
Even with respect to the other architectures, it's only in the pack.
Once you're past the source code and information you can preserve from
it in intermediate representations, you have an expensive space
heater.

I just happened to have your charts fresh in mind when I made the
comment, and neither your results nor the fact that binary translation
doesn't work well is a surprise. My apologies if you feel that I
overinterpreted your numbers and didn't give sufficient credit to your
own analysis.

Robert.
From: Bernd Paysan on
Del Cecchi wrote:
> You could use SOI, no bulk. :-)

There still is a bulk, there is just no substrate, so the bulk is left
floating. The diodes I mentioned are still there, supplying the bulk
when forward biased (this is the well-known effect of SOI to have
variable gate thresholds through charging and discharging the bulk below
the diodes' threshold, unless you add in a real bulk contact like on
stock silicon wafers).

> I don't get the point of the AC. Light bulbs and space heaters are AC
> powered and still dissipate power. What did I miss?

I can't tell you. Andy apparently doesn't care much about the physics
behind integrated circuits, his knowledge stops at the gate level. This
is completely ok for digital design, but I wonder why he makes that sort
of suggestion ;-).

One interesting property of quantum mechanics is that for irreversible
logic, there's a minimum amount of energy that is necessary to make it
happen. Reversible logic does not have this drawback. Therefore,
people investigate reversible logic, even though the actual
components to get that benefit are not in sight (not even carbon nanotube
switches have these properties, even though they are much closer to the
physical limits for irreversible logic). Many people also forget that
quantum mechanics does not properly take changes in the system into
account, and that means that your reversible logic only works with the
predicted low power when the inputs are not changing any more - and this
is just the uninteresting case (the coherent one - changes in the system
lead to decoherence, and thereby to classical physics).
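
The minimum energy for irreversible logic mentioned above is usually
stated as the Landauer bound, k_B * T * ln 2 per erased bit. A minimal
Python sketch of the number, assuming room temperature (300 K), which
is a value chosen for this example rather than one taken from the post:

```python
import math

K_B = 1.380649e-23                 # Boltzmann constant in J/K (exact SI value)
ELECTRON_VOLT = 1.602176634e-19    # joules per electron volt (exact SI value)


def landauer_limit(temperature_kelvin):
    """Minimum energy in joules to erase one bit at the given temperature."""
    return K_B * temperature_kelvin * math.log(2)


if __name__ == "__main__":
    energy = landauer_limit(300.0)  # 300 K is an assumed room temperature
    print(f"{energy:.3e} J per bit, {energy / ELECTRON_VOLT:.4f} eV per bit")
```

At 300 K this works out to about 2.9e-21 J per bit.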

--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/