From: Tim McCaffrey on 1 Oct 2009 19:32
In article <ha2hmu$equ$1(a)smaug.linux.pwf.cam.ac.uk>, nmm1(a)cam.ac.uk says...
>In article <ha2fqp$o3s$1(a)USTR-NEWS.TR.UNISYS.COM>,
>Tim McCaffrey <timcaffrey(a)aol.com> wrote:
>>>Nope. You are thinking at FAR too low a level!
>>The problem is that without understanding the lower level, there is a
>>tendency on the part of most programmers (except people on this newsgroup,
>>of course :) ) to view anything that fits on a single line of C code as
>>having an execution time of one cycle - including any function calls to
>>functions they didn't write.
>>The other annoying thing is that with OOO processors, large L1 caches, and
>>multiple processors on a chip, sometimes they are right.
>That's STILL talking at FAR too low a level!
Well, it is and it isn't (IMNSHO). Any really new architecture should account
for things like latency of memory and I/O devices.
For instance, the PCI bus was architected in the days when a 100ns latency
from an I/O device was well within an order of magnitude of the processor's
cycle time, so doing PCI bus reads to get the device status was OK. With
today's technology it would be FAR better if talking to a PCI I/O device was
strictly a push model, with advance cache coherency/update enabled on the
processors' side (I think Core i7 has this now). But too many hardware
designers still assume a PCI bus read is No Big Deal, and force the device
driver writer to do multiple reads across the bus to get anything done. To
change this requires, of course, OS support (MSI/MSI-X interrupt support).
And, as you would say, that is only one example where low level considerations
have a big impact on the architecture's performance. I've seen too many
designs where the designer(s) assume that if it is done in hardware it is
instantaneous. I mean, pick apart all the things IA does
wrong/badly/inefficiently and figure out a better way, or a way to do without
it. You probably will not use the result, but it can show you some mistakes
to avoid.
From: Robert Myers on 2 Oct 2009 00:04
On Oct 1, 9:39 pm, Andrew Reilly <andrew-newsp...(a)areilly.bpc-
> Fair enough. (Nice paraphrase, btw!)
> I suspect that our difference of
> opinion comes from the "level" that one might like to be doing the
> experimentation/tuning. You seem to be arguing that we'll only make
> forward progress if we use languages/tools that expose the exact hardware
> semantics so that we can arrange our applications to suit. That may very
> well be the right answer, but it's not one that I like the sound of. I
> would vastly prefer to be able to describe application parallelism in
> something approaching a formalism, sufficiently abstract that it will be
> able to both withstand generations of hardware change and be amenable to
> system tuning. Quite a bit of that sort of tuning is likely to be better
> in a VM or dynamic compilation environment because there's some scope for
> tuning strategies and even (locking) algorithms at run-time.
We are in violent agreement. Nothing in the field ever seems to
happen that way. If there were a plausible formalism that looked like
it would stick, I think it would make a big difference, but that's the
kind of bias I have that Nick snickers at. Short of that, I'd prefer
that the tinkering be done before any metal is cut. Fat chance, I know.
From: "Andy "Krazy" Glew" on 2 Oct 2009 00:56
Robert Myers wrote:
> On Sep 30, 10:22 am, "Andy \"Krazy\" Glew" <ag-n...(a)patten-glew.net>
>> This is the ISA designer's equivalent of "First, do no harm."
> Some of these conversations really confuse me.
> In your "hardbound/softbound" thread, it was concluded (I thought)
> that giving up a factor of two in execution speed was no big deal.
Some of the responders to that thread said they would be happy to give
up 2x performance to get a near-guarantee of no binary code injection
via buffer overflow security holes.
Others said they wouldn't.
Some said they had no bugs.
And Nick and Wilco said that it could not be done, or was of no value.
Myself, I refer people to the HardBound and SoftBound papers, which quote
a much lower performance cost than 2x - more like 1.15x. I think that is
a reasonable tradeoff.
> How much do you have to give up to present almost any ISA you want by
> way of a virtual machine?
To present an ISA that is doing the same work - with good binary
translation or cross compilation, you give up little.
But, we aren't talking about doing the same work. We are talking about
doing different work. E.g. an ISA feature that does something like
creating cache snoopers for N memory addresses. E.g. transactional
memory. Anything like that, that involves snooping, is considerably
more hardware. E.g. doing AES encryption on every cache line going to
main memory. It may look like very little extra cost - for the snoopers,
because they are "in parallel"; for memory encryption, because big L3
caches reduce frequency of the operation, and because OOO tolerates the
extra latency. But for "the simplest possible implementation" the costs
are much higher.
Yes, HardBound comes right up to the edge of violating this "First, do
no harm" rule.
HardBound is especially attractive on an OOO machine, because the extra
latencies involved tend to get hidden.
HardBound is less attractive on the simplest possible implementation,
in-order, non-speculative. Less opportunity to latency hide.
But, observe: HardBound is identical to doing the same work in software
in the simplest possible implementation: in-order, non-pipelined.
For an unlimited width OOO superscalar, HardBound is again identical to
software. But for a finite width OOO superscalar, HardBound is faster
than software, since it requires less instruction bandwidth.
And for a simple (but not simplest) implementation - in-order, 1-wide -
HardBound again beats software.
Thus, HardBound does no harm compared to software doing the same work,
at the endpoints of the design space. And wins in the middle.
And HardBound does almost no harm in the middle of the design space,
compared to software that does LESS work, because it has no checking.
That is about as close as you come to a win in the "First, do no harm" contest.
From: Robert Myers on 2 Oct 2009 01:37
On Oct 2, 12:56 am, "Andy \"Krazy\" Glew" <ag-n...(a)patten-glew.net>
> That is about as close as you come to a win in the "First, do no harm" contest.
A virtual machine doesn't have to be stupid, though. Or, rather, the
user of the virtual machine doesn't have to be stupid. All the
unnecessary software cruft can be pushed off the stage like movable
scenery when it's not helpful, or at least when it's actively
harmful. That's one nice thing about virtual machines. They can
mimic reconfigurable hardware.
It sounds like the proposals are too expensive no matter how they are
implemented, hard or soft.
From: nmm1 on 2 Oct 2009 03:09
In article <4AC587E5.2020600(a)patten-glew.net>,
Andy \"Krazy\" Glew <ag-news(a)patten-glew.net> wrote:
>Robert Myers wrote:
>>> This is the ISA designer's equivalent of "First, do no harm."
>> Some of these conversations really confuse me.
>> In your "hardbound/softbound" thread, it was concluded (I thought)
>> that giving up a factor of two in execution speed was no big deal.
>Some of the responders to that thread said they would be happy to give
>up 2x performance to get a near-guarantee of no binary code injection
>via buffer overflow security holes.
That is true.
>And Nick and Wilco said that it could not be done, or was of no value,
That is not true. Yes, it can be done, and I said I had seen it done
(for a slightly higher factor, but with no hardware assistance).
What we were saying was that it can't be done in C, because there is
no consistent specification of C.