RISC load-store versus x86 Add from memory. [Computer Architecture]

Prev: Call for benchmarks: proposals by 30 June
Next: Vaporizing dust during chip manufacturing ?

From: Anton Ertl on 21 Jun 2010 08:54

"Ken Hagan" <K.Hagan(a)thermoteknix.com> writes:
>On Fri, 18 Jun 2010 20:28:37 +0100, R. Matthew Emerson <rme(a)clozure.com>
>wrote:
>
>> I gave a lightning talk about this at the recent International Lisp
>> Conference. The slides and one-pager from the proceedings are at
>> http://www.clozure.com/~rme/.
>
>Presumably the orthodox reply is that the micro-architecture is so
>divorced from the ISA that you'd get similar performance from emulating a
>RISCy load-store architecture. Use ESI to point to some "lisp registers",
>EDI to point to the "lisp stack" and the rest for fairly conventional
>purposes. The "registers" will be more or less resident in the L1 cache,
>along with recent parts of the "stack", and the L1 latency is so much
>lower than the memory wall that no-one notices.

That's a theory that is often presented. Unfortunately, it is
falsified in practice. Keeping frequently-used data in memory (even
in L1 cache) instead of in registers has a big cost on the IA-32 and
AMD64 implementations I have measured; in particular, look at Figure 3
of

http://www.complang.tuwien.ac.at/anton/euroforth/ef09/papers/ertl.pdf

The speedup (factor about 1.5) from 0.6.2 to 0.7.0ssc on the IA-32
machines (Xeon 5450, Opteron270, Pentium 4 Northwood, Athlon MP) is
due to allocating some of the virtual-machine registers in
real-machine registers instead of in memory.
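
To make the distinction concrete, here is a toy interpreter written twice: once with its virtual-machine registers in a struct reached through a pointer (close to Ken Hagan's "registers at ESI" idea) and once with the same registers as local variables that the compiler may allocate to machine registers. The opcodes, names and sample program are hypothetical and not taken from the cited paper. Whether a given compiler really keeps the locals in registers, or the struct fields in memory, depends on the compiler and optimisation level, so this shows only the structural difference, not a measured speedup.

```c
#include <stdint.h>

/* Hypothetical toy bytecode: ADD k adds k to the accumulator;
   LOOP t decrements the counter and jumps to t while it is non-zero. */
enum { OP_ADD, OP_LOOP, OP_HALT };

/* Variant 1: VM registers live in memory, every access goes through r. */
struct vm_regs {
    int64_t acc;
    int64_t cnt;
    int pc;
};

static int64_t run_regs_in_memory(const int *code, struct vm_regs *r)
{
    for (;;) {
        switch (code[r->pc]) {
        case OP_ADD:
            r->acc += code[r->pc + 1];
            r->pc += 2;
            break;
        case OP_LOOP:
            r->pc = (--r->cnt != 0) ? code[r->pc + 1] : r->pc + 2;
            break;
        default:                      /* OP_HALT (or unknown opcode) */
            return r->acc;
        }
    }
}

/* Variant 2: the same VM registers as locals, which the compiler is
   free to keep in real-machine registers for the whole loop. */
static int64_t run_regs_in_locals(const int *code, int64_t cnt)
{
    int64_t acc = 0;
    int pc = 0;
    for (;;) {
        switch (code[pc]) {
        case OP_ADD:
            acc += code[pc + 1];
            pc += 2;
            break;
        case OP_LOOP:
            pc = (--cnt != 0) ? code[pc + 1] : pc + 2;
            break;
        default:                      /* OP_HALT (or unknown opcode) */
            return acc;
        }
    }
}

/* Sample program: "add 3" executed once per loop iteration. */
static const int sample_prog[] = { OP_ADD, 3, OP_LOOP, 0, OP_HALT };
```

Both variants compute the same result; with a counter of 10 the sample program returns 30. Only the placement of `acc`, `cnt` and `pc` differs.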

- anton
--
M. Anton Ertl Some things have to be seen to be believed
anton(a)mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html

From: MitchAlsup on 21 Jun 2010 12:09

On Jun 21, 4:20 am, "Ken Hagan" <K.Ha...(a)thermoteknix.com> wrote:
> Mitch/nedbrek: Is this a fair summary of your position?

After giving this way too little thought:

My general inclination would be to put the LISP registers in static
memory (maybe .bss) and address them with absolute memory reference
instructions. Then have the "compiler" figure out which ones were in
what registers and avoid reloading those already present. One of those
LISP registers would be pointing to the LISP stack, and since it is
used basically everywhere, it would end up residing in the machine
registers most of the time anyway. The data cache would keep the
accesses from becoming expensive, and this way you basically get an
unlimited number of usable registers for whatever purposes the LISP
environment needs.

In the case where this is a multithreaded LISP environment, some of these
registers will remain in static store (the global ones) and others will
be migrated to a heap data structure with the active thread carrying
around a pointer to this heap 'register file' in a machine register.
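
A minimal C sketch of the two layouts Mitch describes, under stated assumptions: a Lisp object is a tagged machine word, and there are 32 Lisp registers. All names (`lisp_regs`, `lisp_thread`, `add_to_reg`) are hypothetical. `add_to_reg` stands in for what the "compiler" is supposed to do: load a memory-resident register once, work on it in a local, and store it back once.

```c
#include <stdint.h>
#include <stdlib.h>

/* Assumption: a Lisp object is a tagged machine word. */
typedef uintptr_t lisp_obj;
#define N_LISP_REGS 32

/* Single-threaded case: the register file sits in static storage
   (typically .bss), reachable with absolute addressing on IA-32. */
static lisp_obj lisp_regs[N_LISP_REGS];

/* Multithreaded case: each thread owns a heap-allocated register file;
   generated code would keep the pointer to it in a machine register. */
struct lisp_thread {
    lisp_obj regs[N_LISP_REGS];
};

static struct lisp_thread *lisp_thread_new(void)
{
    return calloc(1, sizeof(struct lisp_thread));   /* all registers 0 */
}

/* Load register r once, update it in a local (a machine-register
   candidate), and store it back once instead of on every use. */
static void add_to_reg(lisp_obj *file, int r, lisp_obj delta, int times)
{
    lisp_obj cached = file[r];          /* one load  */
    for (int i = 0; i < times; i++)
        cached += delta;
    file[r] = cached;                   /* one store */
}
```

The same helper works on either register file, since both are just arrays of `lisp_obj`; only where the array lives, and how code finds it, changes.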

But there are many subtleties involved, and some history from the people
who created the original port may completely outweigh my preconceived
notions.

The one thing I did take away from the presentation mentioned above,
is that there were way too many constants in registers in the original
incarnation. This is one of those things that happens when lots of
registers are available (essentially) for free. For machines like
IA32, one should essentially migrate the registerized data into the
data cache, and then decorate the 6 remaining registers with data
actively being manipulated, preallocating very few for LISP
environmental data.

Mitch

From: Terje Mathisen "terje.mathisen at on 22 Jun 2010 03:53

MitchAlsup wrote:
> The one thing I did take away from the presentation mentioned above,
> is that there were way too many constants in registers in the original
> incarnation. This is one of those things that happens when lots of
> registers are available (essentially) for free. For machines like
> IA32, one should essentially migrate the registerized data into the
> data cache, and then decorate the 6 remaining registers with data
> actively being manipulated, preallocating very few for LISP
> environmental data.

I agree, the key issues seemed to be a large number of immediate values
(easily loaded from L1 when needed, particularly in the load-op
instruction format) and all the various pointers to LISP-specific data
areas.

In a multi-threaded implementation you need to communicate via memory
variables anyway, in which case the most important consideration is to
make sure all shared data structures that must be updated by multiple
threads are located in separate cache lines.
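
One common way to get that separation in C11 is `alignas` on a per-thread slot, sketched below. It assumes 64-byte cache lines, which is typical of current x86 parts but should be checked for the actual target. `worker_stats`, `record_hit` and `N_WORKERS` are hypothetical names.

```c
#include <assert.h>
#include <stdalign.h>

/* Assumption: 64-byte cache lines. */
#define CACHE_LINE 64
#define N_WORKERS  4

/* alignas() starts each counter on its own cache line, and the struct
   size is rounded up to a full line, so threads updating neighbouring
   counters do not keep stealing one shared line ("false sharing"). */
struct line_counter {
    alignas(CACHE_LINE) unsigned long value;
};

/* Hypothetical shared structure: one slot per worker thread. */
struct worker_stats {
    struct line_counter hits[N_WORKERS];
};

static_assert(sizeof(struct line_counter) == CACHE_LINE,
              "each counter should occupy exactly one cache line");

/* Called only by thread `worker`, so it touches only its own line. */
static void record_hit(struct worker_stats *s, int worker)
{
    s->hits[worker].value++;
}
```

For heap-allocated instances, plain `malloc` only guarantees alignment suitable for `max_align_t`, so `aligned_alloc` (C11) would be needed to keep the same guarantee.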

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

From: Andrew Reilly on 22 Jun 2010 04:33

On Sun, 20 Jun 2010 10:39:15 +0100, nmm1 wrote:
> So does Fortran and most comparable languages. On
> a better implementation, you would get a run-time error, of course.

Are x86'en capable of faulting on integer overflow? I suppose that they
must be, given that some languages mandate that behaviour. Since there's
only one add instruction that handles signed and unsigned, I wonder how
that works?
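
On the single-ADD question: in two's complement the result bits are the same for signed and unsigned operands, so x86 has one ADD that sets two flags. CF reports unsigned overflow (carry out of the top bit) and OF reports signed overflow; the code generated for each language tests whichever flag it needs (for example JC/JB for unsigned, JO for signed). 32-bit code can also use INTO, which traps when OF is set, but INTO is not valid in 64-bit mode. The C model below is only an illustration of the flag semantics; `x86_add32` and `add_result` are hypothetical names.

```c
#include <stdbool.h>
#include <stdint.h>

/* What one 32-bit ADD produces: a single sum plus the two flags. */
struct add_result {
    uint32_t sum;
    bool cf;   /* carry flag: unsigned result did not fit */
    bool of;   /* overflow flag: signed result did not fit */
};

static struct add_result x86_add32(uint32_t a, uint32_t b)
{
    struct add_result r;
    r.sum = (uint32_t)(a + b);          /* unsigned wrap-around is defined */
    r.cf = r.sum < a;                   /* carry out iff the sum wrapped   */
    /* Signed overflow: operands have the same sign bit and the result's
       sign bit differs from it. */
    uint32_t ov = ~(a ^ b) & (a ^ r.sum);
    r.of = (ov >> 31) != 0;
    return r;
}
```

For example, 0x7FFFFFFF + 1 sets OF but not CF (fine as unsigned, overflow as signed), while 0xFFFFFFFF + 1 sets CF but not OF (that is -1 + 1 as signed).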

> Eh? That code's erroneous - wrong, buggy, defective.

Historical artifact. It's been in use for many years. Just broke with
the new/changed compiler.

> You shouldn't
> write code that overflows in C or Fortran. End of story. Sorry, but
> that is the situation, both de jure and de facto.

Unfortunately, that's not an answer I can make any use of. Signal
processing with fixed point arithmetic generally requires maximising SNR
(left-aligning values to the greatest extent possible) in algorithms that
often have unlikely extreme-range conditions. Generally, it is
preferable to clip rather than wrap around, which is why all processors designed
for the purpose have saturating signed addition modes (or in-register
headroom and saturating store modes). While that makes for neat per-
processor optimised code, it's nice to have a vanilla-C fallback that
does the right thing, as a reference. That is now considerably uglier
than it needs to be/used to be.
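
A vanilla-C reference for the common 16-bit case might look like the sketch below, assuming Q15 samples stored in `int16_t`. Widening to `int32_t` makes the sum exact, so no undefined overflow occurs, and the result is clipped rather than wrapped. `sat_add_q15` is a hypothetical name.

```c
#include <stdint.h>

/* Saturating add of two Q15 samples: compute exactly in a wider type,
   then clip to the int16_t range instead of wrapping around. */
static int16_t sat_add_q15(int16_t a, int16_t b)
{
    int32_t s = (int32_t)a + (int32_t)b;   /* |s| <= 65536, cannot overflow */
    if (s > INT16_MAX)
        return INT16_MAX;
    if (s < INT16_MIN)
        return INT16_MIN;
    return (int16_t)s;
}
```

The widening trick only works while a wider type exists; for the widest integer type the limits have to be checked before adding instead.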

Unfortunately-II: C is essentially the only language that one can count
on being available, in some form. I don't fancy joining the increasingly-
long line of folk who have created their own "fixed" C, but it seems that
that's not entirely outside the realms of necessity, at some stage.

Cheers,

--
Andrew

From: nmm1 on 22 Jun 2010 04:55

In article <88baqtFcceU1(a)mid.individual.net>,
Andrew Reilly <areilly---(a)bigpond.net.au> wrote:
>On Sun, 20 Jun 2010 10:39:15 +0100, nmm1 wrote:
>> So does Fortran and most comparable languages. On
>> a better implementation, you would get a run-time error, of course.
>
>Are x86'en capable of faulting on integer overflow? I suppose that they
>must be, given that some languages mandate that behaviour. Since there's
>only one add instruction that handles signed and unsigned, I wonder how
>that works?

Yes. It used to be easier, but has been made harder. If all else
fails, the compiler can generate code that does the check explicitly.
Been there - done that.
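
Today the explicit check need not be hand-written. GCC and Clang provide `__builtin_add_overflow` (a compiler extension, not ISO C), and C23 standardises the same idea as `ckd_add` in `<stdckdint.h>`. On x86 compilers typically turn it into an ADD followed by a branch on the overflow flag. The sketch assumes a GCC- or Clang-compatible compiler; `checked_add` and `checked_sum` are hypothetical names.

```c
#include <limits.h>
#include <stdbool.h>

/* Returns true and stores a + b in *out if the exact sum fits in int;
   returns false on overflow. __builtin_add_overflow returns true when
   the result did NOT fit. */
static bool checked_add(int a, int b, int *out)
{
    return !__builtin_add_overflow(a, b, out);
}

/* Sum an array, reporting overflow as an error instead of wrapping,
   in the spirit of a checking compiler's run-time error. */
static bool checked_sum(const int *v, int n, int *out)
{
    int acc = 0;
    for (int i = 0; i < n; i++)
        if (!checked_add(acc, v[i], &acc))
            return false;             /* overflow: caller handles the error */
    *out = acc;
    return true;
}
```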

>> Eh? That code's erroneous - wrong, buggy, defective.
>
>Historical artifact. It's been in use for many years. Just broke with
>the new/changed compiler.

It's been broken for years, but the bug has only just exposed itself.
A very common effect.

>> You shouldn't
>> write code that overflows in C or Fortran. End of story. Sorry, but
>> that is the situation, both de jure and de facto.
>
>Unfortunately, that's not an answer I can make any use of.

Actually, you could, but it's tedious and inefficient. Implementing
saturating arithmetic using only correct C or Fortran isn't hard,
just painful. And your compiler may not be up to optimising it
well enough.
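
For the case where no wider type is available, here is a sketch of saturating `int` addition in strictly conforming C: the limits are tested before adding, so the addition itself can never overflow. `sat_add_int` is a hypothetical name; the extra comparisons are exactly the "tedious and inefficient" part.

```c
#include <limits.h>

/* Saturating int add without relying on overflow behaviour: each
   comparison is itself overflow-free (INT_MAX - b for b > 0,
   INT_MIN - b for b < 0). */
static int sat_add_int(int a, int b)
{
    if (b > 0 && a > INT_MAX - b)
        return INT_MAX;               /* would overflow upward: clip */
    if (b < 0 && a < INT_MIN - b)
        return INT_MIN;               /* would overflow downward: clip */
    return a + b;                     /* in range: exact result */
}
```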

>Unfortunately-II: C is essentially the only language that one can count
>on being available, in some form. I don't fancy joining the increasingly-
>long line of folk who have created their own "fixed" C, but it seems that
>that's not entirely outside the realms of necessity, at some stage.

Right.

That is the excuse for the WG14 wild-eyes who want to perpetrate
TR 18037. While there is nothing wrong with adding saturating
arithmetic, fixed point, or both - in theory - C99's arithmetic model
is already ghastly almost beyond belief, and that complicates it by a
MASSIVE factor. And that's ignoring the other strand to that TR.

Regards,
Nick Maclaren.
