From: Robert Myers on
On Dec 24, 4:34 pm, "Andy \"Krazy\" Glew" <ag-n...(a)patten-glew.net>
wrote:
> I'm enjoying reading Hot Chips presentations.
>
> I'm - happy? chuffed? not surprised? interested? - to see one of the
> last bastions of RISC fall down.  Fujitsu has added an instruction
> prefix.  Albeit a 32-bit instruction prefix, not an 8-bit prefix like
> the AMD x86-64 REX byte.  But same idea: a new register for the prefix state.
>
> Also specifies extended opcodes for new instructions.
>
> Pardon the mess, but I'll just cut and paste the text from the slide:
>
> Large register sets 2/2
>
> Instruction format for 256 FP registers
>
> 8-bit x 4 (3 read + 1 write) register-number fields are necessary for the
> FMA (Floating-point Multiply and Add) instruction.
>
> But SPARC-V9 instruction length is fixed at 32 bits.
>
> Defined a new prefix instruction (SXAR) to specify the upper 3 bits of the
> register numbers of the following two instructions.
>
> SXAR    (carries the upper 3 bits x 4 for each covered instruction)
> inst1   (carries the lower 5 bits x 4)
> inst2   (carries the lower 5 bits x 4)
>
> SXAR (Set XAR) instruction
>
> XAR: Extended Arithmetic Register
> • Set by the SXAR instruction
> • Valid bit is cleared once the corresponding subsequent instruction
> gets executed.
>
> Operand fields of SXAR (a bit-layout figure in the original slide;
> fields for the first covered instruction sit in bits 31..16, fields
> for the second in bits 15..0):
>
> fv, fsimd, furd, furs1, furs2, furs3   (first instruction)
> sv, ssimd, surd, surs1, surs2, surs3   (second instruction)
>
> (Field naming: "furs1" = First-instruction Upper Register Source-1
> bits, and so on.)
>
> SXAR1: set XAR for the following one instruction.
>
> SXAR2: set XAR for the following two instructions.
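
To make the mechanism concrete, here is a minimal sketch in C (mine,
not from the slide; the exact XAR layout is an assumption) of how a
decoder could splice the latched upper bits onto the 5-bit register
fields of a covered instruction:

/* Sketch of SXAR-style register-number extension.  The XAR holds a
 * valid bit plus 3-bit "upper" fields for each of the four register
 * specifiers; the covered instruction supplies the lower 5 bits. */
#include <stdint.h>
#include <stdbool.h>

struct xar {
    bool    valid;      /* cleared once the covered instruction executes */
    uint8_t upper[4];   /* 3-bit extensions for rd, rs1, rs2, rs3 */
};

/* Form the full 8-bit register number from the XAR and the 5-bit
 * field already extracted from the instruction word.  Without a
 * valid XAR, only registers 0-31 are reachable. */
static uint8_t reg_number(const struct xar *x, int field, uint32_t field5)
{
    uint8_t lower = field5 & 0x1f;
    if (!x->valid)
        return lower;
    return (uint8_t)(((x->upper[field] & 0x7) << 5) | lower);
}

With the valid bit cleared as covered instructions execute, SXAR1
covers one following instruction and SXAR2 covers two, as the slide
says.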

Does anyone still care about SPARC? If they do, that would be the
real news.

Robert.
From: MitchAlsup on
Having built several each of RISCs and CISCs, I find this totally
UNsurprising.

Overall, it adds only about 16 gates of total pipeline delay to
have byte-level instruction lengths. Doing it at the word level
cannot add even that many gates. That is, if a RISC design has 70
gates of fall-through delay, an x86 will have but 85-86. The fact
that the typical x86 has 256 gates of fall-through delay cannot be
blamed on the instruction set! (But I digress)

In addition, last year I did some consulting with a company that was
considering adding a "payload" instruction to its instruction set. The
payload instruction carried a number of bits that other instructions
(already defined) could consume. You might use such a feature to carry
some more addressing bits, some register-specifying bits, or some
instruction-set-expanding bits. The payload instruction did not care
how the bits were consumed. (But again, I digress)
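
As a purely hypothetical sketch of that idea (the names, widths, and
single-consumer policy below are my assumptions, not the company's
actual design):

/* A "payload" prefix instruction deposits bits into a holding
 * register; whichever later instruction wants extra address,
 * register, or opcode bits consumes them.  The payload itself does
 * not care how the bits are used. */
#include <stdint.h>
#include <stdbool.h>

struct payload_reg {
    bool     valid;
    uint32_t bits;
};

static void exec_payload(struct payload_reg *p, uint32_t imm)
{
    p->bits  = imm;     /* stash the carried bits */
    p->valid = true;
}

/* A consuming instruction drains the holding register, so one
 * payload feeds exactly one consumer. */
static uint32_t consume_payload(struct payload_reg *p)
{
    if (!p->valid)
        return 0;       /* no payload pending: use default bits */
    p->valid = false;
    return p->bits;
}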

What we are arriving at is a point where we (the microarchitects
and implementers) have exploited all that is exploitable from the
architects of the past (6600, 360-91, 360-85, 360-67) in the
context of general-purpose computing. If one looks at the distance
in time between the 360 ISA introduction and the first RISC ISA
introductions, we have about 20 years. Now, it has been another 20
years and this keg seems tapped out. In order to accrete that last
modicum of performance for that last application someone cares
about, half a zillion instructions are thrown in. This is a sign
that things are not well in architecture-ville.

But, of course, the problem is not even in the instruction set,
and has not been since the 1-million-transistor level. That is, as
long as the instructions that get created exist within the kinds
of data-flow the microarchitecture already supports, adding
instructions is, for all intents and purposes, (almost) free. It
certainly takes more die area to manage the data-flow than to
manage the data-computations, so to a first order, adding
instructions is free (at the large end of processor
microarchitecture). In addition, to a large extent, nobody cares
about the instruction set since compilers got "reasonably good".
As long as the programmer does not have to see the instruction
set, why should the customer care? This new foray into MAC-ville
with 3-operand instructions (sometimes with a 4th operand for the
destination) simply requires the microarchitect to provide
adequate register ports and adequate reservation-station tracking.
As long as this does not break the camel's back, it's OK--not
great, but not worse than OK either. Just plan for it and get on
with life.

So, where are the instructions designed to allow the n-way
multiprocessor to do synchronizations 10X faster than current? (OK, how
about 2X with guaranteed forward progress for at least one thread.)
This is really the kind of breakthrough that the large machines need.
(Where n is greater than 64.) Even the scientific number crunchers
would benefit from better synchronization.

So, where are the new technologies to allow greater bandwidth to
greater memory with lower latency? Say, 1 TB main memories with (say)
100 ns total latency in the average case (OK, maybe 150 ns total
latency with up to 64 nodes accessing the cabinet filled with DRAMs).

Seems to me that too many clever people are doing the processors
(squeezing the last blood from the stone), and too few are doing the
microarchitecture of the rest of the system (adding blood to the
stone).

Merry Christmas

Mitch
From: Mayan Moudgill on
MitchAlsup wrote:

>
> So, where are the instructions designed to allow the n-way
> multiprocessor to do synchronizations 10X faster than current? (OK, how
> about 2X with guaranteed forward progress for at least one thread.)

Synchronization is just one part of the communication between two CPUs;
it's generally followed by a transfer of some amount of data. In many
cases, the data transfer completely dominates the overall
communication, so the cost of the synchronization is in the noise.

Further, synchronization is done at the level of "processes", not
hardware. If a process happens to be swapped out or not yet ready to
synchronize, the wait time for the last process to get to the
synchronization point will dominate the overall cost.

The overall performance impact on the program of improving the hardware
support for synchronization is, IMO, generally going to be
insignificant.

Can you show studies to the contrary?

> This is really the kind of breakthrough that the large machines need.
> (Where n is greater than 64.) Even the scientific number crunchers
> would benefit from better synchronization.

Are there any studies, particularly on non-micro-benchmark codes, that
would quantify this improvement?
From: nmm1 on
In article <NMmdnREyrJlroKvWnZ2dnUVZ_qOdnZ2d(a)bestweb.net>,
Mayan Moudgill <mayan(a)bestweb.net> wrote:
>MitchAlsup wrote:
>
>> So, where are the instructions designed to allow the n-way
>> multiprocessor to do synchronizations 10X faster than current? (OK, how
>> about 2X with guaranteed forward progress for at least one thread.)
>
>Synchronization is just one part of the communication between two CPUs;
>it's generally followed by a transfer of some amount of data. In many
>cases, the data transfer completely dominates the overall
>communication, so the cost of the synchronization is in the noise.

For message-passing codes, perhaps. For the shared-memory parallel
paradigms that are currently trendy, not at all.

>Further, synchronization is done at the level of "processes", not
>hardware. If a process happens to be swapped out or not yet ready to
>synchronize, the wait time for the last process to get to the
>synchronization point will dominate the overall cost.

If you are working with very coarse-grained parallelism, then I agree
hardware instructions are irrelevant.

>The overall performance impact on the program of improving the hardware
>support for synchronization is, IMO, generally going to be
>insignificant.

Don't bet on it. What it does is make it feasible to parallelise
the sort of program where the parallelism cannot be made coarse-grained,
or where there is potentially much more to gain from fine-grained
parallelism.

>Can you show studies to the contrary?

I could, once. I no longer have easy access to the relevant classes
of system.

>> This is really the kind of breakthrough that the large machines need.
>> (Where n is greater than 64.) Even the scientific number crunchers
>> would benefit from better synchronization.
>
>Are there any studies, particularly on non-micro-benchmark codes, that
>would quantify this improvement?

Yes. How many have been published in places you can find them, or
even written up in a form suitable for publication, I don't know. I
know that mine weren't.

Note that the situation involves more than just the synchronisation
operations, because a lot of it is about scheduling. If you are
trying to parallelise code with a 10 microsecond grain, having to do
ANY interaction with the system scheduler runs the risk of a major
problem. That is one of the main reasons that almost all HPC codes
rely on gang scheduling, with all threads running all the time.


Regards,
Nick Maclaren.
From: Mayan Moudgill on
nmm1(a)cam.ac.uk wrote:
> In article <NMmdnREyrJlroKvWnZ2dnUVZ_qOdnZ2d(a)bestweb.net>,
> Mayan Moudgill <mayan(a)bestweb.net> wrote:
>
>>MitchAlsup wrote:
>>
>>
>>>So, where are the instructions designed to allow the n-way
>>>multiprocessor to do synchronizations 10X faster than current? (OK, how
>>>about 2X with guaranteed forward progress for at least one thread.)
>>
>>Synchronization is just one part of the communication between two CPUs;
>>it's generally followed by a transfer of some amount of data. In many
>>cases, the data transfer completely dominates the overall
>>communication, so the cost of the synchronization is in the noise.
>
>
> For message-passing codes, perhaps. For the shared-memory parallel
> paradigms that are currently trendy, not at all.
>
>

So core 1 writes some data, cores 1 and 2 synchronize, and core 2 reads
the data. What actually happens post-synchronization?

Well, cache lines get copied from CPU 1's dcache to CPU 2's dcache. This
takes time, and the time will be proportional to the amount of shared
data. The cost can actually be higher than in an explicit
message-passing system.

The synchronization, by contrast, can involve the transfer of exactly
one cache line [e.g. if you're doing an atomic increment].
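
In C11-atomics terms, a minimal sketch of that handoff (the names and
sizes are illustrative, not from any particular machine):

/* The flag is the one synchronizing cache line; the payload array
 * is the bulk data that actually has to migrate between caches. */
#include <stdatomic.h>
#include <stdint.h>

#define N 4096

static uint64_t payload[N];       /* bulk data: many cache lines   */
static atomic_int ready = 0;      /* sync variable: one cache line */

/* Core 1: write the data, then publish it with a release store. */
void producer(void)
{
    for (int i = 0; i < N; i++)
        payload[i] = (uint64_t)i;
    atomic_store_explicit(&ready, 1, memory_order_release);
}

/* Core 2: spin on the flag (that one line ping-pongs between the
 * caches), then read the data, at which point every payload line
 * has to move over. */
uint64_t consumer(void)
{
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;
    uint64_t sum = 0;
    for (int i = 0; i < N; i++)
        sum += payload[i];
    return sum;
}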

More heavyweight synchronization operations (such as a lock that
suspends the caller if it is already locked) *can* be more expensive -
but the cost is due to all the additional function in the operation.
It's not clear that tweaking the underlying hardware primitives is
going to do much for this.