From: "Andy "Krazy" Glew" on
nmm1(a)cam.ac.uk wrote:
>> Since integration is inevitable as well as obvious, inevitably we
>> will have more than one cache coherent domains on chip, which are PGAS
>> or MPI non-cache coherent between the domains.
>
> Extremely likely - nay, almost certain. Whether those domains will
> share an address space or not, it's hard to say. My suspicion is
> that they will, but there will be a SHMEM-like interface to them
> from their non-owning cores.

I'm using PGAS as my abbreviation for "shared memory, shared address
space, but not cache coherent, and not memory ordered". I realize,
though, that some people consider Cray SHMEM different from PGAS. Can
you suggest a more generic term?

Hmmm... "shared memory, shared address space, but not cache coherent,
and not memory ordered"
SM-SAS-NCC-NMO?
No, needs a better name.

--

Let's see, if I have it right,

In strict PGAS (Private/Global Address Space) there are only two forms
of memory access:
1. local private memory, inaccessible to other processors
2. global shared memory, accessible by all other processors,
although implicitly accessible everywhere the same. Not local to anyone.

Whereas SHMEM allows more types of memory accesses, including
a. local memory, that may be shared with other processors
b. remote accesses to memory that is local to other processors
as well as remote access to memory that isn't local to anyone.
And potentially other memory types.

--

Some people seem to assume that PGAS/SHMEM imply a special type of
programmatic memory access. E.g. Kathy Yelick, in one of her SC09
talks, said "PGAS gives programmers access to DMA controllers."

Maybe often so, but tain't necessarily so. There are several different
ways of "binding" such remote memory accesses to an instruction set so
that a programmer can use them, including:

The first two do not involve changes to the CPU microarchitecture:
a) DMA-style
b) Prefetch-style
The last involves making the CPU aware of remote memory
c) CPU-remote-aware


a) DMA-style - ideally user level, non-privileged, access to something
like a DMA engine. The main question is, how do you give user level
access to a DMA engine? Memory mapped command registers?
(Virtualization issues.) Queues? (Notification issues: e.g. interrupt
on completion? Not everyone has user level interrupts, and even though
x86 does, they are not frequently used.)
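
To make this concrete, here is a rough C sketch of a user level command
ring; the descriptor layout, ring, and doorbell register are all made up
for illustration, not any real engine's interface:

#include <stdint.h>
#include <stdatomic.h>

/* Hypothetical descriptor for one remote-memory transfer. */
struct pgas_desc {
    uint64_t remote_addr;    /* PGAS address on the remote node       */
    uint64_t local_addr;     /* local buffer to copy to/from          */
    uint32_t length;         /* bytes to transfer                     */
    uint32_t flags;          /* e.g. READ vs WRITE, interrupt-on-done */
    _Atomic uint32_t status; /* engine writes DONE/ERROR here         */
};

/* Hypothetical per-process command ring, memory-mapped so the DMA
 * engine can read it without a system call. */
struct pgas_ring {
    struct pgas_desc slots[256];
    _Atomic uint32_t head;       /* producer (user code) index  */
    volatile uint32_t *doorbell; /* memory-mapped "go" register */
};

/* Post a read of 'len' bytes from a remote address into a local buffer.
 * Returns the slot index so the caller can poll for completion. */
static uint32_t pgas_post_read(struct pgas_ring *r,
                               uint64_t remote, void *local, uint32_t len)
{
    uint32_t slot = atomic_fetch_add(&r->head, 1) % 256;
    struct pgas_desc *d = &r->slots[slot];
    d->remote_addr = remote;
    d->local_addr  = (uint64_t)(uintptr_t)local;
    d->length      = len;
    d->flags       = 0;          /* READ, no interrupt-on-completion */
    atomic_store(&d->status, 0); /* not yet complete                 */
    *r->doorbell = slot;         /* kick the engine                  */
    return slot;
}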

b) Prefetch-style - have the programmer issue a prefetch, somehow.
Later, the programmer performs the access; if the prefetch has
completed, it succeeds. (Notification issues.)


Could be a normal prefetch instruction, that somehow bypasses the CPU
cache prefetch logic (e.g. because of address range.)

Or, the prefetch could be something like an uncached, UC, store:
UC-STORE
to: magic-address
data: packet containing PGAS address Aremote you want to load from

plus maybe a few other things in the store data packet - length, stride,
etc. Plus maybe the actual store data.


Later, you might do a load.

Possibly a real load: UC-LOAD from: PGAS address Aremote

or possibly a fake load, with a transformed address:
UC-LOAD hash(Aremote)


The load result may contain flags that indicate success / failure / not
yet arrived.
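
A rough C sketch of this prefetch-style binding; the command window, the
result window, and the flag encodings are all hypothetical:

#include <stdint.h>

/* Hypothetical UC-mapped command window: a store here carries a packet
 * describing the remote access, not ordinary data. */
struct pgas_prefetch_cmd {
    uint64_t remote_addr; /* Aremote: the PGAS address to fetch from */
    uint16_t length;      /* bytes requested                         */
    uint16_t stride;      /* optional stride for strided fetches     */
    uint32_t tag;         /* lets the later load find the result     */
};

#define PGAS_READY 0x1u   /* data has arrived      */
#define PGAS_ERROR 0x2u   /* remote access faulted */

/* Result record read back later; flags == 0 means not yet arrived. */
struct pgas_prefetch_result {
    uint32_t flags;
    uint32_t pad;
    uint64_t data;
};

/* 'cmd_window' and 'result_window' would be UC mappings set up by the
 * OS or runtime; any hash(Aremote) transform is folded into whichever
 * result_window the caller passes in. */
static inline void pgas_prefetch(volatile struct pgas_prefetch_cmd *cmd_window,
                                 uint64_t remote_addr, uint32_t tag)
{
    struct pgas_prefetch_cmd c = { remote_addr, 8, 0, tag };
    *cmd_window = c;                                  /* the UC-STORE */
}

static inline int pgas_try_load(volatile struct pgas_prefetch_result *result_window,
                                uint64_t *out)
{
    struct pgas_prefetch_result r = *result_window;   /* the UC-LOAD */
    if (!(r.flags & PGAS_READY))
        return 0;                                     /* not yet arrived */
    *out = r.data;
    return 1;
}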



Life would be particularly nice if your instruction set had operations
that allowed you to write out a store address and a data packet, and
then read from the same location, atomically. Yes, atomic RMWs. Like
in PCIe. Like in the processor CMPXCHG type instructions.
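
If you have that, the post-then-poll pair above collapses into a single
operation. A minimal sketch, assuming a 64-bit UC-mapped command/response
slot and using the GCC/Clang atomic-exchange builtin; whether the fabric
actually carries it as one transaction is up to the hardware:

#include <stdint.h>

/* Hypothetical: an atomic exchange on a UC-mapped slot deposits the
 * command (here, just the remote address) and returns the device's
 * current response word in one shot, in the spirit of PCIe atomics or
 * CMPXCHG-style instructions. */
static inline uint64_t pgas_cmd_exchange(volatile uint64_t *slot,
                                         uint64_t remote_addr)
{
    return __atomic_exchange_n(slot, remote_addr, __ATOMIC_SEQ_CST);
}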


But, the big cost in all of this is that you probably need to make the
operations involved be UC, uncached. And, because x86 has only one UC
memory type, used for legacy I/O, it is not optimized for the usage
models that PGAS/SHMEM expect.




c) Finally, one could make the CPU aware of PGAS/SHMEM remote accesses.
Possibly as new instructions. Or, possibly as a new memory type.

Now, it is received wisdom that x86 can't add new memory types - no more
page table bits; we'd rather add new instructions. I think this is bogus.

However, I have always liked the idea of being able to specify the
memory type on a per instruction basis. E.g. in x86, having a new
prefix applicable to memory instructions that says "The type of this
memory access is ...REMOTE-ordinary-memory..." Probably with combining
rules for the page tables and MTRR memory types.

If you come from another instruction set, think of something like Sun
SPARC's alternate address spaces (ASIs).

In either case, possibly with the new memory type as a literal field in
the instruction, or possibly coming from a small set of registers.


If you allow normal memory instructions to access remote memory, and
then just use a memory type, then you could use the same libraries for
both local and remote: e.g. the same linked list routine could work in
both - assuming it made no memory ordering assumptions that hold for
local memory but not for remote memory.
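
For example, a list walk like the sketch below relies only on
data-dependent loads, so in principle the same code could run over a
local mapping or over a (hypothetical) remote memory type:

#include <stddef.h>

/* The same node layout is used whether the list lives in local DRAM or
 * in a remote PGAS region; only the memory type of the mapping
 * (selected per page, or per instruction via a prefix) would differ. */
struct node {
    struct node *next;
    long         key;
};

/* Walk the list looking for 'key'. Nothing here assumes any ordering
 * beyond "load the pointer before dereferencing it", which is why it
 * could work over a non-memory-ordered remote mapping. */
static struct node *list_find(struct node *head, long key)
{
    for (struct node *n = head; n != NULL; n = n->next)
        if (n->key == key)
            return n;
    return NULL;
}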

Is this worth doing?


I think that it is always a good idea to have the DMA-style or
prefetch-style interfaces. Particularly on a RISC ISA that has no block
instructions like REP MOVS. Also if one wants to add remote-access
operations that have no local-memory equivalent.

But the a) DMA-style and b) prefetch-style interfaces are probably
slower, for small accesses, on many common implementations. We can more
aggressively optimize the c) CPU-remote-aware approach.

Conversely, if you don't need that performance, you can always implement
the CPU-remote-aware style in terms of the other two.

From: "Andy "Krazy" Glew" on
nmm1(a)cam.ac.uk wrote:
>> Since integration is inevitable as well as obvious, inevitably we
>> will have more than one cache coherent domains on chip, which are PGAS
>> or MPI non-cache coherent between the domains.
>
> Extremely likely - nay, almost certain. Whether those domains will
> share an address space or not, it's hard to say. My suspicion is
> that they will, but there will be a SHMEM-like interface to them
> from their non-owning cores.

Actually, it's not an either/or choice. There aren't just two points on
the spectrum. We have already mentioned three, including the MPI space.
I like thinking about a few more:


1) SMP: shared memory, cache coherent, a relatively strong memory
ordering model like SC or TSO or PC. Typically writeback cache.

0) MPI: no shared memory, message passing

0.5) PGAS: shared memory, non-cache coherent. Typically UC, with DMA as
described in other posts.

0.9) SMP-WC: shared memory, cache coherent, a relatively weak memory
ordering model like RC or WC. Typically writeback cache.

0.8) ... with WT, writethrough, caches. Actually, it becomes a partial
order: there's WT-PC, and WT-WC.

0.7) SMP-WB-SWCO: non-cache-coherent, WB (or WT), with software managed
cache coherency via operations such as cache flushes.

I am particularly intrigued by the possibility of

0.6) SMP-WB-bitmask: non-cache-coherent. However, "eventually
coherent". Track which bytes have been written by a bitmask per cache
line. When evicting a cache line, evict with the bitmask, and
write-back only the written bytes. (Or words, if you prefer).

What I like about this is that it avoids one of the hardest aspects of
non-cache-coherent systems: (a) writes can disappear - not just be
observed in a different order, but actually disappear, with the old data
reappearing - because (b) write-back is done at cache line granularity.

Tracking bitmasks in this way means that you will never lose writes.

You may not know what order they get done in. There may be no global
order.

But you will never lose writes.
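
A rough sketch of the bookkeeping, assuming 64-byte lines and per-byte
tracking, just to make "merge only the written bytes" concrete:

#include <stdint.h>
#include <string.h>

#define LINE_SIZE 64            /* assumed 64-byte cache lines */

/* Hypothetical cache line with a per-byte dirty bitmask. */
struct bitmask_line {
    uint8_t  data[LINE_SIZE];
    uint64_t written;           /* bit i set => data[i] was stored to */
};

/* Record a store of 'len' bytes at offset 'off' within the line
 * (off + len assumed <= LINE_SIZE). */
static void line_store(struct bitmask_line *l, unsigned off,
                       const void *src, unsigned len)
{
    memcpy(&l->data[off], src, len);
    for (unsigned i = 0; i < len; i++)
        l->written |= 1ull << (off + i);
}

/* On eviction, merge only the written bytes back to memory, so stores
 * made by other (non-coherent) caches to the untouched bytes of the
 * line are never overwritten with stale data. */
static void line_writeback(const struct bitmask_line *l, uint8_t *mem)
{
    for (unsigned i = 0; i < LINE_SIZE; i++)
        if (l->written & (1ull << i))
            mem[i] = l->data[i];
}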


While we are at it

1.1) SMP with update cache protocols.




===



Sorting these according to "strength" - although, as I say above, there
are really some divergences, it is a partial order or lattice:

1.1) SMP with update cache protocols.

****
1) SMP: shared memory, cache coherent, a relatively strong memory
ordering model like SC or TSO or PC. Typically writeback cache.

0.9) SMP-WB-weak: shared memory, cache coherent, a relatively weak
memory ordering model like RC or WC. Typically writeback cache.

0.8) ... with WT, writethrough, caches.



0.7) SMP-WB-SWCO: non-cache-coherent, WB (or WT), with software managed
cache coherency via operations such as cache flushes

0.65) .. with WT

****???????
0.6) SMP-WB-bitmask: non-cache-coherent. However, "eventually
coherent". Track which bytes have been written by a bitmask per cache
line. When evicting a cache line, evict with the bitmask, and
write-back only the written bytes. (Or words, if you prefer).

0.55) ... with WT

****
0.5) PGAS: shared memory, non-cache coherent. Typically UC, with DMA as
described in other posts.

****
0) MPI: no shared memory, message passing




I've marked the models that I think are likely to be most important.

I think SMP-WB-bitmask is more likely to be important than the weak
models 0.7 and 0.9,
in part because I am in love with new ideas
but also because I think it scales better.

It provides the performance of conventional PGAS, but supports cache
locality when it is present. And poses none of the semantic challenges
of software managed cache coherency, although it has all of the same
performance issues.


Of course, it needs roughly 64 bits per cache line. Which may be enough
to kill it in its tracks.
From: Terje Mathisen on
Andy "Krazy" Glew wrote:
[interesting spectrum of distributed memory models snipped]
> I think SMP-WB-bitmask is more likely to be important than the weak
> models 0.7 and 0.9,
> in part because I am in love with new ideas
> but also because I think it scales better.
>
> It provides the performance of conventional PGAS, but supports cache
> locality when it is present. And poses none of the semantic challenges
> of software managed cache coherency, although it has all of the same
> performance issues.
>
>
> Of course, it needs roughly 64 bits per cache line. Which may be enough to
> kill it in its tracks.

Isn't this _exactly_ the same as the current setup on some chips that
use 128-byte cache lines, split into two sectors of 64 bytes each?

I.e. an effective cache line size that is smaller than the "real" line
size, taken to its logical end point.

I would suggest that (as you note) register-size words are the smallest
items you might need to care about and track, so 8 bits for a 64-bit
platform with 64-byte cache lines, but most likely you'll have to
support semi-atomic 32-bit operations, so 16 bits, which is about a 3%
overhead.
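
Spelling out the arithmetic (assuming 64-byte, i.e. 512-bit, lines):

#include <stdio.h>

/* Per-line tracking overhead, in percent, for a given granularity. */
static double mask_overhead_pct(unsigned line_bytes, unsigned gran_bytes)
{
    unsigned mask_bits = line_bytes / gran_bytes;
    return 100.0 * mask_bits / (line_bytes * 8);
}

int main(void)
{
    printf("per 64-bit word: %.1f%%\n", mask_overhead_pct(64, 8)); /* 1.6  */
    printf("per 32-bit word: %.1f%%\n", mask_overhead_pct(64, 4)); /* 3.1  */
    printf("per byte:        %.1f%%\n", mask_overhead_pct(64, 1)); /* 12.5 */
    return 0;
}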

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
From: nmm1 on
In article <4B270CA7.9060508(a)patten-glew.net>,
Andy "Krazy" Glew <ag-news(a)patten-glew.net> wrote:
>
>I'm using PGAS as my abbreviation for "shared memory, shared address
>space, but not cache coherent, and not memory ordered". I realize,
>though, that some people consider Cray SHMEM different from PGAS. Can
>you suggest a more generic term?

No, but that isn't what PGAS normally means. However, no matter.

>Let's see, if I have it right,
>
>In strict PGAS (Private/Global Address Space) there are only two forms
>of memory access:
> 1. local private memory, inaccessible to other processors
> 2. global shared memory, accessible by all other processors,
>although implicitly accessible everywhere the same. Not local to anyone.

I wasn't aware of that meaning. Its most common meaning at present
is Partitioned Global Address Space, with each processor owning some
memory but others being able to access it, possibly by the use of
special syntax. Very like some forms of SHMEM.

>Whereas SHMEM allows more types of memory accesses, including
> a. local memory, that may be shared with other processors
> b. remote accesses to memory that is local to other processors
>as well as remote access to memory that isn't local to anyone.
>And potentially other memory types.

Yes, and each use of SHMEM is different.


Regards,
Nick Maclaren.
From: Mayan Moudgill on
Andy "Krazy" Glew wrote:

>
> In strict PGAS (Private/Global Address Space) there are only two forms
> of memory access:
> 1. local private memory, inaccessible to other processors
> 2. global shared memory, accessible by all other processors,
> although implicitly accessible everywhere the same. Not local to anyone.
>
> Whereas SHMEM allows more types of memory accesses, including
> a. local memory, that may be shared with other processors
> b. remote accesses to memory that is local to other processors
> as well as remote access to memory that isn't local to anyone.
> And potentially other memory types.
>

I can't see that there is any benefit to having strictly private
memory (PGAS 1. above), at least on a high-performance MP system.

The CPUs are going to access memory via a cache. I doubt that there will
be 2 separate kinds of caches, one for private and one for the rest of
the memory. So, as far as the CPUs are concerned there is no distinction.

Since the CPUs are still going to have to talk to a shared memory (PGAS
2. above), there will still be a path/controller between the bottom of
the cache hierarchy and the shared memory. This "controller" will have
to implement whatever snooping/cache-coherence/transfer protocol is
needed by the global memory.

The difference between shared local memory (SHMEM a) and strictly
private local memory (PGAS 1) is whether the local memory sits below the
memory controller or bypasses it. It's not obvious (to me at least)
whether there are any benefits to be had by bypassing it. Can anyone
come up with something?