From: "Andy "Krazy" Glew" on
Mayan Moudgill wrote:
> I can't see that there is any benefit in having strictly private
> memory (PGAS 1. above), at least on a high-performance MP system.
>
> The CPUs are going to access memory via a cache. I doubt that there will
> be 2 separate kinds of caches, one for private and one for the rest of
> the memory. So, as far as the CPUs are concerned there is no distinction.
>
> Since the CPUs are still going to have to talk to a shared memory (PGAS
> 2. above), there will still be a path/controller between the bottom of
> the cache hierarchy and the shared memory. This "controller" will have
> to implement whatever snooping/cache-coherence/transfer protocol is
> needed by the global memory.
>
> The difference between shared local memory (SHMEM a) and strictly
> private local memory (PGAS 1) is whether the local memory sits below the
> memory controller or bypasses it. It's not obvious (to me at least)
> whether there are any benefits to be had by bypassing it. Can anyone
> come up with something?

Nick is right: the P in PGAS stands for partitioned, not private. For
some reason, I keep making this confusion.

(Pictures such as slide 4 in
http://groups.google.com/group/scaling-to-petascale-workshop-2009/web/introduction-to-pgas-languages?pli=1
are, perhaps, one source of my confusion, since Snir definitely
depicts private/global, not partitioned.)

Mayan is right: the main motivation for having private memory is whether
you want to bypass the cache. Believe it or not, many HPC people do not
want to have any cache whatsoever. I agree with Mayan: we will
definitely cache local accesses, and we probably don't want to create
special cases for remote memory. That being said, I will admit that I
have been thinking about special protocols for global memory, such as
described in the previous post.

I suppose that one of the reasons I have been thinking of private as
opposed to partitioned has been thinking about languages that have
"private" and "global" keywords. This is a smaller addition to the
language than adding a placement syntax. The question then is whether
you can convert a pointer to private T into a pointer to public T. UPC
seems to disallow this.

Even if in the implementation in hardware private and global memory
locations are cached in the same way, it may be desirable to
distinguish them at the language level: the compiler may be able to use more
efficient synchronization mechanisms for variables that are guaranteed
to be local private than it can use for global variables that might be
local or might be remote and might be shared with other processors.
Typically, on X86 the local variables may not require fencing because
of the X86's default strong memory ordering, whereas fences may be
required for global variables because the global interconnect may not
provide the snooping mechanisms that processors such as the P6 family
use to enforce strong memory ordering. Note that these fences may not
be the standard LFENCE, SFENCE, or MFENCE instructions, since those are
typically not externally visible. Instead they might have to be
expensive UC memory accesses, so that they are visible to the outside
world. Of course it would be wonderful to create new versions of the
fence instructions that could be visible to external memory fabric. But
if you go down that path you might actually end up having to distinguish
private and global memory.
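A rough sketch of the distinction the compiler could exploit (C11
atomics as a stand-in; mapping the seq_cst fence onto an externally
visible UC access is an assumption for illustration, not how any
shipping x86 actually behaves):

```c
#include <stdatomic.h>

/* Sketch: "private" = visible only to this processor's coherence
 * domain, "global" = possibly in a partition without hardware
 * snooping. A compiler that knows a flag is private can publish with
 * a plain release store; for a global flag it must also emit a fence
 * strong enough for the external fabric, modeled loosely here with a
 * seq_cst thread fence (in reality perhaps a UC access). */

static int        private_data;
static atomic_int private_flag;

static int        global_data;
static atomic_int global_flag;

void publish_private(int v) {
    private_data = v;
    /* x86 TSO + local snooping: store-release is enough */
    atomic_store_explicit(&private_flag, 1, memory_order_release);
}

void publish_global(int v) {
    global_data = v;
    /* non-snooped fabric: an explicit, possibly expensive fence
     * before the flag becomes visible to remote readers */
    atomic_thread_fence(memory_order_seq_cst);
    atomic_store_explicit(&global_flag, 1, memory_order_seq_cst);
}
```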

- - -

(I am writing this in the Seattle to Portland van, bouncing on the rough
roads. It is quite remarkable how much slower the computer is when
there is this much vibration. I fear that my heads are crashing all the
time. I really need to save up the money to get myself a solid state
disk. Also, as I have noted before, speech recognition works better in
this high-vibration environment than keyboarding, with handwriting
recognition in between. This is the first time I've actually used
speech recognition in the van with somebody else present, except for
Monday when I was riding with a person who was talking loudly on the
cell phone. I hope that I'm not disturbing the other passenger. I hope that
she will tell me honestly if I am, and not just be polite. I'm curious
to find out if speech recognition is socially acceptable in such
relatively high noise environments as the shuttle van or an airplane. I
hope that it is less obnoxious than speaking on a cell phone. Of
course, the impoliteness of talking on a cell phone does not stop many
people doing it. I suspect that dictating text is better than listening
to a cell phone, because I dictate in full sentences; but listening to
me edit text is probably even worse than listening to a cell phone. I
am falling into an odd hybrid of using speech to dictate and editing
with the pen.)
From: Mayan Moudgill on
nmm1(a)cam.ac.uk wrote:

>
> In particular, using a common cache with different coherence
> protocols for different parts of it has been done, but has never
> been very successful.

There is a distinction between choosing between two different coherence
protocols and the simpler choice between coherent and non-coherent memory.

At the hardware level, this would be a choice between running MOESI
(or whatever MESI variant is being used) when running with coherence,
and immediately promoting a line from S/O to M for purposes of writes
(for non-coherence); you'd use instruction control (cache flush, e.g.)
or write-through to guarantee its visibility to the outside world.
Following tradition, this would probably be controlled by bits in the
page table.

So, it's demonstrably simple to *implement* coherence/non-coherence. If
the lack of success is because it is difficult to use in an MP context,
that is a different issue.

>
>>>The main advantage of truly private memory, rather than incoherent
>>>sharing across domains, is reliability. You can guarantee that it
>>>won't change because of a bug in the code being run on another
>>>processor.
>>
>>If I wanted to absolutely guarantee that, I would put the access control
>>in the memory controller. If I wanted to somewhat guarantee that, I
>>would use the VM access right bits.
>
>
> Doubtless you would. And that is another example of what I said
> earlier. That does not "absolutely guarantee" that - indeed, it
> doesn't even guarantee it, because it still leaves the possibility
> of a privileged process on another processor accessing the pseudo-
> local memory. And, yes, I have seen that cause trouble.

Absolutely guarantee would imply a control register in the memory
controller with a bit that, if set, ensures that the only write (or
write and read) requests the memory controller allows through are those
from its "owning" processor. That is why the absolute guarantee is part
of the controller.
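The check itself is trivial; a sketch with illustrative names:

```c
#include <stdbool.h>

/* Sketch of the "absolute guarantee" living in the memory controller:
 * a control register with an exclusive bit; when set, only requests
 * from the owning processor pass. All names here are illustrative. */
typedef struct {
    bool exclusive;   /* the control-register bit        */
    int  owner;       /* processor id allowed when set   */
} mc_region;

bool mc_allow(const mc_region *r, int requester) {
    return !r->exclusive || requester == r->owner;
}
```

The point is that no software bug on another processor can get past
this gate, because the check happens after every cache and every
interconnect hop, at the memory itself.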

As you correctly pointed out, a VM based scheme fails in the presence of
bugs. Which is why I called it a "somewhat guarantee" exclusivity model.
From: Bernd Paysan on
Andy "Krazy" Glew wrote:
> 1) SMP: shared memory, cache coherent, a relatively strong memory
> ordering model like SC or TSO or PC. Typically writeback cache.
>
> 0) MPI: no shared memory, message passing

You can also have shared "write-only" memory. That's close to the MPI
side of the tradeoffs. Each CPU can read and write its own memory, but
can only write remote memories. The pro side is that all you need is a
similar infrastructure to MPI (send data packets around), and thus it
scales well; also, there are no blocking latencies.

The programming model can be closer to data flow than pure MPI, since
when you only pass data, writing the data to the target destination is
completely sufficient. A "this data is now valid" message might be
necessary (or some log in the memory controller from which each CPU
can extract what regions were written to).
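A minimal sketch of this write-only-remote model (illustrative names;
a real implementation would push the data across the fabric rather
than through ordinary shared C memory):

```c
#include <string.h>
#include <stddef.h>
#include <stdatomic.h>

/* Each node owns an inbox. A remote node may WRITE into it but never
 * read it; the owning node READS only its own memory. The "data is
 * now valid" flag is written last, so the consumer needs no read
 * traffic crossing the fabric. */
enum { SLOT_BYTES = 64 };

typedef struct {
    char       data[SLOT_BYTES];
    atomic_int valid;   /* written last by the producer */
} inbox_slot;

/* runs on the remote (producing) node: writes only */
void remote_send(inbox_slot *s, const void *msg, size_t n) {
    memcpy(s->data, msg, n);
    atomic_store_explicit(&s->valid, 1, memory_order_release);
}

/* runs on the owning node: purely local reads */
int local_poll(inbox_slot *s, void *out, size_t n) {
    if (!atomic_load_explicit(&s->valid, memory_order_acquire))
        return 0;                   /* nothing new yet */
    memcpy(out, s->data, n);
    atomic_store_explicit(&s->valid, 0, memory_order_relaxed);
    return 1;
}
```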

--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/
From: "Andy "Krazy" Glew" on
Andy "Krazy" Glew wrote:
> Mayan Moudgill wrote:
>> I can't see that there is any benefit in having strictly private
>> memory (PGAS 1. above), at least on a high-performance MP system.
>>
>> The CPUs are going to access memory via a cache. I doubt that there
>> will be 2 separate kinds of caches, one for private and one for the
>> rest of the memory. So, as far as the CPUs are concerned there is no
>> distinction.
>>
>> Since the CPUs are still going to have to talk to a shared memory
>> (PGAS 2. above), there will still be a path/controller between the
>> bottom of the cache hierarchy and the shared memory. This "controller"
>> will have to implement whatever snooping/cache-coherence/transfer
>> protocol is needed by the global memory.

> Even if in the implementation in hardware private and global memory
> locations are cached in the same way, it may be desirable to distinguish
> some of the language level: the compiler may be able to use more
> efficient synchronization mechanisms for variables that are guaranteed
> to be local private than it can use for global variables that might be
> local or might be remote and might be shared with other processors.

I mentioned the possibility of fencing being different for local/
private memory and for global memory.

I forgot to mention the possibility of software controlled cache coherence.

If the compiler has to emit cache flush directives around accesses to
global memory that is cached, and if these directives are as slow as on
present X86, then the compiler definitely wants to know what is private
and what is not.

IMHO this is a good reason to use the DMA model.

If flushing the cache is slow, then you may want to distinguish private
memory that can be cached, e.g. in your 2M/core L3 cache, from remote
cacheable memory, caching the latter in a smaller, cheaper-to-flush
structure.
From: nmm1 on
In article <4B2FEF58.4040907(a)patten-glew.net>,
Andy \"Krazy\" Glew <ag-news(a)patten-glew.net> wrote:
>Bernd Paysan wrote:
>>
>> You can also have shared "write-only" memory. That's close to the MPI
>> side of the tradeoffs. Each CPU can read and write its own memory, but
>> can only write remote memories. The pro side is that all you need is a
>> similar infrastructure to MPI (send data packets around), and thus it
>> scales well; also, there are no blocking latencies.
>>
>> The programming model can be closer to data flow than pure MPI, since
>> when you only pass data, writing the data to the target destination is
>> completely sufficient. A "this data is now valid" message might be
>> necessary (or some log in the memory controller from which each CPU
>> can extract what regions were written to).
>
>At first I liked this, and then I realized what I liked was the idea of
>being able to create linked data structures, readable by anyone, but
>only manipulated by the local node - except for the minimal operations
>necessary to link new nodes into the data structure.

Yes, that's a model I have liked for some time. I should be very
interested to know why Bernd regards the other way round as better;
I can't see it, myself, but can't convince myself that it isn't.

>I don't think that ordinary read/write semantics are acceptable. I
>think that you need the ability to "atomically" (for some definition of
>atomic - all atomicity is relative) read a large block of data. Used by
>a node A to read a data node in node B's memory.

I agree, but the problem has been solved for file-systems, where
snapshots are implemented in such a way as to appear to give such
atomic read semantics.

Actually, what I like is the database/BSP semantics. Updates are
purely local, until the owner says "commit", when all other nodes
will see the new structure when they next say "accept". Before
that, they see the old structure. Details of whether commit and
accept should be directed or global are topics for research ....

I think that it could be done fairly easily at the page level,
using virtual memory primitives, but not below unless the cache
line ones were extended.
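A toy two-version sketch of the commit/accept semantics (illustrative;
a real page-level implementation would use copy-on-write VM primitives
rather than two fixed buffers, and the version swap would have to be
atomic):

```c
/* The owner updates a shadow copy; "commit" publishes it by swapping
 * a version index; each reader keeps using the version it last
 * accepted until it calls accept() again. */
enum { PAGE = 4096 };

typedef struct {
    char pages[2][PAGE];
    int  published;        /* index the owner last committed */
} versioned_page;

void owner_write(versioned_page *v, unsigned off, char byte) {
    v->pages[!v->published][off] = byte;   /* update the shadow */
}

void owner_commit(versioned_page *v) {
    v->published = !v->published;  /* atomic swap in real life */
    /* The freed buffer becomes the next shadow. This assumes all
     * readers accept before the following commit reuses it; a real
     * system would use copy-on-write to avoid that constraint. */
}

/* reader-side handle: sees nothing new until it accepts */
typedef struct { int accepted; } reader;

void reader_accept(reader *r, const versioned_page *v) {
    r->accepted = v->published;
}

char reader_read(const reader *r, const versioned_page *v, unsigned off) {
    return v->pages[r->accepted][off];
}
```

Directed vs. global commit/accept, as Nick says, is the open question;
this sketch is the global flavor, where one swap publishes to everyone.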


Regards,
Nick Maclaren.