Intel and AMD RDMA implementation [Computer Architecture]


From: Rick Jones on 23 Jul 2010 13:04

Andy Glew <"newsgroup at comp-arch.net"> wrote:
> In any case, however, DMA'ing between the I/O device and a staging
> area, and then between the staging area and ordinary memory, repeats
> operations unnecessarily. Avoiding the double copy by snooping
> caches usually far outweighs the cost of snooping.

Are there limits to that based on the number of things that must
take-part in the snooping?

rick jones
--
The glass is neither half-empty nor half-full. The glass has a leak.
The real question is "Can it be patched?"
these opinions are mine, all mine; HP might not want them anyway... :)
feel free to post, OR email to rick.jones2 in hp.com but NOT BOTH...

From: Andy Glew <"newsgroup at comp-arch.net"> on 23 Jul 2010 15:10

On 7/23/2010 10:23 AM, nmm1(a)cam.ac.uk wrote:
> In article<sMmdnbsyi8_eV9TRnZ2dnUVZ_gWdnZ2d(a)giganews.com>,
> Andy Glew<"newsgroup at comp-arch.net"> wrote:
>> On 7/22/2010 10:00 AM, nmm1(a)cam.ac.uk wrote:
>>> In article<loCdnfqEY4Vp6dXRnZ2dnUVZ_sCdnZ2d(a)giganews.com>,
>>> Andy Glew<"newsgroup at comp-arch.net"> wrote:
>>
>>>> It is my understanding that the vast majority of all I/O DMAs, in terms
>>>> of bytes transferred, are into WB memory and are coherent. Reason: NC
>>>> sucks, and if you I/O DMA into UC you need to transfer from UC to WB, or
>>>> back again. Which means you need a DMA copy engine. Which just puts
>>>> off the problem.
>>>
>>> Yes and no. Consider a Unix-like system (aren't they all, nowadays?)
>>> One sane implementation is to read blocks of data from disk into
>>> uncached memory, and the read and write calls then copy that (which
>>> they have to do anyway). So you have not lost anything.

>> In any case, however, DMA'ing between the I/O device and a staging area,
>> and then between the staging area and ordinary memory, repeats operations
>> unnecessarily. Avoiding the double copy by snooping caches usually far
>> outweighs the cost of snooping.
>
> And that's exactly what ISN'T the case, given my assumption! The point
> is that the software forces a copy between the staging area (which I
> was assuming could be written into directly by the device) and the
> cached memory visible to applications.
>
> Given that division of memory properties, what I said is the case;
> it's an old mainframe approach, after all. However, if that isn't
> the division used, well, then it isn't the case ....

In UNIX terms:

The disk read gets DMA'ed into the buffer cache, and the buffer cache
gets copied to user space. (Virtual memory remapping may be used in
some cases). Conversely for disk writes.

On most PCs: both user space and the buffer cache are ordinary writeback
memory. The I/O DMAs snoop the cache.

If you make the buffer cache UC, then the copy between user space and
the buffer cache is slow. But if you have a copy engine, or if you can
use USWC and that new instruction to make copies out of USWC fast, you
might try mapping the buffer cache USWC.

Sometimes people add yet another layer: they DMA to/from a staging
area, and then copy to the buffer cache, and then copy to user space. I
trust you can see how bad that is.

Occasionally people try to DMA directly into user space, e.g. for big files.
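
To make the copy counts in the paths above concrete, here is a minimal sketch. It only counts bytes moved by the device DMA and by CPU (or copy engine) copies for a single disk read; the path names and the function `bytes_moved` are made up for illustration and do not model cache effects or timing.

```python
# Hypothetical model: how many bytes move for one disk read of `nbytes`,
# depending on where the device DMAs to and how many copies follow.
CPU_COPIES = {
    "direct_to_user": 0,             # DMA straight into user space
    "buffer_cache": 1,               # DMA into buffer cache, copy to user space
    "staging_then_buffer_cache": 2,  # DMA to staging, copy to buffer cache, copy to user
}

def bytes_moved(nbytes, path):
    """Return (dma_bytes, copied_bytes) for reading nbytes via the given path."""
    return nbytes, CPU_COPIES[path] * nbytes

for path in CPU_COPIES:
    print(path, bytes_moved(4096, path))
```

The extra staging layer doubles the copied bytes relative to the plain buffer cache path, which is the cost referred to above.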

The biggest problem is not the cache snooping; it is that the standard
way of getting burst memory accesses is to use cacheable memory, and
that a buffer cache entry is often touched only once, during the copy,
so it displaces useful data from the cache.

From: Andy Glew <"newsgroup at comp-arch.net"> on 23 Jul 2010 15:14

On 7/23/2010 10:04 AM, Rick Jones wrote:
> Andy Glew<"newsgroup at comp-arch.net"> wrote:
>> In any case, however, DMA'ing between the I/O device and a staging
>> area, and then between the staging area and ordinary memory, repeats
>> operations unnecessarily. Avoiding the double copy by snooping
>> caches usually far outweighs the cost of snooping.
>
> Are there limits to that based on the number of things that must
> take-part in the snooping?
>
> rick jones

There might conceivably be, but modern systems tend to have directories
or snoop filters. I.e. they only send snoops to systems that are highly
likely to contain the line.

The real problem is that buffer caches are often only touched once, so
that they displace useful data from the cache.

Of course, if coherency traffic is your bottleneck, as it was on some
Sun systems about a decade ago, then any reduction in coherency traffic
would be good.

Coherency traffic tends to become a bottleneck when MP systems are
formed by pasting together CPUs not really designed for big MP.

From: Rick Jones on 23 Jul 2010 15:39

Andy Glew <"newsgroup at comp-arch.net"> wrote:
> There might conceivably be, but modern systems tend to have
> directories or snoop filters. I.e. they only send snoops to systems
> that are highly likely to contain the line.

To see if I am understanding correctly, wouldn't that last sentence be
more accurate as "They only send snoops to systems that are not known
to not contain the line?"

rick jones
--
I don't interest myself in "why". I think more often in terms of
"when", sometimes "where"; always "how much." - Joubert
these opinions are mine, all mine; HP might not want them anyway... :)
feel free to post, OR email to rick.jones2 in hp.com but NOT BOTH...

From: Andy Glew <"newsgroup at comp-arch.net"> on 23 Jul 2010 20:34

On 7/23/2010 12:39 PM, Rick Jones wrote:
> Andy Glew<"newsgroup at comp-arch.net"> wrote:
>> There might conceivably be, but modern systems tend to have
>> directories or snoop filters. I.e. they only send snoops to systems
>> that are highly likely to contain the line.
>
> To see if I am understanding correctly, wouldn't that last sentence be
> more accurate as "They only send snoops to systems that are not known
> to not contain the line?"

Your statement is more accurate on most systems, but not necessarily so.

There have been systems proposed that have snoop predictors. Suppose,
for example, that a line could be in two processor caches, P1 or P2,
but is predicted to be only in one, P1, and the snoop is sent to that
one. If the line in that cache is in an exclusive state, which indicates
that no other cache contains the line, there is no need to send the
snoop to P2. Whereas if the line was in shared state, you might have to
send a snoop to P2.
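
A minimal sketch of that decision, assuming MESI-style state letters and a hypothetical function name `snoop_with_predictor`; it is not taken from any particular proposal. It also treats a miss in the predicted cache like the shared case and falls back to snooping the rest.

```python
def snoop_with_predictor(states, predicted):
    """states maps cache name -> state of one line ('M', 'E', 'S' or 'I').
    Returns the caches that end up being snooped, in order."""
    snooped = [predicted]
    if states[predicted] in ("M", "E"):
        # Exclusive ownership: no other cache can hold the line, so stop here.
        return snooped
    # Shared (or not present): other caches may hold the line, snoop them too.
    snooped += [c for c in states if c != predicted]
    return snooped

print(snoop_with_predictor({"P1": "E", "P2": "I"}, "P1"))  # ['P1']
print(snoop_with_predictor({"P1": "S", "P2": "S"}, "P1"))  # ['P1', 'P2']
```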

But your statement is accurate for the systems I have seen that just
set a bit for a possible sharer. Where the bit if 0 says "definitely not
shared with this guy", but if 1 says "may be shared with this guy, but
may have been silently evicted".
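
As a rough illustration of the possible-sharer bit scheme, the sketch below keeps one bit per cache per line. The class and method names are assumptions for the example, not any real hardware interface. A 1 bit is conservative: a silent eviction does not clear it, so the filter can only over-report sharers, never miss one.

```python
class SharerBits:
    """Toy possible-sharer tracker: one bit per cache for each line address."""

    def __init__(self, num_caches):
        self.num_caches = num_caches
        self.bits = {}  # line address -> bitmask of possible sharers

    def record_fill(self, line, cache_id):
        # The cache fetched the line: it may now hold a copy.
        self.bits[line] = self.bits.get(line, 0) | (1 << cache_id)

    def snoop_targets(self, line):
        # Only caches whose bit is 1 need to be snooped.
        mask = self.bits.get(line, 0)
        return [c for c in range(self.num_caches) if mask & (1 << c)]

    def record_snoop_miss(self, line, cache_id):
        # A snoop found nothing (e.g. after a silent eviction): clear the stale bit.
        self.bits[line] = self.bits.get(line, 0) & ~(1 << cache_id)

f = SharerBits(4)
f.record_fill(0x1000, 0)
f.record_fill(0x1000, 2)
# Suppose cache 0 now silently evicts the line; the filter is not told.
print(f.snoop_targets(0x1000))  # [0, 2]  (cache 0 is a false positive)
f.record_snoop_miss(0x1000, 0)
print(f.snoop_targets(0x1000))  # [2]
```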

Some snoop filters have the Bloom property, and reduce size by hashing
different cache lines together to share filter bits. Others do exact
tracking.
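
A minimal sketch of the hashed variant, assuming 64-byte lines and a single trivial hash (a real Bloom filter would normally use several hash functions). The names are hypothetical. Different lines that hash to the same bit share it, so the filter can say "maybe present" for a line that was never filled, but it never says "absent" for one that was.

```python
LINE_BYTES = 64  # assumed cache line size

class HashedSnoopFilter:
    """Toy snoop filter in which many cache lines share each filter bit."""

    def __init__(self, num_bits):
        self.num_bits = num_bits
        self.bits = [False] * num_bits

    def _index(self, addr):
        # Line number modulo the filter size: lines num_bits apart collide.
        return (addr // LINE_BYTES) % self.num_bits

    def record_fill(self, addr):
        self.bits[self._index(addr)] = True

    def may_contain(self, addr):
        # False means definitely not cached; True means a snoop is needed.
        return self.bits[self._index(addr)]

f = HashedSnoopFilter(8)
f.record_fill(0x000)
print(f.may_contain(0x000))            # True  (really present)
print(f.may_contain(8 * LINE_BYTES))   # True  (alias of line 0: false positive)
print(f.may_contain(1 * LINE_BYTES))   # False (definitely not present)
```

The trade-off is size against wasted snoops: fewer bits mean more aliasing and more snoops sent to caches that do not hold the line.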
