From: Andy Glew <"newsgroup at comp-arch.net"> on
On 7/21/2010 8:38 AM, nmm1(a)cam.ac.uk wrote:
> Arising out of a course I am writing, I want to find out roughly
> how Intel and AMD handle I/O transfers to and from memory at the
> hardware level. So far, my searching has led nowhere beyond what
> I know, such as:
>
> The I/O controller uses the HyperTransport or QuickPath link
> to talk to the memory controller on the CPU that owns the memory.
> Well, I assume that, because anything else would be silly.
>
> But:
>
> Do they do those in a cache-coherent fashion, or is that
> independent of the cache?

I/O DMAs can be either cache coherent or non-coherent.

[UC] They can be directed to memory ranges that are never cached - which
I suppose is cache coherent in a manner.

[WB] They can be directed to memory ranges that are cached by the
processors. When they are, the caches are kept coherent - which usually
means that the processor caches are snooped, and flushed or invalidated,
on a line by line basis. (By the way, one of the issues was: "Is an I/O
DMA allowed to completely overwrite an M-state line without first
obtaining ownership?" We allowed this on P6 FSB, but I think that QPI
has just lost this ability.)

[NC] They can be directed to memory that can be cached, but for which
the I/O DMA transactions are not snooped, and so it is not coherent.
This is, I believe, HIGHLY DISCOURAGED, although every few years
somebody gets the bright idea to do this.

Much of this is configured in what used to be called the chipset, and
is now usually integrated on the processor die. There is a plethora of
range registers that describe this.
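
In Linux terms, the same split shows up in the DMA API. A minimal
sketch (assuming a driver that already holds a struct device *dev and
a kernel buffer kbuf; error handling omitted): a "coherent" allocation
corresponds to the hardware-coherent [WB] case, while a "streaming"
mapping is allowed to be non-coherent underneath, with software doing
the flush/invalidate at the sync points.

#include <linux/dma-mapping.h>

#define BUF_SIZE 4096   /* hypothetical buffer size */

static void dma_mapping_sketch(struct device *dev, void *kbuf)
{
        dma_addr_t bus_addr, handle;
        void *coherent;

        /* Coherent mapping: the platform keeps this buffer coherent
         * with the CPU caches (the [WB] coherent case above); no
         * explicit flushes are needed around device accesses. */
        coherent = dma_alloc_coherent(dev, BUF_SIZE, &bus_addr,
                                      GFP_KERNEL);

        /* Streaming mapping: may be non-coherent underneath (the
         * [NC]-ish case); software brackets device accesses with
         * sync calls. On x86 these are typically no-ops precisely
         * because the I/O DMA snoops the caches anyway. */
        handle = dma_map_single(dev, kbuf, BUF_SIZE, DMA_FROM_DEVICE);
        /* ... device DMAs into kbuf here ... */
        dma_sync_single_for_cpu(dev, handle, BUF_SIZE, DMA_FROM_DEVICE);
        dma_unmap_single(dev, handle, BUF_SIZE, DMA_FROM_DEVICE);

        dma_free_coherent(dev, BUF_SIZE, coherent, bus_addr);
}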

I was quite surprised by Mitch's post, where I think he said that most
I/O DMAs are to UC uncached or NC non-coherent memory. This is not my
understanding. (I may have misunderstood Mitch's post.)

It is my understanding that the vast majority of all I/O DMAs, in terms
of bytes transferred, are into WB memory and are coherent. Reason: NC
sucks, and if you I/O DMA into UC you need to transfer from UC to WB, or
back again. Which means you need a DMA copy engine. Which just puts
off the problem.

However, there may just be a terminology mismatch:

(a) although it is my understanding that the majority of I/O traffic,
in terms of bytes transferred, is DMAed to coherent WB memory:
   a.1) most disks
   a.2) many network interfaces (ethernet, etc.), although uncached/NC
   is more common with NICs

(b) conversely, it may well be that many I/O devices use UC or NC -
because most I/O devices are not really important enough to have been
tuned.

(c) some of the highest performing I/O devices may use UC/NC, seeking to
reduce overhead. E.g. in supercomputers.

By the way, there is a relatively new motivation for NC: it saves power,
by avoiding having to power on the CPU to snoop its caches.




> Indeed, do they update the cache (as some systems used to) and,
> if so, up to which level?

For the most part, the caches are flushed or invalidated on a line by
line basis by I/O DMAs.

However, Intel has recently added DCA, Direct Cache Access:

http://www.intel.com/network/connectivity/vtc_ioat.htm

Direct Cache Access (DCA) allows a capable I/O device, such as a
network controller, to place data directly into CPU cache, reducing
cache misses and improving application response times.

I am only aware of DCA being used by NICs, although it might be used by
disk drives.

The earliest implementations of DCA were inefficient, and not always a
win. They may have improved by now.

I am not aware of public docs describing which levels of the cache DCA
may insert into.

(By the way, there are several strategies here, sketched as an enum
below:
0- don't snoop caches
1- snoop / invalidate / flush caches
2- update lines that already exist in the cache (but don't push new
lines in)
3- insert lines into cache
)
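
In rough pseudo-C (names mine, purely illustrative):

/* Possible DMA-vs-cache strategies, per the list above. */
enum dma_snoop_strategy {
        DMA_NO_SNOOP     = 0, /* don't snoop caches (non-coherent) */
        DMA_SNOOP_FLUSH  = 1, /* snoop; flush/invalidate matching lines */
        DMA_UPDATE_HIT   = 2, /* update lines already present; don't allocate */
        DMA_CACHE_INJECT = 3  /* insert new lines into the cache (DCA-style) */
};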








> Is there any public documentation on this? I am not looking
> for details, but enough information to be able to write reliable
> notes on advanced tuning.

I have found some of this information

a) in the chipset (not the processor) manuals. I'm reasonably certain I
have seen discussions of this in public AMD manuals, as well as Intel's.

b) in the QPI book (Singh, Safranek, et al)

c) in Open Source manuals



I wonder if the EMON performance counters could determine what fraction
of I/O DMA requests fall into which camp.


Heck: I just googled a bit, and found AMD's IOMMU,
http://support.amd.com/us/Embedded_TechDocs/34434-IOMMU-Rev_1.26_2-11-09.pdf

It has bits like

The FC bit in the page translation entry is used to specify if DMA
transactions that target the page must clear the PCI-defined No Snoop
bit. The state of this bit is returned to a device with an IOTLB on
an explicit translation request. If FC=1 for an untranslated access,
the IOMMU sets the coherent bit in the upstream HyperTransport(tm)
request packet. If FC=0 for an untranslated access, the IOMMU passes
upstream the coherent attribute from the originating request.

which I think tends to imply that there is some possibility of coherent I/O.
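
Transcribed into a sketch (struct and function names are mine, not
AMD's; only the FC logic is from the quoted text):

#include <stdbool.h>

struct iommu_pte {
        bool fc;   /* Force Coherent bit in the page translation entry */
};

/* Coherent attribute of the upstream HyperTransport request for an
 * untranslated access, per the quoted paragraph. */
static bool upstream_coherent(const struct iommu_pte *pte,
                              bool originating_coherent_attr)
{
        if (pte->fc)
                return true;   /* FC=1: IOMMU sets the coherent bit */
        return originating_coherent_attr;  /* FC=0: passed upstream as-is */
}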

From: MitchAlsup on
On Jul 22, 11:42 am, Andy Glew <"newsgroup at comp-arch.net"> wrote:
> I was quite surprised by Mitch's post, where I think he said that most
> I/O DMAs are to UC uncached or NC non-coherent memory.  This is not my
> understanding.  (I may have misunderstood Mitch's post.)

I was only referring to buffers an OS may use to support a DMA device
with limited address-bits, so the I/O goes to a page in memory where it
can read or write, and then the OS moves the page to its real location.
The UC info may be out-of-date by now.
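
A sketch of that bounce-buffer scheme (all names hypothetical; Linux's
swiotlb is the real-world version of this):

#include <stdint.h>
#include <string.h>

#define DEVICE_DMA_LIMIT (1ULL << 32)  /* e.g. a 32-bit-only DMA engine */

extern uint64_t phys_addr(const void *va);   /* virtual -> physical */
extern void *alloc_low_page(void);           /* page below the limit */
extern void device_dma_read(void *dst, size_t len);  /* device fills dst */

void dma_read_bounced(void *real_dest, size_t len)
{
        if (phys_addr(real_dest) + len <= DEVICE_DMA_LIMIT) {
                device_dma_read(real_dest, len);  /* device can reach it */
                return;
        }
        /* Device can't address the real destination: DMA into a low
         * page it can reach, then the OS moves the data to its real
         * location. */
        void *bounce = alloc_low_page();
        device_dma_read(bounce, len);
        memcpy(real_dest, bounce, len);
}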

Mitch
From: nmm1 on
In article <loCdnfqEY4Vp6dXRnZ2dnUVZ_sCdnZ2d(a)giganews.com>,
Andy Glew <"newsgroup at comp-arch.net"> wrote:

Thanks very much. That is very useful, even if not quite the same.

>[NC] They can be directed to memory that can be cached, but for which
>the I/O DMA transactions are not snooped, and so it is not coherent.
>This is, I believe, HIGHLY DISCOURAGED, although every few years
>somebody gets the bright idea to do this.

The ability of people to reinvent three-sided wheels is incredible.

>I was quite surprised by Mitch's post, where I think he said that most
>I/O DMAs are to UC uncached or NC non-coherent memory. This is not my
>understanding. (I may have misunderstood Mitch's post.)

One of us did, because that's not what I understood him to say.

>It is my understanding that the vast majority of all I/O DMAs, in terms
>of bytes transferred, are into WB memory and are coherent. Reason: NC
>sucks, and if you I/O DMA into UC you need to transfer from UC to WB, or
>back again. Which means you need a DMA copy engine. Which just puts
>off the problem.

Yes and no. Consider a Unix-like system (aren't they all, nowadays?)
One sane implementation is to read blocks of data from disk into
uncached memory, and the read and write calls then copy that (which
they have to do anyway). So you have not lost anything.
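
A sketch of that read() path (Linux-flavored; staging_buf and
disk_dma_read() are hypothetical stand-ins):

#include <linux/uaccess.h>
#include <linux/errno.h>

extern void *staging_buf;                      /* uncached, device-visible */
extern void disk_dma_read(void *dst, size_t len);

/* The device DMAs a block into the uncached staging buffer; read()
 * then copies it into the caller's (cached) buffer - a copy the
 * syscall had to make anyway. */
static ssize_t my_read(char __user *ubuf, size_t len)
{
        disk_dma_read(staging_buf, len);
        if (copy_to_user(ubuf, staging_buf, len))
                return -EFAULT;
        return len;
}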

The same applies when implementing MPI on top of an unhelpful
protocol, such as TCP/IP. I don't think that many people do use
uncached memory for that, though.

> However, Intel has recently added DCA, Direct Cache Access:

> http://www.intel.com/network/connectivity/vtc_ioat.htm
>
> Direct Cache Access (DCA) allows a capable I/O device, such as a
> network controller, to place data directly into CPU cache, reducing
> cache misses and improving application response times.

Hmm. I wonder how many cards use that.


Regards,
Nick Maclaren.
From: Andy Glew <"newsgroup at comp-arch.net"> on
On 7/22/2010 10:00 AM, nmm1(a)cam.ac.uk wrote:
> In article<loCdnfqEY4Vp6dXRnZ2dnUVZ_sCdnZ2d(a)giganews.com>,
> Andy Glew<"newsgroup at comp-arch.net"> wrote:

>> It is my understanding that the vast majority of all I/O DMAs, in terms
>> of bytes transferred, are into WB memory and are coherent. Reason: NC
>> sucks, and if you I/O DMA into UC you need to transfer from UC to WB, or
>> back again. Which means you need a DMA copy engine. Which just puts
>> off the problem.
>
> Yes and no. Consider a Unix-like system (aren't they all, nowadays?)
> One sane implementation is to read blocks of data from disk into
> uncached memory, and the read and write calls then copy that (which
> they have to do anyway). So you have not lost anything.

That might be sane, except:

The copy from the uncached (UC memory type) area that the I/O DMA from
the disk wrote to can be done either

a) by the CPU

b) by some sort of programmable copy engine.

Intel's I/OAT (I/O Acceleration Technology) makes the programmable copy
engine a bit more standard. Nuff said.

On current machines, having the CPU do that copy from the UC area to
ordinary memory is very, Very, VERY slow. As we have discussed here
previously, UC memory is just plain not optimized. There is no
distinction between UC memory that is ordinary memory, that could have
burst accesses, etc., and UC memory that has side effects - so the worst
case assumptions are made.

The USWC memory type could be used as a target to copy from CPU memory
into this I/O staging area. This makes transfers from ordinary memory
to the staging area, to be subsequently written to disk, fast. I.e. it
makes disk writes fast(er). But it doesn't help disk reads.

There have been recent steps to improve disk reads, or, rather, the
reads from an uncacheable area (actually USWC) that might be used.
This is mainly in the form of a new instruction, whose mnemonic I can't
remember at the moment. (:-) I don't really like the implementation of
this instruction, but it helps.
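
For what it's worth, the instruction is presumably MOVNTDQA, the
SSE4.1 streaming load aimed at exactly this read-from-USWC case (its
store-side counterpart, MOVNTDQ, covers the write direction above).
A user-level sketch with the intrinsics, assuming 16-byte-aligned
buffers and a multiple-of-16 length:

#include <immintrin.h>  /* SSE4.1: compile with -msse4.1 or better */
#include <stddef.h>

/* Read side: streaming loads (MOVNTDQA) from a USWC staging area into
 * ordinary WB memory; on USWC the load can fill a whole streaming
 * buffer and get something like a burst read instead of the worst-case
 * uncached access. */
static void copy_from_uswc(void *dst, const void *src, size_t len)
{
        __m128i *d = (__m128i *)dst;
        const __m128i *s = (const __m128i *)src;
        for (size_t i = 0; i < len / 16; i++) {
                __m128i v = _mm_stream_load_si128((__m128i *)&s[i]);
                _mm_store_si128(&d[i], v);
        }
}

/* Write side: non-temporal stores (MOVNTDQ) into the USWC staging
 * area - the part that already made disk writes fast(er). */
static void copy_to_uswc(void *dst, const void *src, size_t len)
{
        __m128i *d = (__m128i *)dst;
        const __m128i *s = (const __m128i *)src;
        for (size_t i = 0; i < len / 16; i++)
                _mm_stream_si128(&d[i], _mm_load_si128(&s[i]));
        _mm_sfence();   /* order the WC stores before what follows */
}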

In any case, however, DMA'ing between the I/O device and a staging area,
and then between the staging area and ordinary memory, repeats operations
unnecessarily. The benefit of avoiding the double copy by snooping the
caches usually far outweighs the cost of snooping.
From: nmm1 on
In article <sMmdnbsyi8_eV9TRnZ2dnUVZ_gWdnZ2d(a)giganews.com>,
Andy Glew <"newsgroup at comp-arch.net"> wrote:
>On 7/22/2010 10:00 AM, nmm1(a)cam.ac.uk wrote:
>> In article<loCdnfqEY4Vp6dXRnZ2dnUVZ_sCdnZ2d(a)giganews.com>,
>> Andy Glew<"newsgroup at comp-arch.net"> wrote:
>
>>> It is my understanding that the vast majority of all I/O DMAs, in terms
>>> of bytes transferred, are into WB memory and are coherent. Reason: NC
>>> sucks, and if you I/O DMA into UC you need to transfer from UC to WB, or
>>> back again. Which means you need a DMA copy engine. Which just puts
>>> off the problem.
>>
>> Yes and no. Consider a Unix-like system (aren't they all, nowadays?)
>> One sane implementation is to read blocks of data from disk into
>> uncached memory, and the read and write calls then copy that (which
>> they have to do anyway). So you have not lost anything.
>
>That might be sane, except:
>
>The copy from the uncached (UC memory type) area that the I/O DMA from
>the disk wrote to can be done either
>
>a) by the CPU
>
>b) by some sort of programmable copy engine.
>
>Intel's I/OAT (I/O Acceleration Technology) makes the programmable copy
>engine a bit more standard. Nuff said.

Right.

>On current machines, having the CPU do that copy from the UC area to
>ordinary memory is very, Very, VERY slow. As we have discussed here
>previously, UC memory is just plain not optimized. There is no
>distinction between UC memory that is ordinary memory, that could have
>burst accesses, etc., and UC memory that has side effects - so the worst
>case assumptions are made.

Yuck. All right, given that design, I take your point. I was
assuming just plain not cached and accessible to I/O devices.

>In any case, however, DMA'ing between the I/O device and a staging area,
>and then between the staging area and ordinary memory, repeats operations
>unnecessarily. The benefit of avoiding the double copy by snooping the
>caches usually far outweighs the cost of snooping.

And that's exactly what ISN'T the case, given my assumption! The point
is that the software forces a copy between the staging area (which I
was assuming could be written into directly by the device) and the
cached memory visible to applications.

Given that division of memory properties, what I said is the case;
it's an old mainframe approach, after all. However, if that isn't
the division used, well, then it isn't the case ....


Regards,
Nick Maclaren.