From: nmm1 on
In article <i2nbhb$di0$1(a)usenet01.boi.hp.com>,
Rick Jones <rick.jones2(a)hp.com> wrote:
>Tim McCaffrey <timcaffrey(a)aol.com> wrote:
>> With Unix (and, honestly, Windows as well), the user owns the I/O
>> buffer. Which is kind of silly, since the I/O library typically has
>> to copy from/to that buffer from/to what ever variables, etc, are
>> used in the actual write/read.
>
>> If the buffer was under the OS's control (do a read, get a pointer
>> back, when you do the next read, the previous pointer is invalid).
>> Then the OS would be allowed to DMA directly into user space,
>> because that buffer would be under OS control. It would even allow
>> no-copy access to the disk buffer cache (with appropriate page table
>> remapping & access control).
>
>With O_DIRECT or an asynchronous interface, the "owned by the
>application" buffer can be used for the DMA.

Grrk. Not quite. I have forgotten the details, but the specification
is enough of a mess that it is both hard to implement and hard to use
correctly. That is a recipe for people saying "sod it - let's just
copy the buffer".


Regards,
Nick Maclaren.
From: nmm1 on
In article <i2p86q$uq8$1(a)smaug.linux.pwf.cam.ac.uk>, <nmm1(a)cam.ac.uk> wrote:
>In article <i2nbhb$di0$1(a)usenet01.boi.hp.com>,
>Rick Jones <rick.jones2(a)hp.com> wrote:
>>Tim McCaffrey <timcaffrey(a)aol.com> wrote:
>>> With Unix (and, honestly, Windows as well), the user owns the I/O
>>> buffer. Which is kind of silly, since the I/O library typically has
>>> to copy from/to that buffer from/to what ever variables, etc, are
>>> used in the actual write/read.
>>
>>> If the buffer was under the OS's control (do a read, get a pointer
>>> back, when you do the next read, the previous pointer is invalid).
>>> Then the OS would be allowed to DMA directly into user space,
>>> because that buffer would be under OS control. It would even allow
>>> no-copy access to the disk buffer cache (with appropriate page table
>>> remapping & access control).
>>
>>With O_DIRECT or an asynchronous interface, the "owned by the
>>application" buffer can be used for the DMA.
>
>Grrk. Not quite. I have forgotten the details, but the specification
>is enough of a mess that it is both hard to implement and hard to use
>correctly. That is a recipe for people saying "sod it - let's just
>copy the buffer".

I had a memory failure that led to us talking at cross-purposes.
I was thinking of POSIX.

O_DIRECT isn't POSIX, and the nearest equivalents are a mess, but
POSIX asynchronous I/O is the real mess. Not merely does it allow
the user to specify any accessible location as a buffer, which is
incompatible with most forms of DMA, it doesn't forbid the program
from reading the contents of a buffer with data being read into it
asynchronously. And appending is specified to occur in the order
of the aio_write calls, whatever that means, which is REALLY bad
news for some process/thread structures and schedulers.

I agree with both of you, in principle. If it's specified right, it
can be done. Just don't start from POSIX.
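For concreteness, here is a minimal sketch of the interface being
criticised (the helper name read_async and the polling pattern are
mine, not from the standard): the buffer handed to aio_read() is an
arbitrary user pointer, and the only thing keeping the program from
peeking at it mid-transfer is discipline around aio_error().

```c
/* Sketch of the POSIX AIO interface under discussion.  The buffer is
 * any user-accessible memory, and nothing in the API stops the program
 * from reading it while the transfer is in flight; the only safe
 * pattern is to wait until aio_error() stops returning EINPROGRESS. */
#include <aio.h>
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

static ssize_t read_async(const char *path, char *buf, size_t len)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct aiocb cb;
    memset(&cb, 0, sizeof cb);
    cb.aio_fildes = fd;
    cb.aio_buf = buf;        /* any accessible address is legal here */
    cb.aio_nbytes = len;
    cb.aio_offset = 0;

    if (aio_read(&cb) != 0) {
        close(fd);
        return -1;
    }

    /* Touching buf before this point would be a race; the standard
     * does not forbid it, it merely leaves the contents unspecified. */
    const struct aiocb *list[1] = { &cb };
    while (aio_error(&cb) == EINPROGRESS)
        aio_suspend(list, 1, NULL);

    ssize_t n = aio_return(&cb);
    close(fd);
    return n;
}
```

Note that the kernel (or library) gets no say in where the buffer
lives, which is exactly the property that fights with DMA.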


Regards,
Nick Maclaren.
From: Rob Warnock on
<nmm1(a)cam.ac.uk> wrote:
+---------------
| Rick Jones <rick.jones2(a)hp.com> wrote:
| >With O_DIRECT or an asynchronous interface, the "owned by the
| >application" buffer can be used for the DMA.
|
| Grrk. Not quite. I have forgotten the details, but the specification
| is enough of a mess that it is both hard to implement and hard to use
| correctly. That is a recipe for people saying "sod it - let's just
| copy the buffer".
+---------------

With O_DIRECT, the main restrictions -- at least on SGI Irix and Linux --
were that:

1. The filesystem had to support it. XFS did/does, even on Linux.

2. The user-mode buffer had to be page-aligned and an integral
   number of pages in length.

This is an "issue" for Linux only in that the !@^!$#!@ default
"malloc()" for Linux *always* sticks its own little invisible
header just before the block that is returned, so even if BUFSIZ
is a power-of-2, your stdio buffers are *always* misaligned!!
So under Linux you must ask for N+1 pages and then offset the
address you get back to point to the N page-aligned interior pages
before giving the buffer to setbuf()/setvbuf(). [Under SGI's Irix
(and FreeBSD, FWIW) there was a malloc()-related API call that you
could use to set the alignment of large malloc() calls.]

3. For maximum performance you really wanted to double-buffer your data
[yes, this meant not using stdio at all!], since the network drivers
used copy-on-write (COW) for large writes to avoid copying the data
to the TCP retransmission buffers. That is, on large page-aligned TCP
write()s the protocol stack [at least on Irix, dunno 'bout Linux] would
mark the pages as COW, and hold on to those now-readonly pages *as*
the retransmission buffers. If you did a subsequent O_DIRECT disk input
read into a buffer that was still being used as a TCP retransmission
buffer, that would break the COW and cause the buffer to be copied,
hurting performance. [Actually, the user pages got unmapped, reallocated
from fresh system VM memory, the data got copied from the still-COW
original pages, and the (new) user pages were remapped writeable into
the user process at the same virtual addresses as before. Yes, this
would require TLB shootdowns in at least two places. (*sigh*)]

Once the output data for a given TCP write() had been completely
acknowledged by the receiver, the protocol stack would un-COW the
buffer, so if the next O_DIRECT disk input read were delayed until
after that point, no copying would be needed to complete that read().
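The N+1-page workaround for item 2 can be sketched roughly as follows;
page_aligned_alloc is a hypothetical helper name, and on systems with
posix_memalign() (or the Irix/FreeBSD memalign-style calls mentioned
above) you would use that instead of rolling your own.

```c
/* Sketch of the alignment workaround described above: over-allocate
 * by one page and round the pointer up, so the buffer handed to
 * O_DIRECT (or to setvbuf()) starts on a page boundary despite
 * malloc()'s hidden header. */
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

static void *page_aligned_alloc(size_t npages, void **raw_out)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);

    /* Ask for N+1 pages so an aligned region of N pages must fit. */
    void *raw = malloc((npages + 1) * page);
    if (raw == NULL)
        return NULL;
    *raw_out = raw;               /* caller keeps this for free() */

    uintptr_t p = (uintptr_t)raw;
    p = (p + page - 1) & ~(uintptr_t)(page - 1);  /* round up */
    return (void *)p;
}
```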

The result [on Irix] was that moving data off an XFS filesystem and out
to the network with TCP ran at ~500 MB/s (~4 Gb/s) circa 1998, with a
single TCP connection. This was true "zero copy" DMA for both disk and TCP.
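The buffer rotation that item 3 calls for might look like the skeleton
below; plain read()/write() stand in for the O_DIRECT disk read and
the COW TCP write(), and the function name pump is invented for
illustration.

```c
/* Illustrative double-buffer rotation.  While buffer A may still be
 * pinned by the protocol stack (as a COW retransmission buffer), the
 * next read fills buffer B, so no read ever lands in pages the stack
 * may still hold, and no COW break is triggered. */
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

static int pump(int in_fd, int out_fd, char *bufs[2], size_t bufsz)
{
    int cur = 0;
    for (;;) {
        ssize_t n = read(in_fd, bufs[cur], bufsz);   /* O_DIRECT read here */
        if (n < 0)
            return -1;
        if (n == 0)
            return 0;                                /* EOF */
        if (write(out_fd, bufs[cur], (size_t)n) != n) /* TCP write() here */
            return -1;
        cur ^= 1;  /* rotate: next read goes into the other buffer */
    }
}
```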

I don't know how much of the TCP output optimization was done in the
Agami Systems filer circa 2006, but the latter could supply NFS read
data to multiple clients on multiple GbEs at an aggregate rate of ~1 GB/s
(~8 Gb/s) off the disks (XFS variant which had O_DIRECT) and out the wires.

So the benefits of O_DIRECT were/are (IMHO) worth the slight hassle
of arranging your apps to follow the above few restrictions.


-Rob

-----
Rob Warnock <rpw3(a)rpw3.org>
627 26th Avenue <URL:http://rpw3.org/>
San Mateo, CA 94403 (650)572-2607

From: Terje Mathisen on
Rob Warnock wrote:
> The result [on Irix] was that moving data off an XFS filesystem and out
> to the network with TCP ran at ~500 MB/s (~4 Gb/s) circa 1998, with a
> single TCP connection. This was true "zero copy" DMA for both disk and TCP.

This is the Right Stuff, accept no substitutes.
>
> I don't know how much of the TCP output optimization was done in the
> Agami Systems filer circa 2006, but the latter could supply NFS read
> data to multiple clients on multiple GbEs at an aggregate rate of ~1 GB/s
> (~8 Gb/s) off the disks (XFS variant which had O_DIRECT) and out the wires.

All the way back in 1993 (or possibly '94?) Novell showed off how a
single Netware-386 server could send independent/random streaming
full-screen video to 64 simultaneous clients. At that point in time
Fast Ethernet ran at 100 Mbit/s, so they put 21-22 clients on each of
3 network segments and ran at nearly wire speed on all of them.

This worked because the RAID controller would DMA disk data directly
into the buffer cache, and the client data read requests would pick it
up from the same place.
>
> So the benefits of O_DIRECT were/are (IMHO) worth the slight hassle
> of arranging your apps to follow the above few restrictions.

For network traffic you also need the ability to specify packets as a
list of fragments, so that you can have packet headers and trailers in
one location and the actual data in another.

Novell used async Event Control Blocks for all such operations, where
you would effectively hand over the fragment list to the OS and either
get a callback or just check a status flag to know when it was safe to
reuse the ECB and corresponding fragments.
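On Unix the closest analogue to the fragment lists Terje describes is
gather output with writev(); here is a hedged sketch (the framing and
the name send_framed are invented for illustration, not Novell's ECB
API).

```c
/* Gather output: writev() takes a list of (base, length) pairs, so a
 * protocol header in one buffer and the payload in another go out as
 * a single write without first being copied into one contiguous
 * block. */
#include <assert.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

static ssize_t send_framed(int fd, const char *hdr, size_t hlen,
                           const char *data, size_t dlen)
{
    struct iovec iov[2];
    iov[0].iov_base = (void *)hdr;   /* fragment 1: header */
    iov[0].iov_len  = hlen;
    iov[1].iov_base = (void *)data;  /* fragment 2: payload */
    iov[1].iov_len  = dlen;
    return writev(fd, iov, 2);       /* kernel gathers both fragments */
}
```

The completion-notification half of the ECB scheme (callback or status
flag before the fragments may be reused) has no single Unix
equivalent; it corresponds roughly to waiting on the asynchronous
operation before touching the buffers again.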

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
From: George Neuner on
On Wed, 28 Jul 2010 19:28:54 +0100 (BST), nmm1(a)cam.ac.uk wrote:

>POSIX asynchronous I/O is the real mess. Not merely does it allow
>the user to specify any accessible location as a buffer, which is
>incompatible with most forms of DMA, it doesn't forbid the program
>from reading the contents of a buffer with data being read into it
>asynchronously. And appending is specified to occur in the order
>of the aio_write calls, whatever that means ...
It's a sequential model.

Preventing concurrent access isn't really possible unless the buffer
page is mapped out of the process during the DMA. You can argue that
either way: on one hand, remapping the buffer makes asynch I/O safe;
on the other, page granularity could be a problem for a program that
needs a lot of small buffers. Besides which, the API provides
synchronization, so a properly written program should not be
accessing a buffer that's in transit.

>... which is REALLY bad news for some process/thread structures
>and schedulers.

Not really following this. Concurrent access again?

George