From: Mike Hayward on
I'm not sure who is working on block io these days, but hopefully an
active developer can steer this feedback toward folks who are as
interested in io performance as I am :-)

I've spent the last several years or so developing a user space
distributed storage system and I've recently gotten down to some io
performance tuning. Surprisingly, my results indicate that the
O_NONBLOCK flag has no noticeable effect on read or writev to a
Linux block device. I always perform aligned ios which are a multiple
of the sector size, which also allows the use of O_DIRECT if desired.
For testing, I've been using 2.6.22 and 2.6.24 kernels (fedora core
and ubuntu distros) on both x86_64 and 32 bit arm architectures and
get similar results on every variation of hardware and kernel tested,
so I figure the behavior may still exist in the most recent kernels.

To extract the data below, I used the following set of system calls
in a loop driven by poll, surrounding the read and write calls
immediately with time checks.

fd = open( filename, O_RDWR | O_NONBLOCK | O_NOATIME );
gettimeofday( &time, 0 );
read( fd, pos, len );
writev( fd, iov, count );
poll( pfd, npfd, timeoutms );

Byte counts are displayed in hex. On my core 2 duo laptop, for
example, io to or from the buffer cache typically takes 100 to 125
microseconds to transfer 64k.

----------------------------------------------------------------------
BUFFER CACHE NOT FULL, NONBLOCKING 64K WRITES AS EXPECTED

write fd:3 0.000117s bytes:10000 remain:0
write fd:3 0.000115s bytes:10000 remain:0
write fd:3 0.000116s bytes:10000 remain:0
write fd:3 0.000118s bytes:10000 remain:0
write fd:3 0.000125s bytes:10000 remain:0
write fd:3 0.000126s bytes:10000 remain:0
write fd:3 0.000101s bytes:10000 remain:0

----------------------------------------------------------------------
READING AND WRITING, BUFFER CACHE FULL

read fd:3 0.006351s bytes:10000 remain:0
write fd:3 0.001235s bytes:200 remain:0
write fd:3 0.002477s bytes:200 remain:0
read fd:3 0.005010s bytes:10000 remain:0
write fd:3 0.001243s bytes:200 remain:0
read fd:3 0.005028s bytes:10000 remain:0
write fd:3 0.000506s bytes:200 remain:0
write fd:3 0.000106s bytes:10000 remain:0
write fd:3 0.000812s bytes:200 remain:0
write fd:3 0.000108s bytes:10000 remain:0
write fd:3 0.000807s bytes:200 remain:0
write fd:3 0.002652s bytes:200 remain:0
write fd:3 0.000107s bytes:10000 remain:0
write fd:3 0.000141s bytes:10000 remain:0
write fd:3 0.002232s bytes:200 remain:0

These are not worst-case, but rather best-case results. For an
example of closer-to-worst-case results, using a usb flash device
under heavier load I frequently (about once a second or so) see reads
or writes blocked for 500ms or more while vmstat and top report more
than 90% idle / wait. 500ms to perform a 512 byte "non blocking" io
with a nearly idle cpu is an eternity in computer time; more than
10,000 times longer than it should take to memcpy all or even a
portion of the data, or to return EAGAIN.

I discovered this because, even though they succeed, these "non"
blocking system calls block so much that they easily choke my
process's non blocking socket io. As a workaround for this failed
attempt at nonblocking disk io, I now intend to implement a somewhat
more complex solution using aio or scsi generic to prevent block
device io from choking network io.

I think this O_NONBLOCK behavior has aspects that could probably be
classified as both a documentation and a kernel defect, depending upon
whether the existing open(2) man page documents the intended behavior
of read and write.

If O_NONBLOCK is meaningful at all against block devices (see the man
page for semantics), one would expect a nonblocking io involving an
unbuffered page either to return a partial result if a prefix of the
io can be completed immediately, or to return EAGAIN, schedule an io
against the device, and then wake a blocking select or poll once the
relevant page at the file descriptor's offset becomes available in
the buffer cache. The timing and results of each read or write call
speak for themselves. Specifying O_NONBLOCK does not convert
unbuffered ios into async buffer cache ios as expected; typically
blocking ios (i.e. unbuffered reads, or sustained writes to a full,
dirty buffer cache) definitely block in my app, whether or not
O_NONBLOCK is specified.

I've spent a tremendous amount of time building and benchmarking a
program based upon the Linux documentation for these system calls,
only to find out the kernel doesn't behave as specified. To save
someone else from my fate: if O_NONBLOCK doesn't prevent reads and
writes to block devices from blocking, that should be documented in
the man page, and preferably open and fcntl should also return an
error when the flag is supplied for a block device. That's the easy
solution. The harder solution would be to make the system calls
actually non blocking when O_NONBLOCK is specified.

Furthermore, I've noticed these kernels allow O_NONBLOCK and O_DIRECT
to be specified simultaneously against a block device even though this
is not logically possible since, by definition, the buffer cache is
not involved and the process has to wait for the io to complete
synchronously. This flag incompatibility should probably be documented
for clarity, and it would be straightforward to return an error when
these contradictory behaviors are specified together, unintentionally
of course.

Thoughts anyone?

- Mike
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Alan Cox on
> If O_NONBLOCK is meaningful whatsoever (see man page docs for
> semantics) against block devices, one would expect a nonblocking io

It isn't...

The manual page says "When possible, the file is opened in non-blocking
mode". Your write is probably not blocking - but the memory allocation
for it is forcing other data to disk to make room, i.e. it didn't
block, it was just "slow".

O_NONBLOCK on a regular file does influence how it responds to leases and
mandatory locks.

> probably be documented for clarity and it would be straight forward
> for it to return an error if these contradictory behaviors are
> simultaneously specified, unintentionally of course.

and risk breaking existing apps.

> Thoughts anyone?


Alan
From: Mike Hayward on
Hi Alan,

> > If O_NONBLOCK is meaningful whatsoever (see man page docs for
> > semantics) against block devices, one would expect a nonblocking io
>
> It isn't...

Thanks for the reply. It's good to get confirmation that I am not all
alone in an alternate non blocking universe. The linux man pages
actually had me convinced O_NONBLOCK would actually keep a process
from blocking on device io :-)

> The manual page says "When possible, the file is opened in non-blocking
> mode". Your write is probably not blocking - but the memory allocation
> for it is forcing other data to disk to make room, i.e. it didn't
> block, it was just "slow".

Even though I know quite well what blocking is, I am not sure how we
define "slowness". Perhaps when we do define it, we can also define
"immediately" to mean anything less than five seconds ;-)

You are correct that io to the disk is precisely what must happen for
the call to complete, and last time I checked, that was the very
definition of blocking. Not only are writes blocking, even reads are
blocking. The docs for read(2) also say it will return EAGAIN if
"Non-blocking I/O has been selected using O_NONBLOCK and no data was
immediately available for reading."

There is no doubt the kernel is blocking the process whether or not
O_NONBLOCK is specified. Look again at the timings I sent; the flag
doesn't affect io at all. I think we can probably agree that reading
from an empty buffer cache should by definition return EAGAIN within a
few microseconds if it isn't going to block the process. But it
doesn't. I can easily make a process "run slowly" for an entire half
of a second or longer just trying to perform a 512 byte "non blocking"
read on a system with a virtually idle cpu.

Writing is no different from reading when the buffer cache cannot
immediately service either kind of request (i.e. all pages are dirty,
the page being written is not in the cache, and there is no more free
ram). If a process can't run while the kernel performs io to a device
to service a writev call, the call is by definition blocking said
process. I certainly concur that blocking is also both slow and not
very immediate :-)

Why is blocking io an issue? As an example, time non blocking reads
to a drive and it takes say 5ms to return from a 64k read. Run
several processes simultaneously doing the same thing and it takes say
10ms to service each "non blocking" read request. Do a couple hundred
ios per second in each process and you'll soon find out your processes
(or threads) have nearly zero time at the cpu despite the fact that
the system is virtually idle and you are performing 100% "linux non
blocking" device io.

I've been doing unix io for a very long time and can assure you that
this is precisely why most high performance io applications use
asynchronous io libraries or multiple threads. It isn't that they are
necessarily compute intensive, but if read and write are going to
block your process, how else can you simultaneously execute ios to
different devices or perform computation while waiting on device io?

----------------------------------------------------------------------
There is currently and quite literally no point in specifying
O_NONBLOCK in Linux when opening a block device to affect anything
other than locking semantics, since it doesn't do anything.
----------------------------------------------------------------------

I'm not arguing that linux either should or should not provide non
blocking read and write calls, but pointing out that the documentation
claims it does when clearly O_NONBLOCK doesn't do anything related to
io, at least not with a block device. Probably it doesn't do anything
related to read or write against file systems either.

> > probably be documented for clarity and it would be straight forward
> > for it to return an error if these contradictory behaviors are
> > simultaneously specified, unintentionally of course.
>
> and risk breaking existing apps.

Changing anything risks breaking an app somewhere :-) You are right; I
completely agree it isn't appropriate to remove it, since its meaning
has been overloaded and it affects locking semantics with O_DIRECT.

Perhaps the man pages are partly derived from POSIX specs, and non
blocking read and write calls are where linux eventually wants to be?
Updating the docs to describe the actual behavior as it stands (or
rather, the lack thereof) should have fairly low impact on existing apps.

How much effort do you think it would take to build consensus to
update the man pages? Accurate man pages don't really break code and
should really cut down on a lot of confusion, emails, and wasted
effort going forward. Do you think we should post a documentation
defect as opposed to a kernel defect?

- Mike
From: Alan Cox on
> blocking. Not only are writes blocking, even reads are blocking. The
> docs for read(2) also says it will return EAGAIN if "Non-blocking I/O
> has been selected using O_NONBLOCK and no data was immediately
> available for reading."

The read case is more clearly blocking. We don't implement non blocking
disk I/O in that sense, although AIO sort of does and threads are very
cheap for I/O tasks.

> There is no doubt the kernel is blocking the process whether or not
> O_NONBLOCK is specified. Look again at the timings I sent; the flag
> doesn't affect io at all. I think we can probably agree that reading
> from an empty buffer cache should by definition return EAGAIN within a
> few microseconds if it isn't going to block the process.

That might make sense in its own way but there would then be no reason
for the I/O ever to complete. Non blocking tends to mean "don't wait for
some external non kernel event" (eg serial data arriving, hitting a
button)

> I've been doing unix io for a very long time and can assure you that
> this is precisely why most high performance io applications use
> asynchronous io libraries or multiple threads. It isn't that they are
> necessarily compute intensive, but if read and write are going to
> blocking your process, how else can you simultaneously execute ios to
> different devices or perform computation while waiting on device io?

The big challenge is that you may need to do disk I/O in many situations
you don't expect. Eg to find out which disk block in the cache you want
to see is available might require disk I/O itself.

You would end up with an implementation model in the kernel that was
essentially

if (O_NDELAY) {
        try_op;
        if (blocking)
                create_thread;
}

which would badly underperform threading it in the first place.

Unix perhaps never got it entirely right, but we inherited that model.
VMS SYS$QIO v SYS$QIOW is a good deal more elegantly structured.

> claims it does when clearly O_NONBLOCK doesn't do anything related to
> io, at least not with a block device. Probably it doesn't do anything
> related to read or write against file systems either.

Correct - except for things like mandatory locks where it has a real
meaning.

> Perhaps the man pages are partly derived from POSIX specs and non
> blocking read and write calls are where linux eventually wants to be?
> Updating the docs to describe it's actual behavior as it stands (or
> rather, lack thereof) should be fairly low impact on existing apps.

I've not read the SuS entries on this for a while. There was some
discussion a while ago on what was needed to create a behaviour where,
as soon as something blocked, the kernel created a thread that
continued to perform the I/O and returned an error to the caller. It's
not an easy problem to solve, and it's not clear that solving it is
actually worth it versus using threads and making sure our thread
implementation is fast and has fast synchronization primitives.

> How much effort do you think it would take to build consensus to
> update the man pages? Accurate man pages don't really break code and
> should really cut down on a lot of confusion, emails, and wasted
> effort going forward. Do you think we should post a documentation
> defect as opposed to a kernel defect?

I would go one further... post a documentation patch to
linux-man(a)vger.kernel.org for discussion and merging.

Alan
From: M vd S on
> > > If O_NONBLOCK is meaningful whatsoever (see man page docs for
> > > semantics) against block devices, one would expect a nonblocking io
> >
> > It isn't...
>
> Thanks for the reply. It's good to get confirmation that I am not all
> alone in an alternate non blocking universe. The linux man pages
> actually had me convinced O_NONBLOCK would actually keep a process
> from blocking on device io :-)
>

You're even less alone, I'm running into the same issue just now. But I
think I've found a way around it, see below.

> > The manual page says "When possible, the file is opened in non-blocking
> > mode". Your write is probably not blocking - but the memory allocation
> > for it is forcing other data to disk to make room, i.e. it didn't block,
> > it was just "slow".
>
> Even though I know quite well what blocking is, I am not sure how we
> define "slowness". Perhaps when we do define it, we can also define
> "immediately" to mean anything less than five seconds ;-)
>
> You are correct that io to the disk is precisely what must happen to
> complete, and last time I checked, that was the very definition of
> blocking. Not only are writes blocking, even reads are blocking. The
> docs for read(2) also says it will return EAGAIN if "Non-blocking I/O
> has been selected using O_NONBLOCK and no data was immediately
> available for reading."
>

The read(2) manpage reads, under NOTES:

"Many file systems and disks were considered to be fast enough that the
implementation of O_NONBLOCK was deemed unnecessary. So, O_NONBLOCK may
not be available on files and/or disks."

The statement ("fast enough") maybe only reflects the state of affairs
at the time - a 10 ms seek is an eternity at 3 GHz, and 100k of them
take an eternity IRL as well. I would define "immediately" as: the
data is available from kernel (or disk) buffers.

I need to do vast amounts (100k+) of scattered and unordered small
reads from a hard disk and want to keep my seeks short by sorting
them. I have done some measurements, and it seems perfectly possible
to derive the physical disk layout from statistics on some 10-100k
random seeks, so I could solve everything in userland. But before
writing my own I/O scheduler I thought I'd give the kernel and/or
SATA's NCQ tricks a shot.

Now the problem is how to tell the kernel/disk which data I want
without blocking. readv(2) apparently performs the requests in array
order. Multithreading doesn't sound too good for just this purpose.

posix_fadvise(2) sounds like something: "POSIX_FADV_WILLNEED initiates
a non-blocking read of the specified region into the page cache."
But there's apparently no signalling to the process that an actual
read() will indeed not block.

readahead(2) blocks until the specified data has been read.

aio_read(3) apparently doesn't issue a real non blocking read request,
so you get the unneeded overhead of one thread per outstanding request.


mmap(2) / madvise(2) / mincore(2) may be a way around things (although
non-atomic), but I haven't tested it yet. It might also solve the
problem that started this thread, at least for the reading part.
Writing a small read()-like function that operates through mmap()
doesn't seem too complicated. As for writing, you could use msync()
with MS_ASYNC to initiate a write. I'm not sure how to find out
whether a write has indeed taken place, but at least initiating a
non-blocking write is possible. munmap() might then still block.

Maybe some guru here can tell beforehand if such an approach would work?

Cheers,
M.
