From: Jamie Lokier on
Phillip Susi wrote:
> On 4/21/2010 12:12 PM, Jamie Lokier wrote:
> > Asynchronous is available: Use clone or pthreads.
>
> Synchronous in another process is not the same as async. It seems I'm
> going to have to do this for now as a workaround, but one of the reasons
> that aio was created was to avoid the inefficiencies this introduces.
> Why create a new thread context, switch to it, put a request in the
> queue, then sleep, when you could just drop the request in the queue in
> the original thread and move on?

Because tests have found that it's sometimes faster than AIO anyway!

...for those things where AIO is supported at all. The problem with
more complicated fs operations (like, say, buffered file reads and
directory operations) is you can't just put a request in a queue.

Some of it has to be done in a context with stack and occasional
sleeping. It's just too complicated to make all filesystem operations
_entirely_ async, and that is the reason Linux AIO has never gotten
very far trying to do that.

Those things where putting a request on a queue works tend to move the
sleepable metadata fetching to the code _before_ the request is queued
to get around that. Which is one reason why Linux O_DIRECT AIO can
still block when submitting a request... :-/

The most promising direction for AIO at the moment is in fact spawning
kernel threads on demand to do the work that needs a context, and
swizzling some pointers so that it doesn't look like threads were used
to userspace.

Kernel threads on demand, especially magical demand at the point where
the thread would block, are faster than clone() in userspace - but not
expected to be much faster if you're reading from cold cache anyway,
with lots of blocking happening.

You might even find that calling readahead() on *files* goes a bit
faster if you have several threads working in parallel calling it,
because of the ability to parallelise metadata I/O.
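
To make that concrete, here is a minimal sketch of the idea in C with
plain pthreads (compile with -pthread). The thread count and the
whole-file readahead ranges are arbitrary choices for the example, not
tuned values:

    /* Parallel-readahead sketch: each worker calls readahead(2) on
     * its share of the files named on the command line, so the
     * metadata I/O can proceed in parallel. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <pthread.h>
    #include <sys/stat.h>
    #include <unistd.h>

    struct job { char **paths; int n; };

    static void *warm(void *arg)
    {
        struct job *j = arg;
        for (int i = 0; i < j->n; i++) {
            int fd = open(j->paths[i], O_RDONLY);
            struct stat st;
            if (fd < 0)
                continue;
            if (fstat(fd, &st) == 0)
                readahead(fd, 0, st.st_size); /* may block on metadata */
            close(fd);
        }
        return NULL;
    }

    int main(int argc, char **argv)
    {
        enum { NTHREADS = 4 };  /* arbitrary for the example */
        pthread_t tid[NTHREADS];
        struct job jobs[NTHREADS];
        int per = (argc - 1 + NTHREADS - 1) / NTHREADS;

        for (int t = 0; t < NTHREADS; t++) {
            int start = 1 + t * per;
            if (start > argc)
                start = argc;
            int n = argc - start;
            jobs[t].paths = &argv[start];
            jobs[t].n = n > per ? per : n;
            pthread_create(&tid[t], NULL, warm, &jobs[t]);
        }
        for (int t = 0; t < NTHREADS; t++)
            pthread_join(tid[t], NULL);
        return 0;
    }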

> > A quick skim of fs/{ext3,ext4}/dir.c finds a call to
> > page_cache_sync_readahead. Doesn't that do any reading ahead? :-)
>
> Unfortunately it does not help when it is synchronous. The process
> still sleeps until it has fetched the blocks it needs. I believe that
> code just ends up doing a single 4kb read if the directory is no larger
> than that, or if it is, then it reads up to readahead_size. It puts the
> request in the queue then sleeps until all the data has been read, even
> if only the first 4kb was required before readdir() could return.

So you're saying it _does_ readahead_size if needed. That's great!
Evgeniy's concern about sequentially reading blocks one by one
isn't anything to care about then. That's one problem solved. :-)

> This means that a single thread calling readdir() is still going to
> block reading the directory before it can move on to trying to read
> other directories that are also needed.

Of course.

> > If not, fs/ext4/namei.c:ext4_dir_inode_operations points to
> > ext4_fiemap. So you may have luck calling FIEMAP or FIBMAP on the
> > directory, and then reading blocks using the block device. I'm not
> > sure if the cache loaded via the block device (when mounted) will then
> > be used for directory lookups.
>
> Yes, I had considered that. ureadahead already makes use of ext2fslibs
> to open the block device and read the inode tables so they are already
> in the cache for later use. It seems a bit silly to do that though,
> when that is exactly what readahead() SHOULD do for you.

Don't bother with FIEMAP then. It sounds like all the preloadable
metadata is already loaded. FIEMAP would have still needed to be
threaded for parallel directories.

Filesystem-independent readahead() on directories is out of the
question (except by using a kernel background thread, which is
pointless because you can do that yourself.)

Some filesystems have directories which aren't stored like a file's
data, and the process of reading the directory needs to work through
its logic, and needs a sleepable context to work in. Generic page
reading won't work for all of them.

readahead() on directories in specific filesystem types may be possible.
It would have to be implemented in each fs.

-- Jamie
From: Jamie Lokier on
Phillip Susi wrote:
> On 4/21/2010 4:01 PM, Jamie Lokier wrote:
> > Ok, this discussion has got a bit confused. Text above refers to
> > needing to asynchronously read the next block in a directory, but if
> > directories are small then that's not important.
>
> It is very much important since if you read each small directory one
> block at a time, it is very slow. You want to queue up reads to all of
> them at once so they can be batched.

I don't understand what you are saying at this point. Or you don't
understand what I'm saying. Or I didn't understand what Evgeniy was
saying :-)

Small directories don't _have_ next blocks; this is not a problem for
them. And you've explained that filesystems of interest already fetch
readahead_size in larger directories, so they don't have the "next
block" problem either.

> > That was my first suggestion: threads with readdir(); I thought it had
> > been rejected hence the further discussion.
>
> Yes, it was sort of rejected, which is why I said it's just a workaround
> for now until readahead() works on directories. It will produce the
> desired IO pattern but at the expense of ram and cpu cycles creating a
> bunch of short lived threads that go to sleep almost immediately after
> being created, and exit when they wake up. readahead() would be much
> more efficient.

Some test results comparing AIO with kernel threads indicate that
threads are more efficient than you might expect for this. Especially
in the cold I/O cache cases. readahead() has to do a lot of the same
work, in a different way and with less opportunity to parallelise the
metadata stage.

clone() threads with tiny stacks (you can even preallocate the stacks,
and they can be smaller than a page) aren't especially slow or big,
and ideally you'll use *long-lived* threads with an efficient
multi-consumer queue that they pull requests from, written to by the
main program and kept full enough to avoid blocking the threads.

Also since you're discarding the getdirentries() data, you can read
all of it into the same memory for hot cache goodness. (One buffer per
CPU, please.)
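
As a rough sketch of that shape (using Linux getdents64 via syscall(2)
as the stand-in for getdirentries(), and a trivial mutex-guarded array
instead of a clever multi-consumer queue, purely to keep the example
short; compile with -pthread):

    /* Long-lived worker pool: threads pull directory paths from a
     * shared queue and read the entries into a reused per-thread
     * buffer, discarding the data; the only goal is to pull the
     * directory blocks into the page cache. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <pthread.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #define QMAX 1024

    static char *queue[QMAX];
    static int q_head, q_tail;
    static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;

    static char *pull(void)
    {
        char *path = NULL;
        pthread_mutex_lock(&q_lock);
        if (q_head < q_tail)
            path = queue[q_head++];
        pthread_mutex_unlock(&q_lock);
        return path;        /* NULL: queue drained, worker exits */
    }

    static void *worker(void *arg)
    {
        char buf[16384];    /* reused; contents are discarded */
        char *path;

        (void)arg;
        while ((path = pull()) != NULL) {
            int fd = open(path, O_RDONLY | O_DIRECTORY);
            if (fd < 0)
                continue;
            while (syscall(SYS_getdents64, fd, buf, sizeof buf) > 0)
                ;           /* just warming the cache */
            close(fd);
        }
        return NULL;
    }

    int main(int argc, char **argv)
    {
        pthread_t tid[4];

        for (int i = 1; i < argc && q_tail < QMAX; i++)
            queue[q_tail++] = argv[i];
        for (int t = 0; t < 4; t++)
            pthread_create(&tid[t], NULL, worker, NULL);
        for (int t = 0; t < 4; t++)
            pthread_join(tid[t], NULL);
        return 0;
    }

A real version would block on an empty queue and be fed continuously by
the main program, as described above, rather than exiting.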

I don't know what performance that'll get you, but I think it'll be
faster than you are expecting - *if* the directory locking is
sufficiently scalable at this point. That's an unknown.

Try it with files if you want to get a comparative picture.

-- Jamie
From: Jamie Lokier on
Evgeniy Polyakov wrote:
> On Wed, Apr 21, 2010 at 09:02:43PM +0100, Jamie Lokier (jamie(a)shareable.org) wrote:
> > FIEMAP might not be the answer, but what part of it requires fs
> > knowledge? It's supposed to be fs-independent. I agree it's not
> > always appropriate to use, and I don't know if it would be effective
> > anyway.
>
> At least we have to know whether a given fs supports such an interface.
> And more complex is knowing how the underlying fs is organized: what an
> extent is, which types it can have, and where exactly the extent
> metadata is stored, i.e. where can we find what this object is about?

Ummm... Does any of that matter?

> And how to actually populate the appropriate blocks into ram to speed
> up readdir()?

Blockdev readahead() :-)

> FIEMAP (which is file mapper btw :) is useful for information gathering
> about how the fs is organized, but that's all, I'm afraid.

That's all you need to start fetching from the blockdev. You can't
*use* the blockdev data directly, but that doesn't matter for this
readahead operation; it only matters that the blocks fetched are
approximately the right ones.
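
For concreteness, a sketch of that two-step fetch. Both the directory
FIEMAP support and the device path are assumptions here, and as noted
it is fs-dependent whether the blockdev page cache ends up shared with
directory lookups at all:

    /* FIEMAP-then-blockdev-readahead sketch: ask for the directory's
     * physical extents, then queue reads of those raw ranges on the
     * underlying block device to populate the cache. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <linux/fiemap.h>
    #include <linux/fs.h>
    #include <stdlib.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    #define NEXTENTS 32

    static int prefetch_dir_blocks(const char *dirpath, const char *devpath)
    {
        int ret = -1;
        int dirfd = open(dirpath, O_RDONLY | O_DIRECTORY);
        int devfd = open(devpath, O_RDONLY);
        struct fiemap *fm = calloc(1, sizeof(*fm) +
                                   NEXTENTS * sizeof(struct fiemap_extent));

        if (dirfd < 0 || devfd < 0 || !fm)
            goto out;

        fm->fm_length = FIEMAP_MAX_OFFSET;  /* map the whole directory */
        fm->fm_extent_count = NEXTENTS;

        if (ioctl(dirfd, FS_IOC_FIEMAP, fm) == 0) {
            for (unsigned i = 0; i < fm->fm_mapped_extents; i++) {
                struct fiemap_extent *fe = &fm->fm_extents[i];
                /* Queue the physical range; submission may still block. */
                readahead(devfd, fe->fe_physical, fe->fe_length);
            }
            ret = 0;
        }
    out:
        free(fm);
        if (dirfd >= 0)
            close(dirfd);
        if (devfd >= 0)
            close(devfd);
        return ret;
    }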

- Jamie
From: Phillip Susi on
On 4/21/2010 4:22 PM, Jamie Lokier wrote:
> Because tests have found that it's sometimes faster than AIO anyway!

Not when the aio is working properly ;)

This is getting a bit off topic, but aio_read() and readahead() have to
map the disk blocks before they can queue a read. In the case of ext2/3
this often requires reading an indirect block from the disk, so the
kernel has to wait for that read to finish before it can queue the rest
of the reads and return. With ext4 extents, usually all of the mapping
information is in the inode so all of the reads can be queued without
delay, and the kernel returns to user space immediately.

So older testing done on ext3 likely ran into this and led to the
conclusion that threading can be faster, but it would be preferable when
using ext4 with extents to drop the read requests in the queue without
the bother of setting up and tearing down threads, which is really just
a workaround for a shortcoming in aio_read and readahead() when using
indirect blocks. For that matter, aio_read and readahead() could
probably benefit from some reworking so that they return as soon as
they have queued the read of the indirect block, deferring the queueing
of the remaining reads until the indirect block comes in.

> ...for those things where AIO is supported at all. The problem with
> more complicated fs operations (like, say, buffered file reads and
> directory operations) is you can't just put a request in a queue.

Unfortunately there aren't async versions of the calls that perform
directory operations, but aio_read() performs a buffered file read
asynchronously just fine. Right now though I'm only concerned with
reading lots of data into the cache at boot time to speed things up.
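
For reference, a minimal buffered aio_read() looks like the sketch
below (link with -lrt on older glibc). One caveat worth noting: glibc's
POSIX AIO is itself implemented with user-level threads, which is a
different thing from the kernel-native io_submit() interface that the
O_DIRECT-only caveats usually refer to.

    /* Buffered asynchronous read via POSIX AIO: the read is queued,
     * the caller is free to do other work, and completion is waited
     * for explicitly. */
    #include <aio.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        static char buf[4096];
        int fd = open("/etc/fstab", O_RDONLY);  /* any ordinary file */

        if (fd < 0)
            return 1;

        struct aiocb cb = {
            .aio_fildes = fd,
            .aio_buf    = buf,
            .aio_nbytes = sizeof buf,
        };
        const struct aiocb *list[1] = { &cb };

        if (aio_read(&cb) != 0)                 /* queue the read */
            return 1;

        /* ... do other useful work here ... */

        aio_suspend(list, 1, NULL);             /* wait for completion */
        printf("read %zd bytes\n", aio_return(&cb));
        close(fd);
        return 0;
    }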

> Those things where putting a request on a queue works tend to move the
> sleepable metadata fetching to the code _before_ the request is queued
> to get around that. Which is one reason why Linux O_DIRECT AIO can
> still block when submitting a request... :-/

Yep, as I just described. Would be nice to fix this.

> The most promising direction for AIO at the moment is in fact spawning
> kernel threads on demand to do the work that needs a context, and
> swizzling some pointers so that it doesn't look like threads were used
> to userspace.

NO! This is how aio was implemented at first and it was terrible.
Context is only required because it is easier to write the code linearly
instead of as a state machine. It would be better for example, to have
readahead() register a callback function to be called when the read of
the indirect block completes, and the callback needs zero context to
queue reads of the data blocks referred to by the indirect block.
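
Purely as an illustration of that shape (a userspace toy with invented
names, not kernel code): each completion carries a continuation that
queues the next stage, and no thread sits blocked holding a stack
across the wait.

    #include <stdio.h>

    struct request;
    typedef void (*completion_fn)(struct request *);

    struct request {
        const char   *what;
        completion_fn done;     /* continuation run on completion */
    };

    /* One-slot "completion queue" standing in for the block layer
     * signalling that a read has finished. */
    static struct request *pending;

    static void submit(struct request *r)
    {
        printf("queue read: %s\n", r->what);
        pending = r;            /* would be truly asynchronous in reality */
    }

    static void data_done(struct request *r)
    {
        printf("data ready: %s\n", r->what);
    }

    static struct request data = { "data blocks", data_done };

    static void indirect_done(struct request *r)
    {
        /* Zero saved stack context: everything this stage needs is
         * carried in the request itself. */
        printf("metadata ready: %s\n", r->what);
        submit(&data);
    }

    static struct request indirect = { "indirect block", indirect_done };

    int main(void)
    {
        submit(&indirect);
        while (pending) {       /* stand-in for interrupt delivery */
            struct request *r = pending;
            pending = NULL;
            r->done(r);
        }
        return 0;
    }

The catch, as the reply below points out, is that every allocation,
queueing, and locking point needs the same treatment; this toy shows
only the happy path.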

> You might even find that calling readahead() on *files* goes a bit
> faster if you have several threads working in parallel calling it,
> because of the ability to parallelise metadata I/O.

Indeed... or you can use extents, or fix the implementation of
readahead() ;)

> So you're saying it _does_ readahead_size if needed. That's great!

I'm not sure; I'm just saying that if it does, it does not help much
since most directories fit in a single 4kb block anyhow. I need to get
a number of different directories read quickly.

> Filesystem-independent readahead() on directories is out of the
> question (except by using a kernel background thread, which is
> pointless because you can do that yourself.)

No need for a thread. readahead() does not need one for files;
reading the contents of a directory should be no different.

> Some filesystems have directories which aren't stored like a file's
> data, and the process of reading the directory needs to work through
> its logic, and needs a sleepable context to work in. Generic page
> reading won't work for all of them.

If the fs absolutely has to block, that's ok, since that is no different
from the way readahead() works on files, but most of the time it
shouldn't have to and should be able to throw the read in the queue and
return.
From: Jamie Lokier on
Phillip Susi wrote:
> > ...for those things where AIO is supported at all. The problem with
> > more complicated fs operations (like, say, buffered file reads and
> > directory operations) is you can't just put a request in a queue.
>
> Unfortunately there aren't async versions of the calls that perform
> directory operations, but aio_read() performs a buffered file read
> asynchronously just fine.

Why am I reading all over the place that Linux AIO only works with O_DIRECT?
Is it out of date? :-)

I admit I haven't even _tried_ buffered files with Linux AIO due to
the evil propaganda.

> > The most promising direction for AIO at the moment is in fact spawning
> > kernel threads on demand to do the work that needs a context, and
> > swizzling some pointers so that it doesn't look like threads were used
> > to userspace.
>
> NO! This is how aio was implemented at first and it was terrible.
> Context is only required because it is easier to write the code linearly
> instead of as a state machine. It would be better for example, to have
> readahead() register a callback function to be called when the read of
> the indirect block completes, and the callback needs zero context to
> queue reads of the data blocks referred to by the indirect block.

To read an indirect block, you have to allocate memory: another
callback after you've slept waiting for memory to be freed up.

Then you allocate a request: another callback while you wait for the
request queue to drain.

Then you submit the request: that's the callback you mentioned,
waiting for the result.

But then triple, double, single indirect blocks: each of the above
steps repeated.

In the case of writing, another group of steps for bitmap blocks,
inode updates, and heaven knows how fiddly it gets with ordered
updates to the journal, synchronised with other writes.

Plus every little mutex / rwlock is another place where you need those
callback functions. We don't even _have_ an async mutex facility in
the kernel. So every user of a mutex has to be changed to use
waitqueues or something. No more lockdep checking, no more RT
priority inheritance.

There are a _lot_ of places that can sleep on the way to a trivial
file I/O, and quite a lot of state to be passed along to the continuation
functions.

It's possible but by no means obvious that it's better.

I think people have mostly given up on that approach due to how much
it complicates all the filesystem code, and how much goodness there is
in being able to call things which can sleep, when you look at all the
different places. It seemed like a good idea for a while.

And it's not _that_ certain that it would be faster at high
loads after all the work.

A compromise where just a few synchronisation points are made async is
ok. But then it's a compromise... so you still need a multi-threaded
caller to keep the queues full in all situations.

> > Filesystem-independent readahead() on directories is out of the
> > question (except by using a kernel background thread, which is
> > pointless because you can do that yourself.)
>
> No need for a thread. readahead() does not need one for files;
> reading the contents of a directory should be no different.
>
> > Some filesystems have directories which aren't stored like a file's
> > data, and the process of reading the directory needs to work through
> > its logic, and needs a sleepable context to work in. Generic page
> > reading won't work for all of them.
>
> If the fs absolutely has to block, that's ok, since that is no different
> from the way readahead() works on files, but most of the time it
> shouldn't have to and should be able to throw the read in the queue and
> return.

For specific filesystems, you could do it. readahead() on directories
is not an unreasonable thing to add on.

Generically, it's not likely. It's not about blocking, it's about the
fact that directories don't always consist of data blocks on the store
organised similarly to a file. For example NFS, CIFS, or (I'm not
sure), maybe even reiserfs/btrfs?

-- Jamie
