From: Phillip Susi on
I have been trying to improve boot times with ureadahead and have
identified a period of time of almost zero IO throughput caused by calls
to open() blocking during name lookup while fetching directory blocks.
At first I thought this could simply be fixed by calling readahead() on
the directories first before open()ing all of the normal files for
readahead.

Unfortunately it seems that readahead() fails when called on a
directory. I was wondering if I could get some help understanding why
this is, and how it could be fixed.
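The failure mode is easy to see from userspace. A minimal Python sketch (filenames invented), showing that regular files accept a readahead-style hint while a directory's contents can't be read through its file descriptor at all, which is related to why readahead()'s byte range has nothing well-defined to apply to:

```python
import errno
import os
import tempfile

# Regular file: posix_fadvise(..., POSIX_FADV_WILLNEED) asks the kernel
# to start pulling the file into the page cache, much like readahead(2).
fd, path = tempfile.mkstemp()
os.write(fd, b"x" * 4096)
os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_WILLNEED)  # returns None on success
os.close(fd)
os.unlink(path)

# Directory: the fd is valid, but read() on it fails with EISDIR --
# directory contents are not byte-addressable the way file data is.
dfd = os.open(tempfile.gettempdir(), os.O_RDONLY)
try:
    os.read(dfd, 4096)
    dir_read_errno = 0
except OSError as e:
    dir_read_errno = e.errno
finally:
    os.close(dfd)
```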
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Jamie Lokier on
Phillip Susi wrote:
> I have been trying to improve boot times with ureadahead and have
> identified a period of time of almost zero IO throughput caused by calls
> to open() blocking during name lookup while fetching directory blocks.
> At first I thought this could simply be fixed by calling readahead() on
> the directories first before open()ing all of the normal files for
> readahead.
>
> Unfortunately it seems that readahead() fails when called on a
> directory. I was wondering if I could get some help understanding why
> this is, and how it could be fixed.

readahead() doesn't make much sense on a directory - the offset and
size aren't meaningful.

But does plain opendir/readdir/closedir solve the problem?
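In userspace that warm-up is just a tree walk; a hedged Python sketch (the demo layout is invented) of forcing every directory's blocks to be fetched, which is what an opendir/readdir/closedir pass amounts to:

```python
import os
import tempfile

def warm_directories(root):
    """Recursively list every directory under root so the kernel has to
    fetch its directory blocks -- the opendir/readdir equivalent."""
    count = 0
    for dirpath, dirnames, filenames in os.walk(root):
        # os.walk() calls scandir()/readdir() on each directory, which
        # is exactly the lookup traffic we want to front-load.
        count += len(dirnames) + len(filenames)
    return count

# Tiny demo tree, purely for illustration:
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "a", "b"))
open(os.path.join(root, "a", "f.txt"), "w").close()
entries = warm_directories(root)
```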

-- Jamie
From: Phillip Susi on
On 4/21/2010 6:06 PM, Jamie Lokier wrote:
> Why am I reading all over the place that Linux AIO only works with O_DIRECT?
> Is it out of date? :-)

Dunno, where did you read that? If you are using O_DIRECT then you
really should be using aio or you will suffer a pretty heavy performance
loss from all of the sleeping, but strictly speaking the two do not have
to be used together.

Personally I wish there was another flag besides O_DIRECT that split the
two semantics O_DIRECT now carries. Right now it FORCES the cache to be
bypassed and the IO to go to the disk, even if it's already in the
cache.  It would be nice if you could ask for a read to be done such
that if the data is already cached, it is copied from there; otherwise,
the read is sent directly down to the disk to DMA into my buffer.
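For what it's worth, later kernels grew roughly one half of this split: preadv2() with RWF_NOWAIT means "copy it only if it's already cached; fail with EAGAIN rather than touch the disk."  A hedged userspace sketch, assuming a kernel and Python build new enough to expose the flag:

```python
import os
import tempfile

payload = b"hello from the page cache"
fd, path = tempfile.mkstemp()
os.write(fd, payload)

got = None
if hasattr(os, "RWF_NOWAIT"):
    buf = bytearray(len(payload))
    try:
        # RWF_NOWAIT: succeed only if the data is already in the page
        # cache; otherwise fail with EAGAIN instead of blocking on disk.
        n = os.preadv(fd, [buf], 0, os.RWF_NOWAIT)
        got = bytes(buf[:n])
    except OSError:
        got = None  # not cached, or the fs doesn't support the flag

os.close(fd)
os.unlink(path)
```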

> To read an indirect block, you have to allocate memory: another
> callback after you've slept waiting for memory to be freed up.

You allocate the cache pages in the initial readahead() before
returning. No need to do it from the bio completion callback.

> Then you allocate a request: another callback while you wait for the
> request queue to drain.

Same thing. Get everything set up and ready to go in readahead() and
the only thing that has to wait on the indirect block to be read is
filling in the block addresses of the bios and submitting them. This
last part can be done in the bio completion callback.

As an added optimization, you only need to allocate one bio in
readahead() since it is likely that only one will be needed if all of
the blocks are sequential.  Then the callback can use the gfp_mask
flags to prevent allocations from sleeping, and if more cannot be
allocated, you submit what you've got; when THAT completes, you try to
build more requests.

> Plus every little mutex / rwlock is another place where you need those
> callback functions. We don't even _have_ an async mutex facility in
> the kernel. So every user of a mutex has to be changed to use
> waitqueues or something. No more lockdep checking, no more RT
> priority inheritance.

Yes, it looks like ext4_get_blocks() does use mutexes, so it can't be
called from bh context.  Perhaps it could be changed to avoid them
where possible and, where it can't, return -EWOULDBLOCK, leaving the
completion callback to punt to a work queue and retry.  In the common
case, though, it looks like ext4_get_blocks() could avoid the mutexes
entirely and just parse the newly read indirect block and return, after
which the completion callback can queue its bios and be done.
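The punt-to-a-work-queue pattern above looks roughly like this (pure illustration in Python; get_blocks and the EWOULDBLOCK sentinel are made-up stand-ins for the kernel code, not its actual API):

```python
import queue
import threading

EWOULDBLOCK = object()  # stand-in for the -EWOULDBLOCK return code

class FakeGetBlocks:
    """First call 'fails' because it would need to sleep on a mutex;
    the retry from process context succeeds."""
    def __init__(self):
        self.calls = 0
    def __call__(self):
        self.calls += 1
        return EWOULDBLOCK if self.calls == 1 else [101, 102, 103]

get_blocks = FakeGetBlocks()
work = queue.Queue()
results = []

def completion_callback():
    # "bh context": must not sleep, so on EWOULDBLOCK we defer.
    r = get_blocks()
    if r is EWOULDBLOCK:
        work.put(completion_callback)  # punt to the work queue
    else:
        results.append(r)

def worker():
    while True:
        fn = work.get()
        if fn is None:
            break
        fn()  # process context: sleeping is allowed here

t = threading.Thread(target=worker)
t.start()
completion_callback()   # first attempt punts
work.put(None)
t.join()
```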

> A compromise where just a few synchronisation points are made async is
> ok. But then it's a compromise... so you still need a multi-threaded
> caller to keep the queues full in all situations.

Right, which tends to negate most of the gains of having any async at
all. For example, if we have multiple threads calling readahead()
instead of just one since it may sleep reading an indirect block, then
we can end up with this:

Thread 1 queues reads of the first 12 blocks of the first file, and the
indirect block. Thread then sleeps waiting for the indirect block.

Thread 2 queues reads of the first 12 blocks of the second file and its
indirect block. Thread then sleeps waiting for the indirect block.

Now we have the disk read 12 contiguous blocks of data + indirect from
the first file, then 12 contiguous blocks of data + indirect from the
second file, which are further down the disk, so the head has to seek
forward. Then thread 1 wakes up and parses the indirect block and
queues reading of the subsequent sectors, which now requires a backwards
seek since we skipped reading those sectors to move ahead to the second
file.

So in our attempt to use threads to keep the queue full, we have
introduced more seeking, which tends to carry a higher penalty than
just using a single thread and letting the queue drain and the disk sit
idle for a few microseconds while we wake up and parse the indirect
block.
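The extra seeking is easy to put numbers on.  A toy model (block addresses invented) comparing total head travel for the single-threaded order against the two-thread interleaving described above:

```python
# Invented block addresses: file1's first 12 data blocks + indirect,
# file1's remaining blocks, and file2's blocks further down the disk.
file1_first = list(range(100, 113))   # 12 data blocks + indirect block
file1_rest  = list(range(113, 133))   # blocks found via the indirect
file2_first = list(range(500, 513))

def head_travel(order):
    """Total absolute seek distance for reading blocks in this order."""
    return sum(abs(b - a) for a, b in zip(order, order[1:]))

# Single thread: finish file1 before moving on -- one forward seek only.
single = head_travel(file1_first + file1_rest + file2_first)

# Two threads: file2's blocks get queued while thread 1 sleeps on the
# indirect block, forcing a backward seek to pick up file1_rest.
threaded = head_travel(file1_first + file2_first + file1_rest)
```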

Come to think of it, I guess that is a good argument NOT to make
readahead() fully async.

> Generically is not likely. It's not about blocking, it's about the
> fact that directories don't always consist of data blocks on the store
> organised similarly to a file. For example NFS, CIFS, or (I'm not
> sure), maybe even reiserfs/btrfs?

The contents are stored in blocks somewhere. It doesn't really matter
how or where as long as the fs figures out what it will need to resolve
names in that directory and reads that into the cache.
From: Phillip Susi on
On 4/22/2010 1:53 PM, Jamie Lokier wrote:
> Right, but finding those blocks is highly filesystem-dependent which
> is why making it a generic feature would need support in each filesystem.

It already exists; it's called ->get_blocks().  That's how readahead()
figures out which blocks need to be read.

> support FIEMAP on directories should work. We're back to why not do
> it yourself then, as very few programs need directory readahead.

Because there's already a system call to accomplish that exact task; why
reinvent the wheel?

> If you're interested, try finding all the places which could sleep for
> a write() call... Note that POSIX requires a mutex for write; you
> can't easily change that. Reading is easier to make fully async than
> writing.

POSIX doesn't say anything about how write() must be implemented
internally. You can do without mutexes just fine. A good deal of the
current code does use mutexes, but does not have to. If your data is
organized well then the critical sections of code that modify it can be
kept very small, and guarded with either atomic access functions or a
spin lock.  A mutex is more convenient since it allows you to have
much larger critical sections and sleep, but we don't really like having
coarse grained locking in the kernel.

> Then readahead() isn't async, which was your request... It can block
> waiting for memory and other things when you call it.

It doesn't have to block; it can return -ENOMEM or -EWOULDBLOCK.

> Exactly.  And making it so it _never_ blocks when called is a ton of
> work, more lines of code (in C anyway), a maintainability nightmare,
> and adds some different bottlenecks you've not thought of.  At this
> point I suggest you look up the 2007 discussions about fibrils, which
> are quite good: they cover the overheads of setting up state for async
> calls when unnecessary, and the beautiful simplicity of treating stack
> frames as states in their own right.

Sounds like an interesting compromise. I'll look it up.

> No: In that particular case, waiting while the indirect block is
> parsed is advantageous. But suppose the first indirect block is
> located close to the second file's data blocks. Or the second file's
> data blocks are on a different MD backing disk. Or the disk has
> different seeking characteristics (flash, DRBD).

Hrm... true, so knowing this, defrag could lay out the indirect block of
the first file after the first 12 blocks of the second file to maintain
optimal reading. Hrm... I might have to try that.

> I reckon the same applies to your readahead() calls: A queue which you
> make sure is always full enough that threads never block, sorted by
> inode number or better hints where known, with a small number of
> threads calling readahead() for files, and doing whatever is useful
> for directories.

Yes, and ureadahead already orders the calls to readahead() based on
disk block order.  Multithreading it leads to the problem with backward
seeks right now, but a tweak to the way defrag lays out the indirect
blocks should fix that.  The more I think about it, the better this
idea sounds.
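A hedged sketch of that ordering step, sorting by inode number as a cheap stand-in for real block addresses (FIBMAP/FIEMAP need privileges, so this is illustration only):

```python
import os
import tempfile

def readahead_in_inode_order(paths):
    """Issue WILLNEED hints in inode order -- an approximation of disk
    block order on filesystems that allocate inodes near their data."""
    ordered = sorted(paths, key=lambda p: os.stat(p).st_ino)
    for p in ordered:
        fd = os.open(p, os.O_RDONLY)
        try:
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_WILLNEED)
        finally:
            os.close(fd)
    return ordered

# Demo files, purely for illustration:
root = tempfile.mkdtemp()
paths = []
for name in ("c", "a", "b"):
    p = os.path.join(root, name)
    with open(p, "w") as f:
        f.write("data")
    paths.append(p)
ordered = readahead_in_inode_order(paths)
inos = [os.stat(p).st_ino for p in ordered]
```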

From: Jamie Lokier on
Phillip Susi wrote:
> > If you're interested, try finding all the places which could sleep for
> > a write() call... Note that POSIX requires a mutex for write; you
> > can't easily change that. Reading is easier to make fully async than
> > writing.
>
> POSIX doesn't say anything about how write() must be implemented
> internally. You can do without mutexes just fine.

POSIX requires that concurrent, overlapping writes not interleave the
data (at least, I have read that numerous times), and this is usually
implemented with a mutex even though there are other ways.

Many implementations relax this for O_DIRECT, because it's non-POSIX
and concurrent writes are good.
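That non-interleaving guarantee is observable from userspace; a sketch with two threads appending via O_APPEND, where (on Linux regular files) each write() of a small buffer lands whole:

```python
import os
import tempfile
import threading

fd, path = tempfile.mkstemp()
os.close(fd)

def appender(tag, n=200):
    # Each thread has its own O_APPEND descriptor; every write() below
    # is one syscall, and the kernel serializes them on the inode.
    afd = os.open(path, os.O_WRONLY | os.O_APPEND)
    line = (tag * 8 + "\n").encode()
    for _ in range(n):
        os.write(afd, line)
    os.close(afd)

threads = [threading.Thread(target=appender, args=(t,)) for t in "AB"]
for t in threads:
    t.start()
for t in threads:
    t.join()

with open(path, "rb") as f:
    lines = f.read().splitlines()
# Every line is intact -- no torn/interleaved writes.
intact = all(l in (b"A" * 8, b"B" * 8) for l in lines)
os.unlink(path)
```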

> A good deal of the current code does use mutexes, but does not have
> to. If your data is organized well then the critical sections of
> code that modify it can be kept very small, and guarded with either
> atomic access functions or a spin lock. A mutex is more convenient
> since it allows you to have much larger critical sections and
> sleep, but we don't really like having coarse grained locking in the
> kernel.

The trickier stuff in proper AIO is sleeping waiting for memory to be
freed up; sleeping, repeatedly, waiting for a rate-limited request
queue entry prior to each of the triple, double, and single indirect
blocks, which you then sleep waiting to complete; sleeping waiting for
an atime update journal node; sleeping on requests and I/O at every
step through b-trees; and so on.  That's just reads; writing adds just
as much again.  Changing those to async callbacks in every
filesystem - it's not worth it and it'd be a maintainability
nightmare.  We're talking about changes to the kernel
memory allocator, among other things.  You can't gfp_mask it away -
except for readahead(), because it's an abortable hint.

Oh, and fine-grained locking makes the async transformation harder,
not easier :-)

> > Then readahead() isn't async, which was your request... It can block
> > waiting for memory and other things when you call it.
>
> It doesn't have to block; it can return -ENOMEM or -EWOULDBLOCK.

For readahead yes because it's just an abortable hint.
For general AIO, no.

> > No: In that particular case, waiting while the indirect block is
> > parsed is advantageous. But suppose the first indirect block is
> > located close to the second file's data blocks. Or the second file's
> > data blocks are on a different MD backing disk. Or the disk has
> > different seeking characteristics (flash, DRBD).
>
> Hrm... true, so knowing this, defrag could lay out the indirect block of
> the first file after the first 12 blocks of the second file to maintain
> optimal reading. Hrm... I might have to try that.
>
> > I reckon the same applies to your readahead() calls: A queue which you
> > make sure is always full enough that threads never block, sorted by
> > inode number or better hints where known, with a small number of
> > threads calling readahead() for files, and doing whatever is useful
> > for directories.
>
> Yes, and ureadahead already orders the calls to readahead() based on
> disk block order.  Multithreading it leads to the problem with backward
> seeks right now, but a tweak to the way defrag lays out the indirect
> blocks should fix that.  The more I think about it, the better this
> idea sounds.

Ah, you didn't mention defragging for optimising readahead before.

In that case, just trace the I/O done a few times and order your
defrag to match the trace; that should handle consistent patterns
without special defrag rules.  I'm surprised it doesn't already.
Does ureadahead not use prior I/O traces for guidance?

Also, having defragged readahead files into a few compact zones, and
gotten the last boot's I/O trace, why not readahead those areas of the
blockdev first in perfect order, before finishing the job with
filesystem operations? The redundancy from no-longer needed blocks is
probably small compared with the gain from perfect order in few big
zones, and if you store the I/O trace of the filesystem stage every
time to use for the block stage next time, the redundancy should stay low.
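A sketch of the block-stage idea: take last boot's (offset, length) trace, merge it into a few big ascending zones, and read those front to back.  The trace values and the gap threshold here are invented for illustration:

```python
def merge_zones(trace, gap=8):
    """Collapse a block-level I/O trace into a few big ascending zones,
    merging extents separated by fewer than `gap` blocks."""
    zones = []
    for start, length in sorted(trace):
        if zones and start <= zones[-1][1] + gap:
            # Extends (or nearly touches) the previous zone: grow it.
            zones[-1][1] = max(zones[-1][1], start + length)
        else:
            zones.append([start, start + length])
    return [(s, e - s) for s, e in zones]

# Invented trace from a previous boot: out of order and fragmented.
trace = [(500, 16), (100, 12), (116, 4), (130, 8), (1000, 64)]
zones = merge_zones(trace)
# Reading `zones` front to back gives ascending order on disk, with the
# small redundancy of any no-longer-needed blocks inside the zones.
```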

Just a few ideas.

-- Jamie