From: Phillip Susi on
On 4/21/2010 12:12 PM, Jamie Lokier wrote:
> Asynchronous is available: Use clone or pthreads.

Synchronous in another process is not the same as async. It seems I'm
going to have to do this for now as a workaround, but one of the reasons
that aio was created was to avoid the inefficiencies this introduces.
Why create a new thread context, switch to it, put a request in the
queue, then sleep, when you could just drop the request in the queue in
the original thread and move on?

> A quick skim of fs/{ext3,ext4}/dir.c finds a call to
> page_cache_sync_readahead. Doesn't that do any reading ahead? :-)

Unfortunately it does not help when it is synchronous. The process
still sleeps until it has fetched the blocks it needs. I believe that
code just ends up doing a single 4kb read if the directory is no larger
than that, or if it is, then it reads up to readahead_size. It puts the
request in the queue then sleeps until all the data has been read, even
if only the first 4kb was required before readdir() could return.

This means that a single thread calling readdir() is still going to
block reading the directory before it can move on to trying to read
other directories that are also needed.

> I/O is the probably the biggest cost, so it's more important to get
> the I/O pattern you want than worrying about return values you'll discard.

True, but it would be nice not to waste cpu cycles copying unneeded data
around.

> If not, fs/ext4/namei.c:ext4_dir_inode_operations points to
> ext4_fiemap. So you may have luck calling FIEMAP or FIBMAP on the
> directory, and then reading blocks using the block device. I'm not
> sure if the cache loaded via the block device (when mounted) will then
> be used for directory lookups.

Yes, I had considered that. ureadahead already makes use of ext2fslibs
to open the block device and read the inode tables so they are already
in the cache for later use. It seems a bit silly to do that though,
when that is exactly what readahead() SHOULD do for you.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Phillip Susi on
On 4/21/2010 2:51 PM, Jamie Lokier wrote:
> Fwiw, I found sorting directories by inode and reading them in that
> order help to reduce seeks, some 10 years ago. I implemented
> something like 'find' which works like that, keeping a queue of
> directories to read and things to open/stat, ordered by inode number
> seen in d_ino before open/stat and st_ino after. However it did not
> try to readahead the blocks inside a directory, or sort operations by
> block number. It reduced some 'find'-like operations to about a
> quarter of the time on cold cache. I still use that program sometimes
> before "git status" ;-) Google "treescan" and "lokier" if you're
> interested in trying it (though I use 0.7 which isn't published).

That helps with open()ing or stat()ing the files since you access the
inodes in order, but ureadahead already preloads all of the inode tables
so this won't help.

>> it is not about readdir(). Plain read() is synchronous too. But
>> filesystem can respond to readahead calls and read next block to current
>> one, while it won't do this for next direntry.
>
> I'm surprised it makes much difference, as directories are usually not
> very large anyway.

That's just it; it doesn't help. That's why I want to readahead() all
of the directories at once instead of reading them one block at a time.

> But if it does, go on, try FIEMAP and blockdev reading, you know you
> want to :-)

Why reinvent the wheel when that's readahead()'s job? As a workaround
I'm about to try just threading all of the calls to open(). Each one
will queue a read and block, but with them all doing so at once should
fill the queue with plenty of reads. It is inefficient, but better than
one block at a time.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Jamie Lokier on
Phillip Susi wrote:
> On 4/21/2010 2:51 PM, Jamie Lokier wrote:
> > Fwiw, I found sorting directories by inode and reading them in that
> > order help to reduce seeks, some 10 years ago. I implemented
> > something like 'find' which works like that, keeping a queue of
> > directories to read and things to open/stat, ordered by inode number
> > seen in d_ino before open/stat and st_ino after. However it did not
> > try to readahead the blocks inside a directory, or sort operations by
> > block number. It reduced some 'find'-like operations to about a
> > quarter of the time on cold cache. I still use that program sometimes
> > before "git status" ;-) Google "treescan" and "lokier" if you're
> > interested in trying it (though I use 0.7 which isn't published).
>
> That helps with open()ing or stat()ing the files since you access the
> inodes in order, but ureadahead already preloads all of the inode tables
> so this won't help.

It helps a little with data access too, because of block group
locality tending to follow inode numbers. Don't read inodes and data
in the same batch though.

> >> it is not about readdir(). Plain read() is synchronous too. But
> >> filesystem can respond to readahead calls and read next block to current
> >> one, while it won't do this for next direntry.
> >
> > I'm surprised it makes much difference, as directories are usually not
> > very large anyway.
>
> That's just it; it doesn't help. That's why I want to readahead() all
> of the directories at once instead of reading them one block at a time.

Ok, this discussion has got a bit confused. Text above refers to
needing to asynchronously read next block in a directory, but if they
are small then that's not important.

> > But if it does, go on, try FIEMAP and blockdev reading, you know you
> > want to :-)
>
> Why reinvent the wheel when that's readahead()'s job? As a workaround
> I'm about to try just threading all of the calls to open().

FIEMAP suggestion is only if you think you need to issue reads for
multiple blocks in the _same_ directory in parallel. From what you say,
I doubt that's important.

FIEMAP is not relevant for reading different directories in parallel.
You'd still have to thread the FIEMAP calls for that - it's a
different problem.

> Each one will queue a read and block, but with them all doing so at
> once should fill the queue with plenty of reads. It is inefficient,
> but better than one block at a time.

That was my first suggestion: threads with readdir(); I thought it had
been rejected hence the further discussion.

(Actually I would use clone + open + getdirentries + tiny userspace
stack to avoid using tons of memory. But that's just a tweak, only to
be used if the threading is effective.)

-- Jamie

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Phillip Susi on
On 4/21/2010 4:01 PM, Jamie Lokier wrote:
> Ok, this discussion has got a bit confused. Text above refers to
> needing to asynchronously read next block in a directory, but if they
> are small then that's not important.

It is very much important since if you ready each small directory one
block at a time, it is very slow. You want to queue up reads to all of
them at once so they can be batched.

> FIEMAP suggestion is only if you think you need to issue reads for
> multiple blocks in the _same_ directory in parallel. From what you say,
> I doubt that's important.

That may be why you suggested it, but it is also exactly what
readahead() does. It also queues the read asynchronously which is what
I really want so that I can queue more reads on other directories in one
big batch.

> That was my first suggestion: threads with readdir(); I thought it had
> been rejected hence the further discussion.

Yes, it was sort of rejected, which is why I said it's just a workaround
for now until readahead() works on directories. It will produce the
desired IO pattern but at the expense of ram and cpu cycles creating a
bunch of short lived threads that go to sleep almost immediately after
being created, and exit when they wake up. readahead() would be much
more efficient.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Evgeniy Polyakov on
On Wed, Apr 21, 2010 at 09:02:43PM +0100, Jamie Lokier (jamie(a)shareable.org) wrote:
> FIEMAP might not be the answer, but what part of it requires fs
> knowledge? It's supposed to be fs-independent. I agree it's not
> always appropriate to use, and I don't know if it would be effective
> anyway.

At least we have to know whether given fs supports such interface.
And more complex is to know how underlying fs is organized. What is
extent, which types can it have, where exactly information about extent
metadata is stored, i.e. where can we find what this object is about?
And how to actually populate appropriate blocks into ram to speedup readdir()?

FIEMAP (which is file mapper btw :) is useful for information gathering
about how fs is organized, but that's all I'm afraid.

--
Evgeniy Polyakov
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/