From: Frank Mayhar on
On Thu, 2010-07-01 at 17:54 +1000, Nick Piggin wrote:
> On Thu, Jul 01, 2010 at 12:36:18PM +1000, Dave Chinner wrote:
> > Seriously: use a new lock for high level inode operations you are
> > optimising - don't repurpose an existing lock with different usage
> > rules just because it's convenient.
>
> That's what scalability development is all about, I'm afraid. Just
> adding more and more locks is what makes things more complex, so
> you have to juggle around or change locks when possible. If a
> difficulty with locking pops up in the future, I'd prefer to look at
> it then.

I must agree strongly with Nick here. Lock proliferation is a Bad
Thing; adding a new lock here just adds complexity without really
improving anything, since you still have to deal with lock ordering
_and_ add a new one to the mix. It also makes struct inode bigger for
no real gain. Since i_lock is already well understood and its use is
pretty isolated, it seems ideal to extend it to more general use while
keeping the isolation intact as much as possible. Hell, if the code
had been designed with this kind of scalability in mind, i_lock would
be doing all the locking directly related to struct inode, much as it
does after Nick's work.

My argument here is that it's not just convenient; it makes the
implementation simpler, since instead of dealing with two locks there's
just one.
--
Frank Mayhar <fmayhar(a)google.com>
Google, Inc.

From: Andrew Morton on
On Wed, 30 Jun 2010 22:05:02 +1000 Nick Piggin <npiggin(a)suse.de> wrote:

> On Wed, Jun 30, 2010 at 05:27:02PM +1000, Dave Chinner wrote:
> > On Thu, Jun 24, 2010 at 01:02:41PM +1000, npiggin(a)suse.de wrote:
> > > Protect inode->i_count with i_lock, rather than having it atomic.
> > > Next step should also be to move things together (eg. the refcount increment
> > > into d_instantiate, which will remove a lock/unlock cycle on i_lock).
> > .....
> > > Index: linux-2.6/fs/inode.c
> > > ===================================================================
> > > --- linux-2.6.orig/fs/inode.c
> > > +++ linux-2.6/fs/inode.c
> > > @@ -33,14 +33,13 @@
> > > * inode_hash_lock protects:
> > > * inode hash table, i_hash
> > > * inode->i_lock protects:
> > > - * i_state
> > > + * i_state, i_count
> > > *
> > > * Ordering:
> > > * inode_lock
> > > * sb_inode_list_lock
> > > * inode->i_lock
> > > - * inode_lock
> > > - * inode_hash_lock
> > > + * inode_hash_lock
> > > */
> >
> > I thought that the rule governing the use of inode->i_lock was that
> > it can be used anywhere as long as it is the innermost lock.
> >
> > Hmmm, no references in the code or documentation. Google gives a
> > pretty good reference:
> >
> > http://www.mail-archive.com/linux-ext4(a)vger.kernel.org/msg02584.html
> >
> > Perhaps a different/new lock needs to be used here?
>
> Well I just changed the order (and documented it to boot :)). It's
> pretty easy to verify that LOR is no problem. inode hash is only
> taken in a very few places so other code outside inode.c is fine to
> use i_lock as an innermost lock.

um, nesting a kernel-wide singleton lock inside a per-inode lock is
plain nutty.

From: Nick Piggin on
On Fri, Jul 02, 2010 at 07:03:55PM -0700, Andrew Morton wrote:
> On Wed, 30 Jun 2010 22:05:02 +1000 Nick Piggin <npiggin(a)suse.de> wrote:
>
> > On Wed, Jun 30, 2010 at 05:27:02PM +1000, Dave Chinner wrote:
> > > On Thu, Jun 24, 2010 at 01:02:41PM +1000, npiggin(a)suse.de wrote:
> > > > Protect inode->i_count with i_lock, rather than having it atomic.
> > > > Next step should also be to move things together (eg. the refcount increment
> > > > into d_instantiate, which will remove a lock/unlock cycle on i_lock).
> > > .....
> > > > Index: linux-2.6/fs/inode.c
> > > > ===================================================================
> > > > --- linux-2.6.orig/fs/inode.c
> > > > +++ linux-2.6/fs/inode.c
> > > > @@ -33,14 +33,13 @@
> > > > * inode_hash_lock protects:
> > > > * inode hash table, i_hash
> > > > * inode->i_lock protects:
> > > > - * i_state
> > > > + * i_state, i_count
> > > > *
> > > > * Ordering:
> > > > * inode_lock
> > > > * sb_inode_list_lock
> > > > * inode->i_lock
> > > > - * inode_lock
> > > > - * inode_hash_lock
> > > > + * inode_hash_lock
> > > > */
> > >
> > > I thought that the rule governing the use of inode->i_lock was that
> > > it can be used anywhere as long as it is the innermost lock.
> > >
> > > Hmmm, no references in the code or documentation. Google gives a
> > > pretty good reference:
> > >
> > > http://www.mail-archive.com/linux-ext4(a)vger.kernel.org/msg02584.html
> > >
> > > Perhaps a different/new lock needs to be used here?
> >
> > Well I just changed the order (and documented it to boot :)). It's
> > pretty easy to verify that LOR is no problem. inode hash is only
> > taken in a very few places so other code outside inode.c is fine to
> > use i_lock as an innermost lock.
>
> um, nesting a kernel-wide singleton lock inside a per-inode lock is
> plain nutty.

I think it worked out OK. Because the kernel-wide locks are really
restricted in where they are to be used (ie. not in filesystems). So
they're really pretty constrained to the inode management subsystem.
So filesystems still get to really use i_lock as an innermost lock
for their purposes.

And filesystems get to take i_lock and stop all manipulation of the inode
now. No changing of flags, moving it on/off lists etc behind its back.
It really is about locking the data rather than the code.
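
For a filesystem, grabbing a reference then looks something like this
(a sketch of the intended usage once i_count lives under i_lock, not
code lifted from the patchset):

	spin_lock(&inode->i_lock);
	WARN_ON(inode->i_state & (I_FREEING | I_WILL_FREE));
	inode->i_count++;		/* plain int now, not an atomic_t */
	spin_unlock(&inode->i_lock);

and nothing can change i_state, i_count or the inode's list membership
behind its back while i_lock is held.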

The final ordering outcome looks like this:

* inode->i_lock
* inode_list_lglock
* zone->inode_lru_lock
* wb->b_lock
* inode_hash_bucket lock

And it works like that because when you want to add or remove an inode
from various data structures, you take the i_lock and then take each
of these locks in turn, inside it. The alternative is to build a bigger
lock ordering graph, and take all the locks up-front before taking
i_lock. I did actually try that and it ended up being worse, so I went
this route.
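
Roughly (a sketch with a made-up name for the bucket structure, not the
patch itself), inserting an inode into the hash then becomes:

	spin_lock(&inode->i_lock);
	spin_lock(&bucket->lock);	/* inode_hash_bucket lock, nested inside i_lock */
	hlist_add_head(&inode->i_hash, &bucket->head);
	spin_unlock(&bucket->lock);
	spin_unlock(&inode->i_lock);

and removal is the mirror image, so each inner lock only covers the
list manipulation itself.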

I think taking a global lock in mark_inode_dirty is nutty (especially
when that global lock is shared with hash management, LRU scanning,
writeback, i_flags access... :) It's just a question of which is less
nutty.

From: Andrew Morton on
On Sat, 3 Jul 2010 13:41:23 +1000 Nick Piggin <npiggin(a)suse.de> wrote:

> On Fri, Jul 02, 2010 at 07:03:55PM -0700, Andrew Morton wrote:
> > On Wed, 30 Jun 2010 22:05:02 +1000 Nick Piggin <npiggin(a)suse.de> wrote:
> >
> > > On Wed, Jun 30, 2010 at 05:27:02PM +1000, Dave Chinner wrote:
> > > > On Thu, Jun 24, 2010 at 01:02:41PM +1000, npiggin(a)suse.de wrote:
> > > > > Protect inode->i_count with i_lock, rather than having it atomic.
> > > > > Next step should also be to move things together (eg. the refcount increment
> > > > > into d_instantiate, which will remove a lock/unlock cycle on i_lock).
> > > > .....
> > > > > Index: linux-2.6/fs/inode.c
> > > > > ===================================================================
> > > > > --- linux-2.6.orig/fs/inode.c
> > > > > +++ linux-2.6/fs/inode.c
> > > > > @@ -33,14 +33,13 @@
> > > > > * inode_hash_lock protects:
> > > > > * inode hash table, i_hash
> > > > > * inode->i_lock protects:
> > > > > - * i_state
> > > > > + * i_state, i_count
> > > > > *
> > > > > * Ordering:
> > > > > * inode_lock
> > > > > * sb_inode_list_lock
> > > > > * inode->i_lock
> > > > > - * inode_lock
> > > > > - * inode_hash_lock
> > > > > + * inode_hash_lock
> > > > > */
> > > >
> > > > I thought that the rule governing the use of inode->i_lock was that
> > > > it can be used anywhere as long as it is the innermost lock.
> > > >
> > > > Hmmm, no references in the code or documentation. Google gives a
> > > > pretty good reference:
> > > >
> > > > http://www.mail-archive.com/linux-ext4(a)vger.kernel.org/msg02584.html
> > > >
> > > > Perhaps a different/new lock needs to be used here?
> > >
> > > Well I just changed the order (and documented it to boot :)). It's
> > > pretty easy to verify that LOR is no problem. inode hash is only
> > > taken in a very few places so other code outside inode.c is fine to
> > > use i_lock as an innermost lock.
> >
> > um, nesting a kernel-wide singleton lock inside a per-inode lock is
> > plain nutty.
>
> I think it worked out OK. Because the kernel-wide locks are really
> restricted in where they are to be used (ie. not in filesystems). So
> they're really pretty constrained to the inode management subsystem.
> > So filesystems still get to really use i_lock as an innermost lock
> for their purposes.
>
> > And filesystems get to take i_lock and stop all manipulation of the inode
> now. No changing of flags, moving it on/off lists etc behind its back.
> It really is about locking the data rather than the code.
>
> The final ordering outcome looks like this:
>
> * inode->i_lock
> * inode_list_lglock
> * zone->inode_lru_lock
> * wb->b_lock
> * inode_hash_bucket lock

Apart from the conceptual vandalism, it means that any contention times
and cache transfer times on those singleton locks will increase
worst-case hold times of our nice, fine-grained i_lock.

> And it works like that because when you want to add or remove an inode
> from various data structures, you take the i_lock

Well that would be wrong. i_lock protects things *within* its inode.
It's nonsensical to take i_lock when the inode is being added to or
removed from external containers because i_lock doesn't protect those
containers!

> and then take each
> of these locks in turn, inside it. The alternative is to build a bigger
> lock ordering graph, and take all the locks up-front before taking
> i_lock. I did actually try that and it ended up being worse, so I went
> this route.
>
> I think taking a global lock in mark_inode_dirty is nutty (especially
> when that global lock is shared with hash management, LRU scanning,
> writeback, i_flags access... :) It's just a question of which is less
> nutty.

Yes, inode_lock is bad news.
From: Nick Piggin on
On Fri, Jul 02, 2010 at 09:31:49PM -0700, Andrew Morton wrote:
> On Sat, 3 Jul 2010 13:41:23 +1000 Nick Piggin <npiggin(a)suse.de> wrote:
> > On Fri, Jul 02, 2010 at 07:03:55PM -0700, Andrew Morton wrote:
> > > um, nesting a kernel-wide singleton lock inside a per-inode lock is
> > > plain nutty.
> >
> > I think it worked out OK. Because the kernel-wide locks are really
> > restricted in where they are to be used (ie. not in filesystems). So
> > they're really pretty constrained to the inode management subsystem.
> > > So filesystems still get to really use i_lock as an innermost lock
> > for their purposes.
> >
> > > And filesystems get to take i_lock and stop all manipulation of the inode
> > now. No changing of flags, moving it on/off lists etc behind its back.
> > It really is about locking the data rather than the code.
> >
> > The final ordering outcome looks like this:
> >
> > * inode->i_lock
> > * inode_list_lglock
> > * zone->inode_lru_lock
> > * wb->b_lock
> > * inode_hash_bucket lock
>
> Apart from the conceptual vandalism, it means that any contention times
> and cache transfer times on those singleton locks will increase
> worst-case hold times of our nice, fine-grained i_lock.

Yes, you are quite right about contention times. I'll answer in
two parts. First, why I don't think it should prove to be a big
problem; second, what we can do to improve it if it does.

So a lot of things that previously required the much worse inode_lock
can now use i_lock (or the other fine-grained locks above). Also, the
contention only comes into play if we actually happen to hit the same
i_lock at the same time, so the throughput-oriented mainline should
generally be OK, and -rt has priority inheritance which improves that
situation.

IF this proves to be a problem -- I doubt it will; in fact I think the
worst-case contention experienced by filesystems and the vfs will go
down significantly -- then we can easily fix it up.

All of the above data structures can be traversed using RCU (the hash
list already is), which would mean those locks are really only taken
to put an inode on or off a list.
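
As a purely illustrative sketch (not code from the patchset, and it
glosses over the I_FREEING/I_WILL_FREE checks a real lookup needs), an
RCU hash walk would look something like:

	static struct inode *ifind_rcu(struct super_block *sb,
				       struct hlist_head *head,
				       unsigned long ino)
	{
		struct inode *inode;
		struct hlist_node *node;

		rcu_read_lock();
		hlist_for_each_entry_rcu(inode, node, head, i_hash) {
			if (inode->i_ino != ino || inode->i_sb != sb)
				continue;
			spin_lock(&inode->i_lock);	/* only per-inode lock taken */
			inode->i_count++;
			spin_unlock(&inode->i_lock);
			rcu_read_unlock();
			return inode;
		}
		rcu_read_unlock();
		return NULL;
	}

so the hash bucket lock drops out of the read side entirely.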

The other thing that can be trivially done is to introduce different
locks. I don't have a problem with that, but as with any other data
structure, I would just like to wait and see where we actually have
problems. The same argument applies to a lot of places where we use a
single lock to protect different properties of an object. We use
page_lock to protect page membership on or off the LRU and pagecache
lists, for example, as well as various state transitions (eg. to
writeback).

In summary, if there is a lock hold problem, it is easy to use RCU to
reduce lock widths, or introduce a new per-inode lock to protect
different parts of the inode structure.


> > And it works like that because when you want to add or remove an inode
> > from various data structures, you take the i_lock
>
> Well that would be wrong. i_lock protects things *within* its inode.
> It's nonsensical to take i_lock when the inode is being added to or
> removed from external containers because i_lock doesn't protect those
> containers!

Membership in a data structure is a property of the item, conceptually
and literally when you're messing with list entries and such. There is
no conceptual vandalism here, as far as I can tell.

And it just works better this way when lifting inode_lock (and
dcache_lock). Currently, we do things like:

	spin_lock(&inode_lock);
	add_to_list1(inode);
	add_to_list2(inode);
	inode->blah = something;
	spin_unlock(&inode_lock);

If you lock list1, list2, and blah separately, it is no longer an
equivalent transformation: other code could find an inode on list1 that
is not yet on list2 and whose blah is not yet set.
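
With i_lock covering the inode's fields and its list membership, the
equivalent (again just a sketch, keeping the pseudo-names from above
and nesting the per-list locks inside i_lock as per the ordering) is:

	spin_lock(&inode->i_lock);
	spin_lock(&list1_lock);
	add_to_list1(inode);
	spin_unlock(&list1_lock);
	spin_lock(&list2_lock);
	add_to_list2(inode);
	spin_unlock(&list2_lock);
	inode->blah = something;
	spin_unlock(&inode->i_lock);

Anyone who takes i_lock before examining the inode still sees either
none or all of these changes, so the transformation stays equivalent.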

The real data object of interest in all cases is the inode object.
Other structures are just in aid of finding inode objects that have
a particular property.

So it makes a lot of sense to have one lock to rule the inode (as
opposed to what we have now: one lock to rule *all* inodes).

It is possible that locking can be reduced if some things are verified
and carefully shown not to matter. I just don't see the need yet, and I
think it would make things overly complicated. Introducing any more
complexity will sink this patchset.


> > and then take each
> > of these locks in turn, inside it. The alternative is to build a bigger
> > lock ordering graph, and take all the locks up-front before taking
> > i_lock. I did actually try that and it ended up being worse, so I went
> > this route.
> >
> > I think taking a global lock in mark_inode_dirty is nutty (especially
> > when that global lock is shared with hash management, LRU scanning,
> > writeback, i_flags access... :) It's just a question of which is less
> > nutty.
>
> Yes, inode_lock is bad news.

Hopefully not for long :)
