From: Andrea Arcangeli on
On Thu, Apr 01, 2010 at 06:39:48PM +0300, Avi Kivity wrote:
> On 04/01/2010 06:36 PM, Andrea Arcangeli wrote:
> > On Thu, Apr 01, 2010 at 01:16:32PM +0200, Peter Zijlstra wrote:
> >
> >> On Thu, 2010-04-01 at 14:13 +0300, Avi Kivity wrote:
> >>
> >>
> >>> If someone is willing to audit all code paths to make sure these locks
> >>> are always taken in schedulable context, I agree that's a better fix.
> >>>
> >> They had better be; they're not irq-safe. Also, that's what lockdep
> >> is for.
> >>
> > In my original patchset I included patches from Christoph to convert
> > those locks to mutexes; apparently there was no problem at all with
> > that. But frankly, I think the only problem here is the warning. The
> > only complaints we've ever had here are from developers, not from
> > users. If this were a practical problem, I think we would have heard
> > something by now, with so many KVM users out there (and gru too).
> >
> > The single reason I'd go for mutexes would be to accommodate XPMEM's
> > requirements once and for all, no other reason.
> >
>
> There is also a minor benefit for kvm: reduced latency over large mmu
> operations, and some code simplification (we now have some
> copy_from_user_inatomic() calls that could be simplified).

Where exactly is KVM taking these locks? KVM should only call into
GUP, and GUP itself won't iterate over rmaps either. GUP just walks
the host pagetables and triggers page faults if the pages aren't
mapped. I don't see how you're going to remove
copy_from_user_inatomic(), given we don't have the vmas and other
metadata needed to take those locks. Maybe we can stop calling GUP,
but even if we took the anon_vma mutex/semaphore, I think it still
wouldn't prevent munmap from dropping the anon pages out from under
us (even if it would stop the VM from unmapping them through rmap).
To freeze the mapping one would need to take mmap_sem in write mode
in addition to the anon_vma mutex/sem, which is unlikely to be a win
compared to just gup + copy_from_user_inatomic(). So I don't see
immediate benefits for KVM, but maybe I'm missing something, of
course!
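
A minimal sketch of the gup + copy_from_user_inatomic() pattern I
mean, for reference (hypothetical code, not KVM's actual
implementation; read_guest_word() is an invented name): with page
faults disabled, the inatomic copy fails fast instead of sleeping to
fault the page in, so the caller can fall back to a path that is
allowed to sleep.

#include <linux/uaccess.h>

/*
 * Hypothetical sketch, not KVM code: copy from a user address in a
 * context where we must not sleep.  With page faults disabled,
 * __copy_from_user_inatomic() returns the number of bytes left
 * uncopied instead of faulting the page in, so the caller can fall
 * back to a sleeping path (gup + retry).
 */
static int read_guest_word(const void __user *uaddr, unsigned long *val)
{
        unsigned long left;

        pagefault_disable();
        left = __copy_from_user_inatomic(val, uaddr, sizeof(*val));
        pagefault_enable();

        return left ? -EFAULT : 0;
}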
From: Peter Zijlstra on
On Thu, 2010-04-01 at 18:18 +0200, Andrea Arcangeli wrote:
> On Thu, Apr 01, 2010 at 06:12:34PM +0200, Peter Zijlstra wrote:
> > One thing we can do there is to mutex_trylock(): if we get the lock,
> > see if we've got the right object; if the trylock fails, we can do
> > the refcount thing and sleep. That would allow the fast path to
> > remain a single atomic.
>
> But then how do you know which anon_vma_unlink() has to decrease the
> refcount and which doesn't? That info would need to be stored on the
> kernel stack; it can't be stored in the vma. I guess it's feasible,
> but passing that info around sounds trickier than the trylock itself
> (adding an int *refcount parameter to those functions).

I was thinking of something like:

struct anon_vma *page_lock_anon_vma(struct page *page)
{
        struct anon_vma *anon_vma = NULL;
        unsigned long anon_mapping;

        rcu_read_lock();
        anon_mapping = (unsigned long) ACCESS_ONCE(page->mapping);
        if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
                goto out;
        if (!page_mapped(page))
                goto out;

        anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
        if (!mutex_trylock(&anon_vma->lock)) {
                if (atomic_inc_not_zero(&anon_vma->ref)) {
                        /* contended: pin it, then sleep on the mutex */
                        rcu_read_unlock();
                        mutex_lock(&anon_vma->lock);
                        atomic_dec(&anon_vma->ref); /* ensure the lock pins it */
                        return anon_vma;
                }
                /* refcount already hit zero: it is being freed */
                anon_vma = NULL;
        }
out:
        rcu_read_unlock();

        return anon_vma;
}

void page_unlock_anon_vma(struct anon_vma *anon_vma)
{
        mutex_unlock(&anon_vma->lock);
}

Then anybody reaching ref==0 would only need to sync against the lock
before freeing.
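
Spelled out, that freeing rule might look like the following (a rough
sketch; anon_vma_put() and anon_vma_free() are made-up names, and the
RCU side of the object's lifetime is deliberately elided here):

void anon_vma_put(struct anon_vma *anon_vma)
{
        if (atomic_dec_and_test(&anon_vma->ref)) {
                /*
                 * Sync against the lock before freeing: a lookup that
                 * won the mutex_trylock() above holds the mutex without
                 * a reference, so cycle the mutex to let it drain.
                 */
                mutex_lock(&anon_vma->lock);
                mutex_unlock(&anon_vma->lock);
                anon_vma_free(anon_vma);        /* placeholder */
        }
}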

From: Andrea Arcangeli on
On Thu, Apr 01, 2010 at 05:56:46PM +0200, Peter Zijlstra wrote:
> I would much rather we make call_rcu_preempt() available at all times.

srcu is needed only for XPMEM, to make the mmu notifier handlers
sleepable. Ignore it for now; it can be done later. The locks you're
changing are always taken _before_ mmu_notifier_range_start and
always after mmu_notifier_range_end, so srcu can be done later...

It's an orthogonal issue, but the moment these locks are sleepable
it's simply worth switching mmu notifiers to srcu to accommodate
XPMEM.
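
For reference, the shape of such a conversion (illustrative only,
with invented names; not the real mmu notifier code): handlers are
invoked under srcu_read_lock() instead of rcu_read_lock(), so they
may sleep.

#include <linux/srcu.h>

static struct srcu_struct mmu_notifier_srcu;    /* invented name */

void sketch_invalidate_range_start(struct mm_struct *mm,
                                   unsigned long start,
                                   unsigned long end)
{
        int idx;

        idx = srcu_read_lock(&mmu_notifier_srcu);
        /* walk mm's notifier list; handlers may sleep under SRCU */
        srcu_read_unlock(&mmu_notifier_srcu, idx);
}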
From: Peter Zijlstra on
On Thu, 2010-04-01 at 18:45 +0200, Peter Zijlstra wrote:
> On Thu, 2010-04-01 at 18:18 +0200, Andrea Arcangeli wrote:
> > On Thu, Apr 01, 2010 at 06:12:34PM +0200, Peter Zijlstra wrote:
> > > One thing we can do there is to mutex_trylock(): if we get the lock,
> > > see if we've got the right object; if the trylock fails, we can do
> > > the refcount thing and sleep. That would allow the fast path to
> > > remain a single atomic.
> >
> > But then how do you know which anon_vma_unlink() has to decrease the
> > refcount and which doesn't? That info would need to be stored on the
> > kernel stack; it can't be stored in the vma. I guess it's feasible,
> > but passing that info around sounds trickier than the trylock itself
> > (adding an int *refcount parameter to those functions).
>
> I was thinking of something like:
>
> struct anon_vma *page_lock_anon_vma(struct page *page)
> {
>         struct anon_vma *anon_vma = NULL;
>         unsigned long anon_mapping;
>
>         rcu_read_lock();
>         anon_mapping = (unsigned long) ACCESS_ONCE(page->mapping);
>         if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
>                 goto out;
>         if (!page_mapped(page))
>                 goto out;
>
>         anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
>         if (!mutex_trylock(&anon_vma->lock)) {
>                 if (atomic_inc_not_zero(&anon_vma->ref)) {
>                         /* contended: pin it, then sleep on the mutex */
>                         rcu_read_unlock();
>                         mutex_lock(&anon_vma->lock);
>                         atomic_dec(&anon_vma->ref); /* ensure the lock pins it */
>                         return anon_vma;
>                 }
>                 /* refcount already hit zero: it is being freed */
>                 anon_vma = NULL;
>         }
> out:
>         rcu_read_unlock();
>
>         return anon_vma;
> }
>
> void page_unlock_anon_vma(struct anon_vma *anon_vma)
> {
>         mutex_unlock(&anon_vma->lock);
> }
>
> Then anybody reaching ref==0 would only need to sync against the lock
> before freeing.

Ah, there is a race where the dec after taking the lock makes the
count 0; we could catch that by letting it go to -1 and freeing in
page_unlock_anon_vma().

From: Andrea Arcangeli on
On Thu, Apr 01, 2010 at 06:49:52PM +0200, Peter Zijlstra wrote:
> On Thu, 2010-04-01 at 18:45 +0200, Peter Zijlstra wrote:
> > On Thu, 2010-04-01 at 18:18 +0200, Andrea Arcangeli wrote:
> > > On Thu, Apr 01, 2010 at 06:12:34PM +0200, Peter Zijlstra wrote:
> > > > One thing we can do there is to mutex_trylock(): if we get the lock,
> > > > see if we've got the right object; if the trylock fails, we can do
> > > > the refcount thing and sleep. That would allow the fast path to
> > > > remain a single atomic.
> > >
> > > But then how do you know which anon_vma_unlink() has to decrease the
> > > refcount and which doesn't? That info would need to be stored on the
> > > kernel stack; it can't be stored in the vma. I guess it's feasible,
> > > but passing that info around sounds trickier than the trylock itself
> > > (adding an int *refcount parameter to those functions).
> >
> > I was thinking of something like:
> >
> > struct anon_vma *page_lock_anon_vma(struct page *page)
> > {
> >         struct anon_vma *anon_vma = NULL;
> >         unsigned long anon_mapping;
> >
> >         rcu_read_lock();
> >         anon_mapping = (unsigned long) ACCESS_ONCE(page->mapping);
> >         if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
> >                 goto out;
> >         if (!page_mapped(page))
> >                 goto out;
> >
> >         anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
> >         if (!mutex_trylock(&anon_vma->lock)) {
> >                 if (atomic_inc_not_zero(&anon_vma->ref)) {
> >                         /* contended: pin it, then sleep on the mutex */
> >                         rcu_read_unlock();
> >                         mutex_lock(&anon_vma->lock);
> >                         atomic_dec(&anon_vma->ref); /* ensure the lock pins it */
> >                         return anon_vma;
> >                 }
> >                 /* refcount already hit zero: it is being freed */
> >                 anon_vma = NULL;
> >         }
> > out:
> >         rcu_read_unlock();
> >
> >         return anon_vma;
> > }
> >
> > void page_unlock_anon_vma(struct anon_vma *anon_vma)
> > {
> >         mutex_unlock(&anon_vma->lock);
> > }
> >
> > Then anybody reaching ref==0 would only need to sync against the lock
> > before freeing.
>
> Ah, there is a race where the dec after taking the lock makes the
> count 0; we could catch that by letting it go to -1 and freeing in
> page_unlock_anon_vma().

You'd simply need atomic_dec_and_test() instead of atomic_dec()
above; then you free it right there and return NULL.

The bug is ever calling atomic_dec() instead of atomic_dec_and_test(),
all over the place. I doubt you could fix it with a -1.
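
In code, that would make the slow path of the sketch above look
something like this (again a sketch; anon_vma_free() is a placeholder
for whatever the real freeing routine would be):

        if (!mutex_trylock(&anon_vma->lock)) {
                if (atomic_inc_not_zero(&anon_vma->ref)) {
                        rcu_read_unlock();
                        mutex_lock(&anon_vma->lock);
                        if (atomic_dec_and_test(&anon_vma->ref)) {
                                /* last ref: free it here, return NULL */
                                mutex_unlock(&anon_vma->lock);
                                anon_vma_free(anon_vma);
                                return NULL;
                        }
                        return anon_vma;
                }
                anon_vma = NULL;
        }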