From: Peter Zijlstra
On Thu, 2010-04-01 at 17:42 +0200, Andrea Arcangeli wrote:
> The ugliest part of it (that I think you missed below) is the breakage
> of the RCU locking in the anon-vma which requires adding refcounting
> to it. That was the worst part of the conversion as far as I can tell.
>
One thing we can do there is mutex_trylock(): if we get the lock, see
if we've got the right object; if the trylock fails we can do the
refcount thing and sleep. That would allow the fast path to remain a
single atomic.

The only thing is growing that anon_vma struct, but KSM seems to already
do that for the most common .configs.



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Andrea Arcangeli
On Thu, Apr 01, 2010 at 06:12:34PM +0200, Peter Zijlstra wrote:
> One thing we can do there is mutex_trylock(): if we get the lock, see
> if we've got the right object; if the trylock fails we can do the
> refcount thing and sleep. That would allow the fast path to remain a
> single atomic.

But then how do you know which anon_vma_unlink has to decrease the
refcount and which doesn't? That info would need to be stored on the
kernel stack; it can't be stored in the vma. I guess it's feasible, but
passing that info around sounds trickier than the trylock itself
(adding parameters like int *refcount to those functions).

> The only thing is growing that anon_vma struct, but KSM seems to already
> do that for the most common .configs.

Ok, from my initial review of memory compaction I see that it already
adds its own refcount and unifies it with the ksm_refcount. So it's
worth stopping calling it ksm_refcount and removing the #ifdef (which
memory compaction didn't remove but changed to defined(CONFIG_KSM) ||
defined(CONFIG_MIGRATION)).

If you make this change we can have your change applied first, and then
Mel can adapt memory compaction to it, sharing the same unconditionally
compiled-in refcount.
From: Andrea Arcangeli
On Thu, Apr 01, 2010 at 05:50:02PM +0200, Peter Zijlstra wrote:
> You also seem to move the tlb_gather stuff around, we have patches in

Well, that patchset is years old; there are much more recent patches
that move tlb_gather around so that it happens after the mmu_notifier
invalidate calls. That is only relevant to allowing mmu notifier
handlers to schedule.

> -rt that make tlb_gather preemptible, once i_mmap_lock is preemptible we
> can do in mainline too.

Ok. However, just moving it inside the mmu notifier range calls won't
slow anything down or reduce its effectiveness; either way will be
fine with XPMEM. This is only XPMEM material, and tlb_gather is the
trivial part of it. The hard part is making those locks
schedule-capable, and I'm sure the XPMEM developers will greatly
appreciate such a change. I thought initially it had to be conditional
on XPMEM=y, but maintaining two different locks is a bit of a pain
(especially for distros that want to ship a single kernel). Now that
this effort has started, it'd also provide some minor latency
improvement in that 1 msec when a new virtual machine starts or when a
new task registers itself with the GRU or XPMEM device drivers. I'm
personally fine with making these either mutexes or rwsems
unconditionally, as you prefer.
From: Andrea Arcangeli
On Thu, Apr 01, 2010 at 01:43:14PM +0200, Peter Zijlstra wrote:
> On Thu, 2010-04-01 at 13:27 +0200, Peter Zijlstra wrote:
> >
> > I've almost got a patch done that converts those two, still need to look
> > where that tasklist_lock muck happens.
>
> OK, so the below builds and boots, only need to track down that
> tasklist_lock nesting, but I got to run an errand first.

You should have a look at my old patchset, where Christoph already
implemented this (not to decrease latency but to allow scheduling in
mmu notifier handlers, which only XPMEM needs):

http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.26-rc7/mmu-notifier-v18/

The ugliest part of it (that I think you missed below) is the breakage
of the RCU locking in the anon-vma which requires adding refcounting
to it. That was the worst part of the conversion as far as I can tell.

http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.26-rc7/mmu-notifier-v18/anon-vma

I personally prefer the read-write locks that Christoph used for both
of them, but I'm not against mutexes either. Still, the refcounting
problem should be the same, as it's introduced by allowing the critical
sections under anon_vma->lock to schedule (no matter whether it's a
mutex or a read-write sem).
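In pseudocode, the breakage and the refcount fix look roughly like this (simplified; the lookup pattern follows the kernel's rmap code, but the exact shape is an assumption based on the anon-vma patch linked above):

```c
/* Today (spinlock, works under RCU):
 *
 *	rcu_read_lock();		// disables preemption
 *	anon_vma = page_anon_vma(page);
 *	spin_lock(&anon_vma->lock);	// non-sleeping: fine under RCU
 *	...
 *
 * With a mutex/rwsem, taking the lock may schedule while preemption
 * is off -> illegal.  The fix is to pin the object with a refcount so
 * the RCU section can end before sleeping:
 *
 *	rcu_read_lock();
 *	anon_vma = page_anon_vma(page);
 *	if (!atomic_inc_not_zero(&anon_vma->refcount))	// being freed
 *		goto out;
 *	rcu_read_unlock();		// preemption back on
 *	mutex_lock(&anon_vma->lock);	// sleeping is fine now
 */
```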
From: Peter Zijlstra
On Thu, 2010-04-01 at 18:07 +0200, Andrea Arcangeli wrote:
> On Thu, Apr 01, 2010 at 05:56:02PM +0200, Peter Zijlstra wrote:
> > Another thing is mm->nr_ptes; that doesn't appear to be properly
> > serialized: __pte_alloc() does ++ under mm->page_table_lock, but
> > free_pte_range() does --, which afaict isn't always with page_table_lock
> > held; it does however always seem to have mmap_sem for writing.
>
> Not saying this is necessarily safe, but how can that be relevant to
> the spinlock->mutex/rwsem conversion?

Not directly, but I keep running into that BUG_ON() at the end of
exit_mmap() with my conversion patch, and I thought that maybe I had
widened the race window.

But I guess I simply messed something up.

> Only thing that breaks with that conversion would be RCU (the very
> anon_vma RCU breaks because rcu_read_lock disables preemption and the
> anon_vma->lock is then taken, which falls apart because taking the
> anon_vma->lock would imply a schedule), but nr_ptes is a write
> operation so it can't be protected by RCU.
>
> > However __pte_alloc() callers do not in fact hold mmap_sem for writing.
>
> As long as the mmap_sem readers always also take the page_table_lock
> we're safe.

Ah, I see, so it's down_read(mmap_sem) + page_table_lock that's
exclusive against down_write(mmap_sem). Nifty; that should be a comment
somewhere.
