From: Rik van Riel
On 04/07/2010 05:19 PM, Linus Torvalds wrote:

> Comments?

I remember there being an "unfixable" spot with this
approach when I originally wrote the new anon_vma
linking code.

However, I can't for the life of me find that spot.
I am starting to believe I made it fixable as a side
effect of one of the changes I made :)

One of the issues with your patch is that anon_vma_prepare
can fail and this patch ignores its return value.

Having anon_vma_prepare fail after an mremap or mprotect
might result in messing up the VMAs of a process, or having
to undo the VMA changes that were made.

In fact, this may be the problem I was running into - not
wanting to add even more complex error paths to the vma
shuffling code.

From: Linus Torvalds


On Wed, 7 Apr 2010, Rik van Riel wrote:
>
> One of the issues with your patch is that anon_vma_prepare
> can fail and this patch ignores its return value.

Yes. The failure point is too late to do anything really interesting with,
and the old code also just causes a SIGBUS. My intention was to change the

WARN_ONCE(!vma->anon_vma);

into returning that SIGBUS - which is not wonderful, but is no different
from old failures.

In the long run, it would be nicer to actually return an error from the
mmap() that fails, but that's more complicated, and as mentioned, it's not
what the old code used to do either (since the failure point was always at
the page fault stage).

> Having anon_vma_prepare fail after an mremap or mprotect
> might result in messing up the VMAs of a process, or having
> to undo the VMA changes that were made.

We really aren't any worse off than we have always been.

If anon_vma_prepare() fails, the vma list will be valid, but no new pages
can be added to that vma. That used to be true before too.

Linus

From: Linus Torvalds


On Wed, 7 Apr 2010, Linus Torvalds wrote:
>
> In the long run, it would be nicer to actually return an error from the
> mmap() that fails, but that's more complicated, and as mentioned, it's not
> what the old code used to do either (since the failure point was always at
> the page fault stage).

Put another way: I'm not proud of it, but the new code isn't any worse
than what we used to have, and I think the new code is _fixable_.

The easiest way to do that would likely be to pre-allocate the anon_vma
struct (and anon_vma_chain), and pass it down to anon_vma_prepare. That
way anon_vma_prepare() itself can never fail, and all we need to do is a
simple allocation earlier in the call-chain.
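
As a rough illustration (the container and helper names here are made
up; the allocators are assumed to be the ones living in mm/rmap.c), the
pre-allocation could look something like:

/*
 * Sketch only.  The allocations that can fail happen up front, where
 * returning -ENOMEM is trivial, and the later prepare step would merely
 * link the pre-allocated objects and therefore could not fail.
 */
struct anon_vma_prealloc {
	struct anon_vma *anon_vma;
	struct anon_vma_chain *avc;
};

static int anon_vma_prealloc_init(struct anon_vma_prealloc *p)
{
	p->anon_vma = anon_vma_alloc();		/* assumed mm/rmap.c helper */
	p->avc = anon_vma_chain_alloc();	/* ditto */
	if (!p->anon_vma || !p->avc) {
		if (p->anon_vma)
			anon_vma_free(p->anon_vma);
		if (p->avc)
			anon_vma_chain_free(p->avc);
		return -ENOMEM;
	}
	return 0;
}

The pair would then be handed down the call-chain to where
anon_vma_prepare() runs today, which is the part that needs the actual
plumbing.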

Linus

From: Linus Torvalds


On Wed, 7 Apr 2010, Linus Torvalds wrote:
>
> Yes. The failure point is too late to do anything really interesting with,
> and the old code also just causes a SIGBUS. My intention was to change the
>
> WARN_ONCE(!vma->anon_vma);
>
> into returning that SIGBUS - which is not wonderful, but is no different
> from old failures.

Not SIGBUS, but VM_FAULT_OOM, of course.

IOW, something like this should be no worse than what we have now, and has
the much nicer locking semantics.

Having done some more digging, I can point to a downside: we do end up
having about twice as many anon_vma entries. It seems about half of the
vma's never need an anon_vma entry, probably because they end up being
read-only file mappings, and thus never trigger the anonvma case.

That said:

 - I don't really think you can fix the locking problem you have in a
   saner way

 - the anon_vma entry is much smaller than the vm_area_struct, so we're
   still using much less memory for them than for vma's.

 - We _could_ avoid allocating anonvma entries for shared mappings or for
   mappings that are read-only. That might force us to allocate some of
   them at mprotect time, and/or when doing a forced COW event with
   ptrace, but we have the mmap_sem for writing for the one case, and we
   could decide to get it for the other (a rough sketch of the filter
   follows this list).

   So it's not a _fundamental_ problem if we decide we want to recover
   most of the memory lost by doing unconditional allocations.
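
(The filter hinted at in the third point could be as simple as a flags
test at vma_link() time; sketch only, the exact test is an assumption:)

	/*
	 * Hypothetical: only set up an anon_vma for mappings that can
	 * actually produce anonymous pages, i.e. private and writable.
	 * Everything else would be handled lazily, e.g. at mprotect()
	 * time or on a forced ptrace COW, with mmap_sem held for writing.
	 */
	if (!(vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_WRITE))
		anon_vma_prepare(vma);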

There are alternative models. For example, the VM layer _could_ decide to
just release the mmap_sem, and re-do it and take it for writing if the vma
doesn't have an anon_vma.
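
Something along these lines, say in the arch fault path that already
owns mmap_sem (purely a sketch: the helper and where it would be called
from are made up; only find_vma(), anon_vma_prepare() and the rwsem
calls are real):

/*
 * Hypothetical slow path, called with mm->mmap_sem held for reading.
 * If the faulting vma has no anon_vma yet, retake the semaphore for
 * writing, allocate, then drop back to read mode for the normal fault.
 */
static int prepare_anon_vma_slowpath(struct mm_struct *mm, unsigned long address)
{
	struct vm_area_struct *vma = find_vma(mm, address);
	int ret = 0;

	if (!vma || vma->anon_vma)
		return 0;			/* common case: nothing to do */

	up_read(&mm->mmap_sem);
	down_write(&mm->mmap_sem);
	/* the vma may have changed while the lock was dropped: look it up again */
	vma = find_vma(mm, address);
	if (vma && unlikely(anon_vma_prepare(vma)))
		ret = -ENOMEM;
	downgrade_write(&mm->mmap_sem);		/* back to read mode for the fault */
	return ret;
}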

I dunno. I like how this patch makes things so much less subtle, though.
For example: with this in place, we could further simplify
anon_vma_prepare(), since it would now never have the re-entrancy issue
and wouldn't need to worry about taking that page_table_lock and
re-testing vma->anon_vma for races.
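
For illustration, a simplified anon_vma_prepare() might then collapse to
roughly the following (modeled loosely on the current mm/rmap.c helpers
anon_vma_alloc(), anon_vma_chain_alloc() and find_mergeable_anon_vma(),
whose exact signatures and the anon_vma/avc field layout are assumptions
here; the point is just that the page_table_lock re-test goes away once
every caller holds mmap_sem for writing):

/* Sketch only: assumes all callers hold mmap_sem for writing. */
int anon_vma_prepare(struct vm_area_struct *vma)
{
	struct anon_vma *anon_vma;
	struct anon_vma_chain *avc;

	if (likely(vma->anon_vma))
		return 0;

	avc = anon_vma_chain_alloc();
	if (!avc)
		return -ENOMEM;

	anon_vma = find_mergeable_anon_vma(vma);
	if (!anon_vma) {
		anon_vma = anon_vma_alloc();
		if (!anon_vma) {
			anon_vma_chain_free(avc);
			return -ENOMEM;
		}
	}

	/* no re-test of vma->anon_vma: mmap_sem for writing excludes racers */
	spin_lock(&anon_vma->lock);
	vma->anon_vma = anon_vma;
	avc->anon_vma = anon_vma;
	avc->vma = vma;
	list_add(&avc->same_vma, &vma->anon_vma_chain);
	list_add(&avc->same_anon_vma, &anon_vma->head);
	spin_unlock(&anon_vma->lock);

	return 0;
}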

Linus

---
 mm/memory.c |   12 +++---------
 mm/mmap.c   |   17 ++++-------------
 2 files changed, 7 insertions(+), 22 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 833952d..b5efe76 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2223,9 +2223,6 @@ reuse:
 gotten:
 	pte_unmap_unlock(page_table, ptl);
 
-	if (unlikely(anon_vma_prepare(vma)))
-		goto oom;
-
 	if (is_zero_pfn(pte_pfn(orig_pte))) {
 		new_page = alloc_zeroed_user_highpage_movable(vma, address);
 		if (!new_page)
@@ -2766,8 +2763,6 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	/* Allocate our own private page. */
 	pte_unmap(page_table);
 
-	if (unlikely(anon_vma_prepare(vma)))
-		goto oom;
 	page = alloc_zeroed_user_highpage_movable(vma, address);
 	if (!page)
 		goto oom;
@@ -2863,10 +2858,6 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (flags & FAULT_FLAG_WRITE) {
 		if (!(vma->vm_flags & VM_SHARED)) {
 			anon = 1;
-			if (unlikely(anon_vma_prepare(vma))) {
-				ret = VM_FAULT_OOM;
-				goto out;
-			}
 			page = alloc_page_vma(GFP_HIGHUSER_MOVABLE,
 						vma, address);
 			if (!page) {
@@ -3115,6 +3106,9 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	pmd_t *pmd;
 	pte_t *pte;
 
+	if (!vma->anon_vma)
+		return VM_FAULT_OOM;
+
 	__set_current_state(TASK_RUNNING);
 
 	count_vm_event(PGFAULT);
diff --git a/mm/mmap.c b/mm/mmap.c
index 75557c6..c14284b 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -463,6 +463,8 @@ static void vma_link(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	mm->map_count++;
 	validate_mm(mm);
+
+	anon_vma_prepare(vma);
 }
 
 /*
@@ -479,6 +481,8 @@ static void __insert_vm_struct(struct mm_struct *mm, struct vm_area_struct *vma)
 	BUG_ON(__vma && __vma->vm_start < vma->vm_end);
 	__vma_link(mm, vma, prev, rb_link, rb_parent);
 	mm->map_count++;
+
+	anon_vma_prepare(vma);
 }
 
 static inline void
@@ -1674,12 +1678,6 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
 	if (!(vma->vm_flags & VM_GROWSUP))
 		return -EFAULT;
 
-	/*
-	 * We must make sure the anon_vma is allocated
-	 * so that the anon_vma locking is not a noop.
-	 */
-	if (unlikely(anon_vma_prepare(vma)))
-		return -ENOMEM;
 	anon_vma_lock(vma);
 
 	/*
@@ -1720,13 +1718,6 @@ static int expand_downwards(struct vm_area_struct *vma,
 {
 	int error;
 
-	/*
-	 * We must make sure the anon_vma is allocated
-	 * so that the anon_vma locking is not a noop.
-	 */
-	if (unlikely(anon_vma_prepare(vma)))
-		return -ENOMEM;
-
 	address &= PAGE_MASK;
 	error = security_file_mmap(NULL, 0, 0, 0, address, 1);
 	if (error)

From: Rik van Riel
On 04/07/2010 06:15 PM, Linus Torvalds wrote:
> On Wed, 7 Apr 2010, Linus Torvalds wrote:
>>
>> In the long run, it would be nicer to actually return an error from the
>> mmap() that fails, but that's more complicated, and as mentioned, it's not
>> what the old code used to do either (since the failure point was always at
>> the page fault stage).
>
> Put another way: I'm not proud of it, but the new code isn't any worse
> than what we used to have, and I think the new code is _fixable_.

Agreed, it is no worse than what we had before.

As to fixable, I suspect both situations are fixable:
the new code by getting the error paths right, the
old code by completely bailing out of the page fault
and retrying it (the pageout code should trigger an
OOM kill at some point, if we are really out of memory).

> The easiest way to do that would likely be to pre-allocate the anon_vma
> struct (and anon_vma_chain), and pass it down to anon_vma_prepare. That
> way anon_vma_prepare() itself can never fail, and all we need to do is a
> simple allocation earlier in the call-chain.

That may not work, because we may want to merge the anon_vma
with the anon_vma in an adjacent VMA ... and that adjacent
VMA could be chained onto multiple anon_vmas.

That means allocating a single anon_vma_chain may not be
enough.
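
To make the sizing problem concrete: reusing a neighbour's anon_vmas
means the new vma needs one anon_vma_chain per entry on the neighbour's
own chain (that is what anon_vma_clone() walks), so a fixed
pre-allocation of one chain entry is not obviously enough.  Roughly
(illustrative helper; field names as in the current rmap code):

static int avcs_needed_for_merge(struct vm_area_struct *neighbour)
{
	struct anon_vma_chain *avc;
	int needed = 0;

	/* one new chain entry per anon_vma the neighbour is linked to */
	list_for_each_entry(avc, &neighbour->anon_vma_chain, same_vma)
		needed++;
	return needed;
}

So a pre-allocation scheme would presumably have to size itself against
the neighbour's chain first, or simply skip the merge when it runs short.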