From: Mark Veltzer on
Hello all!

I have searched the list for similar issues and have not found an answer so I
am posting.

I am using 'get_user_pages' and friends to get hold of user memory in kernel
space. User space passes a buffer to the kernel; the kernel does
get_user_pages, holds the pages for some time while user space is doing
something else, writes to the pages and then releases them (SetPageDirty and
page_cache_release as per LDD 3rd edition). So far so good.

I am testing this kernel module with several buffers from user space allocated
in several different ways: heap, data segment, static variable in a function,
and stack. All scenarios work EXCEPT the stack one. When passing the stack
buffer the kernel sees one thing while user space sees another.

My not so intelligent questions (they may well be off the mark):
- How can this be? (two views of the same page)
- Does not 'get_user_pages' pin the pages?
- Could this be due to stack protection of some sort?
- Do I need to do anything extra with the vm_area I receive for the stack
pages EXCEPT 'get_user_pages' ?

I know this is not an orthodox way to write a driver and I would be better off
using mmap for these things, but I have other constraints in this driver design
that I do not want to bore you with. I am also aware that passing a buffer on
the stack and letting user space continue running is a very dangerous thing to
do for user space (or kernel space) integrity. I wish I could do it another
way...

The platform is standard 32-bit x86 with the standard kernels and headers
distributed with Ubuntu 9.04 and 9.10, which are 2.6.28 and 2.6.31.

Please reply to my email as well as I am not a subscriber.

Cheers,
Mark
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Andi Kleen on
Mark Veltzer <mark.veltzer(a)gmail.com> writes:
>
> I am testing this kernel module with several buffers from user space allocated
> in several different ways. heap, data segment, static variable in function and
> stack. All scenarios work EXCEPT the stack one. When passing the stack buffer
> the kernel sees one thing while user space sees another.

In theory it should work, stack is no different from any other pages.
First thought was that you used some platform with incoherent caches,
but that doesn't seem to be the case if it's standard x86.

> My not so intelligent questions (they may well be off the mark):
> - How can this be? (two views of the same page)

It should not be on a coherent platform.

> - Does not 'get_user_pages' pin the pages?

Yes it does.

> - Could this be due to stack protection of some sort?

No.

> - Do I need to do anything extra with the vm_area I receive for the stack
> pages EXCEPT 'get_user_pages' ?

No. Stack is like any other user memory.

Most likely it's some bug in your code.

-Andi

--
ak(a)linux.intel.com -- Speaking for myself only.
From: Hugh Dickins on
On Mon, 9 Nov 2009, Andi Kleen wrote:
> Mark Veltzer <mark.veltzer(a)gmail.com> writes:
> >
> > I am testing this kernel module with several buffers from user space allocated
> > in several different ways. heap, data segment, static variable in function and
> > stack. All scenarios work EXCEPT the stack one. When passing the stack buffer
> > the kernel sees one thing while user space sees another.
>
> In theory it should work, stack is no different from any other pages.
> First thought was that you used some platform with incoherent caches,
> but that doesn't seem to be the case if it's standard x86.

It may be irrelevant to Mark's stack case, but it is worth mentioning
the fork problem: how a process does get_user_pages to pin down a buffer
somewhere in anonymous memory, a thread forks (write protecting anonymous
memory shared between parent and child), child userspace writes to a
location in the same page as that buffer, causing copy-on-write which
breaks the connection between the get_user_pages buffer and what child
userspace sees there afterwards.
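
A rough user-space sketch of the copy-on-write half of this, assuming Linux and
a private anonymous mapping standing in for the buffer. It does not call
get_user_pages (that needs a driver); it only shows that once one side writes
after fork(), parent and child stop sharing the page. The helper name
cow_separates_parent_and_child is made up for illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the thread).
 * Returns 1 if a write made by the child after fork() is invisible to the
 * parent, 0 if the parent sees it, -1 on error.
 */
int cow_separates_parent_and_child(void)
{
    size_t len = (size_t)sysconf(_SC_PAGESIZE);
    char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;

    strcpy(buf, "old");

    pid_t pid = fork();
    if (pid < 0) {
        munmap(buf, len);
        return -1;
    }
    if (pid == 0) {
        /* Child: the page is write-protected after fork(), so this write
         * triggers copy-on-write and the two processes stop sharing it. */
        strcpy(buf, "new");
        _exit(0);
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status)) {
        munmap(buf, len);
        return -1;
    }

    /* Parent: still sees its own contents, not the child's write. */
    int separated = (strcmp(buf, "old") == 0);
    munmap(buf, len);
    return separated;
}
```

A return value of 1 means the parent still reads "old" after the child wrote
"new". With a pinned page in the picture, the kernel keeps pointing at
whichever physical page it pinned, so it can end up agreeing with only one of
the two processes.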

Hugh
From: Mark Veltzer on
On Monday 09 November 2009 12:32:52 you wrote:
> On Mon, 9 Nov 2009, Andi Kleen wrote:
> > Mark Veltzer <mark.veltzer(a)gmail.com> writes:
> > > I am testing this kernel module with several buffers from user space
> > > allocated in several different ways. heap, data segment, static
> > > variable in function and stack. All scenarios work EXCEPT the stack
> > > one. When passing the stack buffer the kernel sees one thing while user
> > > space sees another.
> >
> > In theory it should work, stack is no different from any other pages.
> > First thought was that you used some platform with incoherent caches,
> > but that doesn't seem to be the case if it's standard x86.
>
> It may be irrelevant to Mark's stack case, but it is worth mentioning
> the fork problem: how a process does get_user_pages to pin down a buffer
> somewhere in anonymous memory, a thread forks (write protecting anonymous
> memory shared between parent and child), child userspace writes to a
> location in the same page as that buffer, causing copy-on-write which
> breaks the connection between the get_user_pages buffer and what child
> userspace sees there afterwards.
>
> Hugh
>

Thanks Hugh and Andi

Hugh, you actually hit the nail on the head!

I was forking while doing these mappings and the child won the race and got to
keep the pinned pages while the parent got left with a copy which meant
nothing. The thing is that it was hard to spot because I was using a library
function which called a function etc... which eventually did some system(3).
It only happened in the stack test case because the child was, on purpose, not
really doing anything with the pinned memory, and so in all the other cases it
did not touch the memory; the exception is the stack, which it, of course,
uses. The child won the race in the stack case and so shared the data with the
kernel and the parent got a copy with the old data.

I understand that madvise(2) can prevent this copy-on-write and race between
child and parent and I also duplicated it in the kernel using the following
code:

/* addr is the user-space buffer address, cast to unsigned long */
down_write(&current->mm->mmap_sem);
vma = find_vma(current->mm, addr);
if (vma && vma->vm_start <= addr)
        vma->vm_flags |= VM_DONTCOPY;
up_write(&current->mm->mmap_sem);

The above code is actually a kernel version of madvise(2) and MADV_DONTFORK.
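
For comparison, the user-space side might look like the sketch below. It
assumes Linux with glibc and uses a dedicated anonymous mapping rather than a
stack buffer; the helper name alloc_dontfork_buffer and its mapped_len
parameter are invented for illustration and are not part of my driver.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the thread).
 * Allocates a page-aligned anonymous buffer and marks it MADV_DONTFORK so
 * that fork() does not copy the mapping into children.
 * Returns NULL on failure. On success *mapped_len holds the rounded-up
 * length to pass to munmap() later.
 */
void *alloc_dontfork_buffer(size_t len, size_t *mapped_len)
{
    assert(len > 0 && mapped_len != NULL);

    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t rounded = (len + page - 1) / page * page;

    void *buf = mmap(NULL, rounded, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return NULL;

    /* madvise() works on whole pages, hence the page-aligned mapping. */
    if (madvise(buf, rounded, MADV_DONTFORK) != 0) {
        munmap(buf, rounded);
        return NULL;
    }

    *mapped_len = rounded;
    return buf;
}
```

The buffer returned here would be the one handed to the driver. Because the
mapping is simply absent in any child, a later fork() cannot write-protect it
in the parent, and so no copy-on-write can separate the parent from the pinned
pages.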

The problem with this solution (either madvise in user space or DONTCOPY in
kernel) is that I give up the ability to fork(2) since the child is left
stackless (or with a hole in its stack; I'm not sure...)
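
A small sketch of what the child runs into, assuming Linux and using an
ordinary anonymous mapping instead of the real stack so that the child survives
long enough to touch it. The helper name child_fate_after_dontfork is
hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the thread).
 * Returns the signal that killed a child which touched a MADV_DONTFORK
 * region, 0 if the child exited normally, -1 on error.
 */
int child_fate_after_dontfork(void)
{
    size_t len = (size_t)sysconf(_SC_PAGESIZE);
    volatile char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;

    if (madvise((void *)buf, len, MADV_DONTFORK) != 0) {
        munmap((void *)buf, len);
        return -1;
    }
    buf[0] = 'p';               /* parent can use the region as usual */

    pid_t pid = fork();
    if (pid < 0) {
        munmap((void *)buf, len);
        return -1;
    }
    if (pid == 0) {
        buf[0] = 'c';           /* the mapping does not exist in the child */
        _exit(0);               /* only reached if the write did not fault */
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0) {
        munmap((void *)buf, len);
        return -1;
    }
    munmap((void *)buf, len);

    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

On the systems I would expect this to run on, the result is SIGSEGV: the child
has no mapping at that address at all. A stack page marked this way would
presumably fail the same way as soon as the child touched it.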

My question is: is there a way to allow forking while still pinning STACK
memory via get_user_pages? I can actually live with the current solution since
I can make sure that the user space thread that does the work with the driver
never forks but I'm interested to know what other neat VM tricks Linux has up
its sleeve...

BTW: would it not be a good addition to the madvise(2) manpage to state that
you should be careful with doing madvise(DONTFORK) because you may segfault
your children and that doing so on a stack address has even more chance of
crashing children? Who should I talk to about adding this info to the manual
page? The current manpage that I have only talks about scatter-gather uses of
DONTFORK and does not mention the problems of DONTFORK...

Thanks in advance
Mark
From: Hugh Dickins on
On Tue, 10 Nov 2009, Mark Veltzer wrote:
> On Monday 09 November 2009 12:32:52 you wrote:
> > On Mon, 9 Nov 2009, Andi Kleen wrote:
> > > Mark Veltzer <mark.veltzer(a)gmail.com> writes:
> > > > I am testing this kernel module with several buffers from user space
> > > > allocated in several different ways. heap, data segment, static
> > > > variable in function and stack. All scenarios work EXCEPT the stack
> > > > one. When passing the stack buffer the kernel sees one thing while user
> > > > space sees another.
> > >
> > > In theory it should work, stack is no different from any other pages.
> > > First thought was that you used some platform with incoherent caches,
> > > but that doesn't seem to be the case if it's standard x86.
> >
> > It may be irrelevant to Mark's stack case, but it is worth mentioning
> > the fork problem: how a process does get_user_pages to pin down a buffer
> > somewhere in anonymous memory, a thread forks (write protecting anonymous
> > memory shared between parent and child), child userspace writes to a
> > location in the same page as that buffer, causing copy-on-write which
> > breaks the connection between the get_user_pages buffer and what child
> > userspace sees there afterwards.
>
> Thanks Hugh and Andi
>
> Hugh, you actually hit the nail on the head!

I'm glad that turned out to be relevant and helpful.

>
> I was forking while doing these mappings and the child won the race and got to
> keep the pinned pages while the parent got left with a copy which meant
> nothing. The thing is that it was hard to spot because I was using a library
> function which called a function etc... which eventually did some system(3).
> It only happened in the stack test case because the child was, on purpose, not
> really doing anything with the pinned memory, and so in all the other cases it
> did not touch the memory; the exception is the stack, which it, of course,
> uses. The child won the race in the stack case and so shared the data with the
> kernel and the parent got a copy with the old data.
>
> I understand that madvise(2) can prevent this copy-on-write and race between
> child and parent and I also duplicated it in the kernel using the following
> code:
>
> [lock the current->mm for writing]
> vma=find_vma(current->mm, [user pointer])
> vma->vm_flags|=VM_DONTCOPY
> [unlock the current->mm for writing]
>
> The above code is actually a kernel version of madvise(2) and MADV_DONTFORK.
>
> The problem with this solution (either madvise in user space or DONTCOPY in
> kernel) is that I give up the ability to fork(2) since the child is left
> stackless (or with a hole in its stack; I'm not sure...)
>
> My question is: is there a way to allow forking while still pinning STACK
> memory via get_user_pages? I can actually live with the current solution since
> I can make sure that the user space thread that does the work with the driver
> never forks but I'm interested to know what other neat vm tricks linux has up
> its sleeve...

I think MADV_DONTFORK is as far as we've gone,
but I might be forgetting something.

In fairness I've added Andrea and KOSAKI-san to the Cc, since I know
they are two people keen to fix this issue once and for all. Whereas
I am with Linus in the opposite camp: solutions have looked nasty,
and short of bright new ideas, I feel we've gone as far as we ought.

Just don't do that: don't test the incompatibility of GUP pinning
versus COW semantics, by placing such buffers in problematic areas
while forking.

(That sentence might be more convincing if we put in more thought,
to enumerate precisely which areas are "problematic".)

>
> BTW: would it not be a good addition to the madvise(2) manpage to state that
> you should be careful with doing madvise(DONTFORK) because you may segfault
> your children and that doing so on a stack address has even more chance of
> crashing children? Who should I talk to about adding this info to the manual
> page? The current manpage that I have only talks about scatter-gather uses of
> DONTFORK and does not mention the problems of DONTFORK...

Michael looks after the manpages, I've added him to the Cc.
Yes, an additional sentence there might indeed be helpful.

Hugh