From: Andrew Morton on
On Fri, 9 Oct 2009 13:50:17 +0900 (JST)
KOSAKI Motohiro <kosaki.motohiro(a)jp.fujitsu.com> wrote:

> ok, all lkml reviewer ack this patch.
> then, I've switched to linux-api review.
>
> Any comments?
>
>
>
> ==================================================
> From: Timo Sirainen <tss(a)iki.fi>
>
> Currently glibc2 doesn't have setproctitle(3), so several userland
> daemons attempt to emulate it by doing some brutal stack modifications.
> This works most of the time, but it has problems. For example:
>
> % ps -ef |grep avahi-daemon
> avahi 1679 1 0 09:20 ? 00:00:00 avahi-daemon: running [kosadesk.local]
>
> # cat /proc/1679/cmdline
> avahi-daemon: running [kosadesk.local]
>
> This looks good, but the process has also overwritten its environment
> area and made the environ file useless:
>
> # cat /proc/1679/environ
> adesk.local]
>
> Another problem is that the process title length is limited by the size of
> the environment. Security conscious people try to avoid potential information
> leaks by clearing most of the environment before running a daemon:
>
> # env - MINIMUM_NEEDED_VAR=foo /path/to/daemon
>
> The resulting environment size may be too small to fit the wanted process
> titles.
>
> This patch makes it possible for userspace to implement setproctitle()
> cleanly. It adds a new PR_SET_PROCTITLE_AREA option for prctl(), which
> updates task's mm_struct->arg_start and arg_end to the given area.
>
> test_setproctitle.c
> ================================================
> #define ERR(str) (perror(str), exit(1))
>
> void settitle(char* title){
> int err;
>
> err = prctl(34, title, strlen(title)+1);
> if (err < 0)
> ERR("prctl ");
> }
>
> void main(void){
> long i;
> char buf[1024];
>
> for (i = 0; i < 10000000000LL; i++){
> sprintf(buf, "loooooooooooooooooooooooong string %d",i);
> settitle(buf);
> }
> }

gee that's a good changelog.

> ...
>
> @@ -255,32 +255,47 @@ static int proc_pid_cmdline(struct task_struct *task, char * buffer)
> int res = 0;
> unsigned int len;
> struct mm_struct *mm = get_task_mm(task);
> + unsigned seq;
> +
> if (!mm)
> goto out;
> +
> + /* The process was not constructed yet? */
> if (!mm->arg_end)
> - goto out_mm; /* Shh! No looking before we're done */
> + goto out_mm;
>
> - len = mm->arg_end - mm->arg_start;
> -
> - if (len > PAGE_SIZE)
> - len = PAGE_SIZE;
> -
> - res = access_process_vm(task, mm->arg_start, buffer, len, 0);
> -
> - // If the nul at the end of args has been overwritten, then
> - // assume application is using setproctitle(3).
> - if (res > 0 && buffer[res-1] != '\0' && len < PAGE_SIZE) {
> - len = strnlen(buffer, res);
> - if (len < res) {
> - res = len;
> - } else {
> - len = mm->env_end - mm->env_start;
> - if (len > PAGE_SIZE - res)
> - len = PAGE_SIZE - res;
> - res += access_process_vm(task, mm->env_start, buffer+res, len, 0);
> + do {
> + seq = read_seqbegin(&mm->arg_lock);
> +
> + len = mm->arg_end - mm->arg_start;
> + if (len > PAGE_SIZE)
> + len = PAGE_SIZE;
> +
> + res = access_process_vm(task, mm->arg_start, buffer, len, 0);
> +
> + if (mm->arg_end != mm->env_start)
> + /* PR_SET_PROCTITLE_AREA used */
> res = strnlen(buffer, res);

Hang on.

If PR_SET_PROCTITLE_AREA installed a bad address then
access_process_vm() will return -EFAULT and nothing was written to
`buffer'?

Also, concurrent PR_SET_PROCTITLE_AREA could cause arg_start and
arg_end to have inconsistent/incoherent values. This might result in a
fault but it also might result in a read of garbage userspace memory.

I guess all of these issues could be written off as "userspace bugs",
but our behaviour here isn't nice. We should at least return that
-EFAULT!.

> + else if (res > 0 && buffer[res-1] != '\0' && len < PAGE_SIZE) {
> + /*
> + * If the nul at the end of args has been overwritten,
> + * then assume application is using sendmail's
> + * SPT_REUSEARGV style argv override.
> + */
> + len = strnlen(buffer, res);
> + if (len < res) {
> + res = len;
> + } else {
> + len = mm->env_end - mm->env_start;
> + if (len > PAGE_SIZE - res)
> + len = PAGE_SIZE - res;
> + res += access_process_vm(task, mm->env_start,
> + buffer+res, len, 0);
> + res = strnlen(buffer, res);
> + }

And if access_process_vm() returns a -ve errno here, the code simply
flies off into the weeds.

> }
> - }
> + } while (read_seqretry(&mm->arg_lock, seq));
> +
> out_mm:
> mmput(mm);
>
> ...
>
> @@ -236,6 +237,7 @@ struct mm_struct {
> unsigned long stack_vm, reserved_vm, def_flags, nr_ptes;
> unsigned long start_code, end_code, start_data, end_data;
> unsigned long start_brk, brk, start_stack;
> + seqlock_t arg_lock;
> unsigned long arg_start, arg_end, env_start, env_end;
>
> unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */

seqlocks are nice but I have to wonder if they made this patch
unnecessarily complex. Couldn't we just do:

PR_SET_PROCTITLE_AREA:

spin_lock(mm->lock);
mm->arg_start = addr;
mm->arg_end = addr + len;
spin_unlock(mm->lock);

and

proc_pid_cmdline()
{
unsigned long addr;
unsigned long end;

spin_lock(mm->lock);
addr = mm->arg_start;
end = mm->arg_end;
spin_unlock(mm->lock);

<use `addr' and `len'>
}

?


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Bryan Donlan on
On Fri, Oct 9, 2009 at 8:13 PM, Andrew Morton <akpm(a)linux-foundation.org> wrote:

>> + � � � � � � res = access_process_vm(task, mm->arg_start, buffer, len, 0);
>> +
>> + � � � � � � if (mm->arg_end != mm->env_start)
>> + � � � � � � � � � � /* PR_SET_PROCTITLE_AREA used */
>> � � � � � � � � � � � res = strnlen(buffer, res);
>
> Hang on.
>
> If PR_SET_PROCTITLE_AREA installed a bad address then
> access_process_vm() will return -EFAULT and nothing was written to
> `buffer'?
>
> Also, concurrent PR_SET_PROCTITLE_AREA could cause arg_start and
> arg_end to have inconsistent/incoherent values. �This might result in a
> fault but it also might result in a read of garbage userspace memory.
>
> I guess all of these issues could be written off as "userspace bugs",
> but our behaviour here isn't nice. �We should at least return that
> -EFAULT!.

access_process_vm() does not return an error code - just the number of
bytes successfully transferred. If there's a fault, we just get 0 (or
some intermediate value) back (as discussed on lkml).

>> + � � � � � � � � � � � � � � res += access_process_vm(task, mm->env_start,
>> + � � � � � � � � � � � � � � � � � � � � � � � � � � �buffer+res, len, 0);
>> + � � � � � � � � � � � � � � res = strnlen(buffer, res);
>> + � � � � � � � � � � }
>
> And if access_process_vm() returns a -ve errno here, the code simply
> flies off into the weeds.

This code was originally there - it's just been lifted into an if :)
and since access_process_vm will never return a negative errno, it's
not a problem.

Now, there might be a case for arguing that we should try harder to
get an error code (-EIO if we don't read the number of bytes we asked
for), but /proc/pid/cmdline never has before, and that would then make
this a visible change for consumers of /proc/pid/cmdline. Can ps
handle reading cmdline returning -EIO?

> seqlocks are nice but I have to wonder if they made this patch
> unnecessarily complex. �Couldn't we just do:
>
> � � � � PR_SET_PROCTITLE_AREA:
>
> � � � � � � � �spin_lock(mm->lock);
> � � � � � � � �mm->arg_start = addr;
> � � � � � � � �mm->arg_end = addr + len;
> � � � � � � � �spin_unlock(mm->lock);
>
> and
>
> � � � �proc_pid_cmdline()
> � � � �{
> � � � � � � � �unsigned long addr;
> � � � � � � � �unsigned long end;
>
> � � � � � � � �spin_lock(mm->lock);
> � � � � � � � �addr = mm->arg_start;
> � � � � � � � �end = mm->arg_end;
> � � � � � � � �spin_unlock(mm->lock);
>
> � � � � � � � �<use `addr' and `len'>
> � � � �}
>
> ?

As discussed on the lkml thread, this opens up a nasty race:

Process A: calls PR_SET_PROCTITLE_AREA
Process B: read cmdline:
Process B: spin_lock
Process B: read mm->arg_*
Process B: unlock
Process A: mm->arg_* = ...
Process A: free(old_cmdline_area)
Process A: secret_password = malloc(...)
Process A: strcpy(secret_password, stuff);
Process B: access_process_vm

If secret_password == arg_start, then process B just read secret
information from process A's address space. Since process A has no
idea when or even if process B will complete its read, it has no way
of protecting itself from this, save from never reusing any memory
which has ever been assigned to a proctitle area.

The solution is to use the seqlock to detect this, and prevent the
secret information from ever making it back to process B's userspace.
Note that it's not enough to just recheck arg_start, as process A may
reassign the proctitle area back to its original position after having
it somewhere else for a while.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Andrew Morton on
On Sat, 10 Oct 2009 15:32:35 +0900 KOSAKI Motohiro <kosaki.motohiro(a)jp.fujitsu.com> wrote:

> >>> The solution is to use the seqlock to detect this, and prevent the
> >>> secret information from ever making it back to process B's userspace.
> >>> Note that it's not enough to just recheck arg_start, as process A may
> >>> reassign the proctitle area back to its original position after having
> >>> it somewhere else for a while.
> >>
> >> Well seqlock is _a_ solution. __Another is to use a mutex or an rwsem
> >> around the whole operation.
> >>
> >> With the code as you propose it, what happens if a process sits in a
> >> tight loop running setproctitle? __Do other processes running `ps' get
> >> stuck in a livelock until the offending process gets scheduled out?
> >
> > It does seem like a maximum spin count should be put in there - and
> > maybe a timeout as well (since with FUSE etc it's possible to engineer
> > page faults that take arbitrarily long).
> > Also, it occurs to me that:
>
> makes sense.
> I like maximum spin rather than timeout.

Start simple. What's wrong with mutex_lock() on the reader and writer
sides? rwsems might be OK too.

In both cases we should think about whether persistent readers can
block the writer excessively though.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Bryan Donlan on
On Sat, Oct 10, 2009 at 2:32 AM, KOSAKI Motohiro
<kosaki.motohiro(a)jp.fujitsu.com> wrote:

>> It does seem like a maximum spin count should be put in there - and
>> maybe a timeout as well (since with FUSE etc it's possible to engineer
>> page faults that take arbitrarily long).
>> Also, it occurs to me that:
>
> makes sense.
> I like maximum spin rather than timeout.

I'm worried about the scenario where process A sets its cmdline buffer
to point to a page which will take a _VERY_ long time to pagein (maybe
forever), and then process B goes to try to read its cmdline. What
happens now?
Process A can arrange for this to happen by using a FUSE filesystem
that sits on a read forever. And since the first thing the admin's
likely to do to track down the problem is 'ps awux', this is liable to
be a rather nasty DoS...

Of course, this is no worse than it is now - it's already possible to
replace the page in question. But we should think about ways this
could be fixed for good...

>
>>> + � � do {
>>> + � � � � � � seq = read_seqbegin(&mm->arg_lock);
>>> +
>>> + � � � � � � len = mm->arg_end - mm->arg_start;
>>> + � � � � � � if (len > PAGE_SIZE)
>>> + � � � � � � � � � � len = PAGE_SIZE;
>>
>> If arg_end or arg_start are modified after this, is it truly safe to
>> assume that len will remain <= PAGE_SIZE without a memory barrier
>> before the conditional?
>
> 1) access_process_vm() doesn't return error value.
> 2) read_seqretry(&mm->arg_lock, seq)) check seq, not mm->arg_start or len.
>
> then, if arg_{start,end} is modified, access_process_vm() may return 0
> and strnlen
> makes bad calculation, but read_seqretry() can detect its modify
> rightly. I think.

No, I'm worried about what if the compiler decides to rewrite like so:
if (mm->arg_end - mm->arg_start > PAGE_SIZE)
len = PAGE_SIZE;
else /* here we reload arg_end/arg_start! */
len = mm->arg_end - mm->arg_start;

Now we might write into buffer more than PAGE_SIZE bytes, which is
probably a buffer overrun into kernel space...
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Américo Wang on
On Sun, Nov 01, 2009 at 09:16:27PM +0900, KOSAKI Motohiro wrote:
>
>ChangeLog
> v3 -> v4
> - Use mutex instead seq_lock as akpm requested.
>
>========================================
>
>From: Timo Sirainen <tss(a)iki.fi>


Please see my comments below.

>
>Currently glibc2 doesn't have setproctitle(3), so several userland
>daemons attempt to emulate it by doing some brutal stack modifications.
>This works most of the time, but it has problems. For example:
>
> % ps -ef |grep avahi-daemon
> avahi 1679 1 0 09:20 ? 00:00:00 avahi-daemon: running [kosadesk.local]
>
> # cat /proc/1679/cmdline
> avahi-daemon: running [kosadesk.local]
>
>This looks good, but the process has also overwritten its environment
>area and made the environ file useless:
>
> # cat /proc/1679/environ
> adesk.local]
>
>Another problem is that the process title length is limited by the size of
>the environment. Security conscious people try to avoid potential information
>leaks by clearing most of the environment before running a daemon:
>
> # env - MINIMUM_NEEDED_VAR=foo /path/to/daemon
>
>The resulting environment size may be too small to fit the wanted process
>titles.
>
>This patch makes it possible for userspace to implement setproctitle()
>cleanly. It adds a new PR_SET_PROCTITLE_AREA option for prctl(), which
>updates task's mm_struct->arg_start and arg_end to the given area.
>
> test_setproctitle.c
> ================================================
> #include <string.h>
> #include <stdlib.h>
> #include <unistd.h>
> #include <stdio.h>
> #include <sys/prctl.h>
>
> #define ERR(str) (perror(str), exit(1))
>
> void settitle(char* title){
> int err;
>
> err = prctl(34, title, strlen(title)+1);
> if (err < 0)
> ERR("prctl ");
> }
>
> void main(void){
> long i;
> char buf[1024];
>
> for (i = 0; i < 10000000000LL; i++){
> sprintf(buf, "loooooooooooooooooooooooong string %d",i);
> settitle(buf);
> }
> }
> ==================================================
>
>Cc: Bryan Donlan <bdonlan(a)gmail.com>
>Cc: Ulrich Drepper <drepper(a)redhat.com>
>Signed-off-by: KOSAKI Motohiro <kosaki.motohiro(a)jp.fujitsu.com>
>Signed-off-by: Timo Sirainen <tss(a)iki.fi>
>---
> fs/proc/base.c | 31 ++++++++++++++++++++++---------
> include/linux/mm_types.h | 2 ++
> include/linux/prctl.h | 3 +++
> kernel/fork.c | 1 +
> kernel/sys.c | 22 ++++++++++++++++++++++
> 5 files changed, 50 insertions(+), 9 deletions(-)
>
>diff --git a/fs/proc/base.c b/fs/proc/base.c
>index 837469a..ac800b4 100644
>--- a/fs/proc/base.c
>+++ b/fs/proc/base.c
>@@ -255,32 +255,45 @@ static int proc_pid_cmdline(struct task_struct *task, char * buffer)
> int res = 0;
> unsigned int len;
> struct mm_struct *mm = get_task_mm(task);
>+
> if (!mm)
> goto out;
>+
>+ /* The process was not constructed yet? */
> if (!mm->arg_end)
> goto out_mm; /* Shh! No looking before we're done */
>
>- len = mm->arg_end - mm->arg_start;
>-
>+ mutex_lock(&mm->arg_lock);
>+ len = mm->arg_end - mm->arg_start;
> if (len > PAGE_SIZE)
> len = PAGE_SIZE;
>-
>+
> res = access_process_vm(task, mm->arg_start, buffer, len, 0);
>+ if (mm->arg_end != mm->env_start)
>+ /* prctl(PR_SET_PROCTITLE_AREA) used */
>+ goto out_unlock;
>
>- // If the nul at the end of args has been overwritten, then
>- // assume application is using setproctitle(3).
>+ /*
>+ * If the nul at the end of args has been overwritten, then assume
>+ * application is using sendmail's SPT_REUSEARGV style argv override.
>+ */
> if (res > 0 && buffer[res-1] != '\0' && len < PAGE_SIZE) {
> len = strnlen(buffer, res);
>- if (len < res) {
>- res = len;
>- } else {
>+ if (len < res)
>+ res = len;
>+ else {
> len = mm->env_end - mm->env_start;
> if (len > PAGE_SIZE - res)
> len = PAGE_SIZE - res;
>- res += access_process_vm(task, mm->env_start, buffer+res, len, 0);
>+ res += access_process_vm(task, mm->env_start,
>+ buffer+res, len, 0);
> res = strnlen(buffer, res);
> }
> }
>+
>+out_unlock:
>+ mutex_unlock(&mm->arg_lock);
>+
> out_mm:
> mmput(mm);
> out:
>diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>index 84a524a..3e2a346 100644
>--- a/include/linux/mm_types.h
>+++ b/include/linux/mm_types.h
>@@ -12,6 +12,7 @@
> #include <linux/completion.h>
> #include <linux/cpumask.h>
> #include <linux/page-debug-flags.h>
>+#include <linux/mutex.h>
> #include <asm/page.h>
> #include <asm/mmu.h>
>
>@@ -236,6 +237,7 @@ struct mm_struct {
> unsigned long stack_vm, reserved_vm, def_flags, nr_ptes;
> unsigned long start_code, end_code, start_data, end_data;
> unsigned long start_brk, brk, start_stack;
>+ struct mutex arg_lock;
> unsigned long arg_start, arg_end, env_start, env_end;
>
> unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */
>diff --git a/include/linux/prctl.h b/include/linux/prctl.h
>index 9311505..da47542 100644
>--- a/include/linux/prctl.h
>+++ b/include/linux/prctl.h
>@@ -90,4 +90,7 @@
>
> #define PR_MCE_KILL 33
>
>+/* Set process title memory area for setproctitle() */
>+#define PR_SET_PROCTITLE_AREA 34
>+
> #endif /* _LINUX_PRCTL_H */
>diff --git a/kernel/fork.c b/kernel/fork.c
>index 4c20fff..881a6b4 100644
>--- a/kernel/fork.c
>+++ b/kernel/fork.c
>@@ -459,6 +459,7 @@ static struct mm_struct * mm_init(struct mm_struct * mm, struct task_struct *p)
> mm->cached_hole_size = ~0UL;
> mm_init_aio(mm);
> mm_init_owner(mm, p);
>+ mutex_init(&mm->arg_lock);
>
> if (likely(!mm_alloc_pgd(mm))) {
> mm->def_flags = 0;
>diff --git a/kernel/sys.c b/kernel/sys.c
>index 255475d..434ea13 100644
>--- a/kernel/sys.c
>+++ b/kernel/sys.c
>@@ -1564,6 +1564,28 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
> error = 0;
> break;
>
>+ case PR_SET_PROCTITLE_AREA: {
>+ struct mm_struct *mm = current->mm;
>+ unsigned long addr = arg2;
>+ unsigned long len = arg3;
>+ unsigned long end = arg2 + arg3;
>+
>+ if (len > PAGE_SIZE)
>+ return -EINVAL;
>+
>+ if (addr >= end)
>+ return -EINVAL;
>+
>+ if (!access_ok(VERIFY_READ, addr, len))
>+ return -EFAULT;
>+
>+ mutex_lock(&mm->arg_lock);
>+ mm->arg_start = addr;

Is this safe? You're assigning a user-space pointer to kernel space...
Don't we need copy_from_user()?

>+ mm->arg_end = addr + len;


Since you already have 'end', no need to caculate this again. :)


>+ mutex_unlock(&mm->arg_lock);
>+
>+ return 0;
>+ }
> default:
> error = -EINVAL;
> break;


--
Live like a child, think like the god.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/