From: Andrew Morton on
On Fri, 11 Jun 2010 15:49:54 -0700
Salman <sqazi(a)google.com> wrote:

> A program that repeatedly forks and waits is susceptible to having the
> same pid repeated, especially when it competes with another instance of the
> same program. This is really bad for the bash implementation. Furthermore,
> many shell scripts assume that pid numbers will not be used for some length
> of time.
>
> Race Description:
>
> ...
>
> diff --git a/kernel/pid.c b/kernel/pid.c
> index e9fd8c1..fbbd5f6 100644
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
> @@ -122,6 +122,43 @@ static void free_pidmap(struct upid *upid)
>  	atomic_inc(&map->nr_free);
>  }
>
> +/*
> + * If we started walking pids at 'base', is 'a' seen before 'b'?
> + */
> +static int pid_before(int base, int a, int b)
> +{
> +	/*
> +	 * This is the same as saying
> +	 *
> +	 * (a - base + MAXUINT) % MAXUINT < (b - base + MAXUINT) % MAXUINT
> +	 * and that mapping orders 'a' and 'b' with respect to 'base'.
> +	 */
> +	return (unsigned)(a - base) < (unsigned)(b - base);
> +}

pid.c uses an exotic mix of `int' and `pid_t' to represent pids. `int'
seems to preponderate.
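
For anyone following the arithmetic in pid_before(), here is a small
userspace sketch of how the unsigned subtraction orders two pids relative
to 'base'. The expression is copied from the patch; the self-check
function and the pid values in it are made up for illustration. Unsigned
arithmetic in C wraps modulo UINT_MAX + 1, which is what makes a pid that
is only reached after rollover look "far away" from base.

```c
#include <assert.h>

/* Same expression as pid_before() in the quoted patch. */
static int pid_before(int base, int a, int b)
{
	return (unsigned)(a - base) < (unsigned)(b - base);
}

/* Hypothetical self-check; the pid values are invented for illustration. */
static void pid_before_selfcheck(void)
{
	/* No rollover: walking up from 100, pid 150 is seen before pid 200. */
	assert(pid_before(100, 150, 200));
	assert(!pid_before(100, 200, 150));

	/*
	 * Rollover: walking up from 32000, pid 32500 is seen before pid 300,
	 * although 300 is numerically smaller.
	 */
	assert(pid_before(32000, 32500, 300));
	assert(!pid_before(32000, 300, 32500));

	/* Equal pids: neither one is seen before the other. */
	assert(!pid_before(100, 150, 150));
}
```

The point of the second pair of checks is that "bigger pid wins" would
give the wrong answer once the search has wrapped.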

> +/*
> + * We might be racing with someone else trying to set pid_ns->last_pid.
> + * We want the winner to have the "later" value, because if the
> + * "earlier" value prevails, then a pid may get reused immediately.
> + *
> + * Since pids rollover, it is not sufficient to just pick the bigger
> + * value. We have to consider where we started counting from.
> + *
> + * 'base' is the value of pid_ns->last_pid that we observed when
> + * we started looking for a pid.
> + *
> + * 'pid' is the pid that we eventually found.
> + */
> +static void set_last_pid(struct pid_namespace *pid_ns, int base, int pid)
> +{
> +	int prev;
> +	int last_write = base;
> +	do {
> +		prev = last_write;
> +		last_write = cmpxchg(&pid_ns->last_pid, prev, pid);
> +	} while ((prev != last_write) && (pid_before(base, last_write, pid)));
> +}
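
The retry loop in set_last_pid() can be tried out in userspace. The
sketch below is not kernel code: it assumes C11 <stdatomic.h>, uses a
plain global in place of pid_ns->last_pid, and wraps
atomic_compare_exchange_strong() in a helper (cmpxchg_int, a name
invented here) so that it returns the old value the way the kernel's
cmpxchg() does. replay_race() is also hypothetical; it runs two
"racing" writers one after the other so the outcome is deterministic.

```c
#include <assert.h>
#include <stdatomic.h>

/* Stand-in for pid_ns->last_pid; a plain global, for illustration only. */
static atomic_int last_pid;

static int pid_before(int base, int a, int b)
{
	return (unsigned)(a - base) < (unsigned)(b - base);
}

/*
 * Userspace stand-in for cmpxchg(): store 'newval' only if *p still
 * equals 'old', and return the value that was found in *p.
 */
static int cmpxchg_int(atomic_int *p, int old, int newval)
{
	/* On failure, C11 writes the value it found back into 'old'. */
	atomic_compare_exchange_strong(p, &old, newval);
	return old;
}

/* Same loop as set_last_pid() in the quoted patch. */
static void set_last_pid(int base, int pid)
{
	int prev;
	int last_write = base;

	do {
		prev = last_write;
		last_write = cmpxchg_int(&last_pid, prev, pid);
	} while (prev != last_write && pid_before(base, last_write, pid));
}

/* Both writers observed 'base'; they publish their pids in the given order. */
static int replay_race(int base, int first_pid, int second_pid)
{
	atomic_store(&last_pid, base);
	set_last_pid(base, first_pid);
	set_last_pid(base, second_pid);
	return atomic_load(&last_pid);
}

static void set_last_pid_selfcheck(void)
{
	/* Whichever writer goes last, the "later" pid (102) must survive. */
	assert(replay_race(100, 102, 101) == 102);
	assert(replay_race(100, 101, 102) == 102);

	/* After rollover, 50 is later than 32750 when counting from 32700. */
	assert(replay_race(32700, 50, 32750) == 50);
	assert(replay_race(32700, 32750, 50) == 50);
}
```

In every ordering the value left in last_pid is the one furthest from
base in walking order, which is the property the patch comment asks for.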

<gets distracted>

hm. For a long time cmpxchg() wasn't available on all architectures.
That _seems_ to have been fixed.

arch/score assumes that cmpxchg() operates on unsigned longs.

arch/powerpc plays the necessary games to make 4- and 8-byte scalars work.

ia64 handles 1, 2, 4 and 8-byte quantities.

arm handles 1, 2 and 4-byte scalars.

as does blackfin.

So from the few architectures I looked at, it seems that we do indeed
handle cmpxchg() on all architectures although not very consistently.
arch/score will blow up if someone tries to use cmpxchg() on 1- or
2-byte scalars.
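
To make the "which sizes are handled" question concrete, here is a rough
userspace sketch of a cmpxchg that dispatches on operand size. It is not
taken from any architecture's headers: cmpxchg_sized() and cmpxchg_demo()
are names invented for this example, and the work is done by the
GCC/Clang __sync_val_compare_and_swap() builtin. Whether the 8-byte case
is available at all depends on the target, which is the same problem the
survey above runs into.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical size-dispatching helper. Returns the value found at
 * 'ptr' before the operation, widened to 64 bits.
 */
static uint64_t cmpxchg_sized(volatile void *ptr, uint64_t old,
			      uint64_t newval, size_t size)
{
	switch (size) {
	case 1:
		return __sync_val_compare_and_swap((volatile uint8_t *)ptr,
						   (uint8_t)old, (uint8_t)newval);
	case 2:
		return __sync_val_compare_and_swap((volatile uint16_t *)ptr,
						   (uint16_t)old, (uint16_t)newval);
	case 4:
		return __sync_val_compare_and_swap((volatile uint32_t *)ptr,
						   (uint32_t)old, (uint32_t)newval);
	case 8:
		return __sync_val_compare_and_swap((volatile uint64_t *)ptr,
						   old, newval);
	default:
		/* An unsupported width should fail loudly, not silently. */
		abort();
	}
}

/* Picks the width from the pointed-to type; sizeof does not evaluate ptr. */
#define cmpxchg_demo(ptr, old, newval) \
	cmpxchg_sized((ptr), (uint64_t)(old), (uint64_t)(newval), sizeof(*(ptr)))

static void cmpxchg_sized_selfcheck(void)
{
	uint8_t b = 5;
	uint32_t w = 7;
	uint64_t seen;

	seen = cmpxchg_demo(&b, 5, 9);	/* old value matches: 9 is stored */
	assert(seen == 5 && b == 9);

	seen = cmpxchg_demo(&w, 1, 2);	/* mismatch: w is left untouched */
	assert(seen == 7 && w == 7);
}
```

An implementation that only covers some of the cases, as arch/score
apparently does, would fall through to the default branch for the rest.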

<looks at the consumers>

infiniband does cmpxchg() on u64*'s, which will blow up on many
architectures.

Using

grep -r '[ ]cmpxchg[^_]' . | grep -v /arch/

I can't see any cmpxchg() callers in truly generic code. lockdep and
kernel/trace/ring_buffer.c aren't used on the more remote
architectures, I think.

Traditionally, atomic_cmpxchg() was the safe and portable one to use.


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: tytso on
On Mon, Jun 14, 2010 at 04:58:51PM -0700, Andrew Morton wrote:
> Using
>
> grep -r '[ ]cmpxchg[^_]' . | grep -v /arch/
>
> I can't see any cmpxchg() callers in truly generic code. lockdep and
> kernel/trace/ring_buffer.c aren't used on the more remote
> architectures, I think.

What about:

drivers/gpu/drm/drm_lock.c: prev = cmpxchg(lock, old, new);
kernel/lockdep.c: n = cmpxchg(&nr_chain_hlocks, cn, cn + chain->de
kernel/sched_clock.c: if (cmpxchg64(&scd->clock, old_clock, clock) != old_cloc
fs/btrfs/inode.c: if (cmpxchg(&root->orphan_cleanup_state, 0, ORPHAN_CLEAN
fs/ext4/inode.c: } while (cmpxchg(&ei->i_flags, old_fl, new_fl) != old_fl

The last is quite new --- I had just recently done a similar set of
research as you did before accepting the patch that added cmpxchg into
ext4 (during the last merge window), and I thought cmpxchg() had
entered the "supported by all architectures" category. It looked like
it had only recently reached that state, but I had reached the conclusion
that it was safe to use.
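
For readers who have not seen the pattern, the ext4 line above is the
tail of a read-modify-write retry loop. A minimal userspace sketch of
that kind of loop follows. It assumes C11 <stdatomic.h>; update_flags()
and its parameters are invented for illustration and are not the ext4
code.

```c
#include <assert.h>
#include <stdatomic.h>

/*
 * Hypothetical helper: atomically replace the bits selected by 'mask'
 * with 'bits', leaving all other bits of *flags as they were.
 */
static void update_flags(atomic_uint *flags, unsigned mask, unsigned bits)
{
	unsigned old_fl = atomic_load(flags);
	unsigned new_fl;

	do {
		new_fl = (old_fl & ~mask) | (bits & mask);
		/* On failure, old_fl is refreshed with the current value. */
	} while (!atomic_compare_exchange_weak(flags, &old_fl, new_fl));
}

static void update_flags_selfcheck(void)
{
	atomic_uint flags;

	atomic_init(&flags, 0xF0u);
	update_flags(&flags, 0x0Fu, 0x05u);	/* set the low nibble to 5 */
	assert(atomic_load(&flags) == 0xF5u);

	update_flags(&flags, 0xF0u, 0x00u);	/* clear the high nibble */
	assert(atomic_load(&flags) == 0x05u);
}
```

The loop only retries when another writer changed the word between the
load and the compare-and-exchange, so concurrent updates to different
bits are not lost.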

- Ted
--
From: Andrew Morton on
On Mon, 14 Jun 2010 20:56:19 -0400 tytso(a)mit.edu wrote:

> On Mon, Jun 14, 2010 at 04:58:51PM -0700, Andrew Morton wrote:
> > Using
> >
> > grep -r '[ ]cmpxchg[^_]' . | grep -v /arch/
> >
> > I can't see any cmpxchg() callers in truly generic code. lockdep and
> > kernel/trace/ring_buffer.c aren't used on the more remote
> > architectures, I think.
>
> What about:
>
> drivers/gpu/drm/drm_lock.c: prev = cmpxchg(lock, old, new);
> kernel/lockdep.c: n = cmpxchg(&nr_chain_hlocks, cn, cn + chain->de

I put these in the not-used-on-weird-architectures bucket.

> kernel/sched_clock.c: if (cmpxchg64(&scd->clock, old_clock, clock) != old_cloc

I guess that'll flush out any stragglers.

I suspect sched_clock.c might be generating fair amounts of code which
UP builds don't need.

> fs/btrfs/inode.c: if (cmpxchg(&root->orphan_cleanup_state, 0, ORPHAN_CLEAN
> fs/ext4/inode.c: } while (cmpxchg(&ei->i_flags, old_fl, new_fl) != old_fl
>
> The last is quite new --- I had just recently done a similar set of
> research as you did before accepting the patch that added cmpxchg into
> ext4 (during the last merge window), and I thought cmpxchg() had
> entered the "supported by all architectures" category. It looked like
> it had only recently reached that state, but I had reached the conclusion
> that it was safe to use.

I think you're probably right, as long as one sticks with 4-byte
scalars. The cmpxchg-is-now-generic change snuck in under the radar
(mine, at least).

--
From: Paul Mackerras on
On Mon, Jun 14, 2010 at 06:55:56PM -0700, Andrew Morton wrote:

> > kernel/sched_clock.c: if (cmpxchg64(&scd->clock, old_clock, clock) != old_cloc
>
> I guess that'll flush out any stragglers.

And break most non-x86 32-bit architectures, including 32-bit powerpc.
Fortunately that code is only used if CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
is set, and it looks like only x86 and ia64 set it.

Paul.
--
From: Andrew Morton on
On Tue, 15 Jun 2010 13:26:08 +1000 Paul Mackerras <paulus(a)samba.org> wrote:

> On Mon, Jun 14, 2010 at 06:55:56PM -0700, Andrew Morton wrote:
>
> > > kernel/sched_clock.c: if (cmpxchg64(&scd->clock, old_clock, clock) != old_cloc
> >
> > I guess that'll flush out any stragglers.
>
> And break most non-x86 32-bit architectures, including 32-bit powerpc.

If CONFIG_SMP=y, yes. On UP there's a generic implementation
(include/asm-generic/cmpxchg-local.h, include/asm-generic/cmpxchg.h).
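
As a rough sketch of why UP is the easy case: with a single CPU the only
thing that can interleave with a compare-and-store is an interrupt, so
masking interrupts around a plain read and write is enough. The code
below is a userspace illustration of that idea, not a copy of the
asm-generic headers. irq_save_stub() and irq_restore_stub() are
placeholders that do nothing here, and cmpxchg64_up_sketch() is a name
made up for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Placeholders: in userspace there are no interrupts to mask. */
static unsigned long irq_save_stub(void)
{
	return 0;
}

static void irq_restore_stub(unsigned long flags)
{
	(void)flags;
}

/* Returns the value found at 'ptr'; stores 'newval' only if it was 'old'. */
static uint64_t cmpxchg64_up_sketch(uint64_t *ptr, uint64_t old,
				    uint64_t newval)
{
	unsigned long flags = irq_save_stub();
	uint64_t prev = *ptr;

	if (prev == old)
		*ptr = newval;
	irq_restore_stub(flags);
	return prev;
}

static void cmpxchg64_up_selfcheck(void)
{
	uint64_t clock = 1000;
	uint64_t seen;

	seen = cmpxchg64_up_sketch(&clock, 1000, 2000);	/* match: stored */
	assert(seen == 1000 && clock == 2000);

	seen = cmpxchg64_up_sketch(&clock, 1000, 3000);	/* stale 'old' */
	assert(seen == 2000 && clock == 2000);
}
```

None of this helps on SMP, where another CPU can touch the word between
the read and the write regardless of the local interrupt state.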

> Fortunately that code is only used if CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
> is set, and it looks like only x86 and ia64 set it.
>

If that happens then the best fix is for those architectures to get
themselves a cmpxchg64(). Unless for some reason it's simply
unimplementable? Worst case I guess one could use a global spinlock.
Second-worst-case: hashed spinlocks.
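
To illustrate the hashed-spinlock fallback, here is a userspace sketch
under a few assumptions: C11 <stdatomic.h>, a made-up table size of 16,
and invented names (locks, lock_for, cmpxchg64_hashed). The target
address picks one lock from a small table, so unrelated words rarely
contend, unlike the single global spinlock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define NR_LOCKS 16	/* arbitrary power of two, chosen for the sketch */

/* Zero-initialized statics: every lock starts out free (0). */
static atomic_int locks[NR_LOCKS];

static atomic_int *lock_for(const volatile void *ptr)
{
	/* Drop the low bits, which are zero for aligned 8-byte objects. */
	return &locks[((uintptr_t)ptr >> 3) & (NR_LOCKS - 1)];
}

/* Returns the value found at 'ptr'; stores 'newval' only if it was 'old'. */
static uint64_t cmpxchg64_hashed(volatile uint64_t *ptr, uint64_t old,
				 uint64_t newval)
{
	atomic_int *lock = lock_for(ptr);
	uint64_t prev;

	while (atomic_exchange_explicit(lock, 1, memory_order_acquire))
		;	/* spin until the current holder releases the lock */

	prev = *ptr;
	if (prev == old)
		*ptr = newval;

	atomic_store_explicit(lock, 0, memory_order_release);
	return prev;
}

static void cmpxchg64_hashed_selfcheck(void)
{
	uint64_t clock = 1000;
	uint64_t seen;

	seen = cmpxchg64_hashed(&clock, 1000, 2000);	/* match: stored */
	assert(seen == 1000 && clock == 2000);

	seen = cmpxchg64_hashed(&clock, 1000, 3000);	/* stale 'old' */
	assert(seen == 2000 && clock == 2000);
}
```

The obvious catch is that this is only atomic with respect to other
callers of the same helper; a plain 64-bit load or store elsewhere can
still tear on a 32-bit machine. A kernel version would presumably also
have to deal with interrupts while the lock is held.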