From: Randy Dunlap on
On Sat, 1 May 2010 10:14:53 -0400 Oren Laadan wrote:

> From: Sukadev Bhattiprolu <sukadev(a)linux.vnet.ibm.com>
>
> This gives a brief overview of the eclone() system call. We should
> eventually describe more details in existing clone(2) man page or in
> a new man page.
>
> Signed-off-by: Sukadev Bhattiprolu <sukadev(a)linux.vnet.ibm.com>
> Acked-by: Serge E. Hallyn <serue(a)us.ibm.com>
> Acked-by: Oren Laadan <orenl(a)cs.columbia.edu>
> ---
> Documentation/eclone | 348 ++++++++++++++++++++++++++++++++++++++++++++++++++
> 1 files changed, 348 insertions(+), 0 deletions(-)
> create mode 100644 Documentation/eclone
>
> diff --git a/Documentation/eclone b/Documentation/eclone
> new file mode 100644
> index 0000000..c2f1b4b
> --- /dev/null
> +++ b/Documentation/eclone
> @@ -0,0 +1,348 @@
> +
> +struct clone_args {
> + u64 clone_flags_high;
> + u64 child_stack;
> + u64 child_stack_size;
> + u64 parent_tid_ptr;
> + u64 child_tid_ptr;
> + u32 nr_pids;
> + u32 reserved0;
> +};
> +
> +
> +sys_eclone(u32 flags_low, struct clone_args * __user cargs, int cargs_size,
> + pid_t * __user pids)
> +
> + In addition to doing everything that clone() system call does, the

that the clone()

> + eclone() system call:
> +
> + - allows additional clone flags (31 of 32 bits in the flags
> + parameter to clone() are in use)
> +
> + - allows user to specify a pid for the child process in its
> + active and ancestor pid namespaces.
> +
> + This system call is meant to be used when restarting an application
> + from a checkpoint. Such restart requires that the processes in the
> + application have the same pids they had when the application was
> + checkpointed. When containers are nested, the processes within the
> + containers exist in multiple pid namespaces and hence have multiple
> + pids to specify during restart.
> +
> + The @flags_low parameter is identical to the 'clone_flags' parameter
> + in existing clone() system call.

in the existing

> +
> + The fields in 'struct clone_args' are meant to be used as follows:
> +
> + u64 clone_flags_high:
> +
> + When eclone() supports more than 32 flags, the additional bits
> + in the clone_flags should be specified in this field. This
> + field is currently unused and must be set to 0.
> +
> + u64 child_stack;
> + u64 child_stack_size;
> +
> + These two fields correspond to the 'child_stack' fields in
> + clone() and clone2() (on IA64) system calls. The usage of
> + these two fields depends on the processor architecture.
> +
> + Most architectures use ->child_stack to pass-in a stack-pointer

to pass in

> + itself and don't need the ->child_stack_size field. On these
> + architectures the ->child_stack_size field must be 0.
> +
> + Some architectures, eg IA64, use ->child_stack to pass-in the

e.g. to pass in

> + base of the region allocated for stack. These architectures
> + must pass in the size of the stack-region in ->child_stack_size.

stack region

Seems unfortunate that different architectures use the fields differently.

> +
> + u64 parent_tid_ptr;
> + u64 child_tid_ptr;
> +
> + These two fields correspond to the 'parent_tid_ptr' and
> + 'child_tid_ptr' fields in the clone() system call

system call.

> +
> + u32 nr_pids;
> +
> + nr_pids specifies the number of pids in the @pids array
> + parameter to eclone() (see below). nr_pids should not exceed
> + the current nesting level of the calling process (i.e if the

i.e.

> + process is in init_pid_ns, nr_pids must be 1, if process is
> + in a pid namespace that is a child of init-pid-ns, nr_pids
> + cannot exceed 2, and so on).
> +
> + u32 reserved0;
> + u64 reserved1;
> +
> + These fields are intended to extend the functionality of the
> + eclone() in the future, while preserving backward compatibility.
> + They must be set to 0 for now.

The struct does not have a reserved1 field AFAICT.

> + The @cargs_size parameter specifies the sizeof(struct clone_args) and
> + is intended to enable extending this structure in the future, while
> + preserving backward compatibility. For now, this field must be set
> + to the sizeof(struct clone_args) and this size must match the kernel's
> + view of the structure.
> +
> + The @pids parameter defines the set of pids that should be assigned to
> + the child process in its active and ancestor pid namespaces. The
> + descendant pid namespaces do not matter since a process does not have a
> + pid in descendant namespaces, unless the process is in a new pid
> + namespace in which case the process is a container-init (and must have
> + the pid 1 in that namespace).
> +
> + See CLONE_NEWPID section of clone(2) man page for details about pid

of the clone(2)

> + namespaces.
> +
> + If a pid in the @pids list is 0, the kernel will assign the next
> + available pid in the pid namespace.
> +
> + If a pid in the @pids list is non-zero, the kernel tries to assign
> + the specified pid in that namespace. If that pid is already in use
> + by another process, the system call fails (see EBUSY below).
> +
> + The order of pids in @pids is oldest in pids[0] to youngest pid
> + namespace in pids[nr_pids-1]. If the number of pids specified in the
> + @pids list is fewer than the nesting level of the process, the pids
> + are applied from youngest namespace. i.e if the process is nested in

the youngest namespace. I.e.

> + a level-6 pid namespace and @pids only specifies 3 pids, the 3 pids
> + are applied to levels 6, 5 and 4. Levels 0 through 3 are assumed to
> + have a pid of '0' (the kernel will assign a pid in those namespaces).
> +
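
The ordering rule above might be easier to follow with a small example.
This is only a sketch of how I read the text: expand_pids() and its
parameters are made-up names, not anything in the patch, and it assumes
level 0 is init_pid_ns and level `depth` is the caller's active namespace.

```c
#include <stddef.h>

/*
 * Hypothetical helper (not part of the patch): expand a short @pids list
 * into one requested pid per namespace level. Level 0 is the oldest
 * namespace (init_pid_ns); level `depth` is the caller's active one.
 * pids[] is ordered oldest-first and is applied to the youngest levels.
 * Returns 0 on success, -1 if nr_pids does not fit the nesting level.
 */
static int expand_pids(int depth, const int *pids, int nr_pids, int *per_level)
{
	int nr_levels = depth + 1;		/* levels 0..depth inclusive */
	int first = nr_levels - nr_pids;	/* first level with an explicit pid */
	int i;

	if (nr_pids < 0 || nr_pids > nr_levels)
		return -1;

	/* Levels below `first` get 0, meaning "kernel picks the next free pid". */
	for (i = 0; i < nr_levels; i++)
		per_level[i] = (i < first) ? 0 : pids[i - first];

	return 0;
}
```

With depth 6 and three pids, the three values land on levels 4, 5 and 6,
and levels 0 through 3 are left at 0, which matches the paragraph above.
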
> + On success, the system call returns the pid of the child process in
> + the parent's active pid namespace.
> +
> + On failure, eclone() returns -1 and sets 'errno' to one of the following
> + values (the child process is not created).
> +
> + EPERM Caller does not have the CAP_SYS_ADMIN privilege needed to
> + specify the pids in this call (if pids are not specified
> + CAP_SYS_ADMIN is not required).
> +
> + EINVAL The number of pids specified in 'clone_args.nr_pids' exceeds
> + the current nesting level of parent process

process.

> +
> + EINVAL Not all specified clone-flags are valid.
> +
> + EINVAL The reserved fields in the clone_args argument are not 0.
> +
> + EINVAL The child_stack_size field is not 0 (on architectures that
> + pass in a stack pointer in ->child_stack field)

field).

> +
> + EBUSY A requested pid is in use by another process in that namespace.
> +
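
To check my reading of the error list, here is a rough user-space sketch
of the argument checks. It is not kernel code: eclone_precheck() and its
parameters are hypothetical, the order of the checks is my guess, and I am
assuming "pids specified" means nr_pids > 0.

```c
#include <errno.h>
#include <stdint.h>

struct clone_args {
	uint64_t clone_flags_high;
	uint64_t child_stack;
	uint64_t child_stack_size;
	uint64_t parent_tid_ptr;
	uint64_t child_tid_ptr;
	uint32_t nr_pids;
	uint32_t reserved0;
};

/*
 * Hypothetical user-space pre-check. Returns 0 or a negative errno from
 * the list above. EBUSY is omitted because only the kernel can tell
 * whether a pid is already taken.
 *
 * ns_level:          caller's pid namespace nesting level (0 = init_pid_ns)
 * arch_uses_size:    non-zero where ->child_stack is a region base (e.g. IA64)
 * has_cap_sys_admin: non-zero if the caller holds CAP_SYS_ADMIN
 */
static int eclone_precheck(const struct clone_args *ca, int ns_level,
			   int arch_uses_size, int has_cap_sys_admin)
{
	/* Assumption: "pids specified" means nr_pids > 0. */
	if (ca->nr_pids > 0 && !has_cap_sys_admin)
		return -EPERM;

	/* At most one pid per namespace level, levels 0..ns_level. */
	if (ca->nr_pids > (uint32_t)ns_level + 1)
		return -EINVAL;

	/* No high flag bits are defined yet, so any set bit is invalid. */
	if (ca->clone_flags_high != 0)
		return -EINVAL;

	if (ca->reserved0 != 0)
		return -EINVAL;

	/* Stack-pointer architectures must leave the size field at 0. */
	if (!arch_uses_size && ca->child_stack_size != 0)
		return -EINVAL;

	return 0;
}
```
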
> +---


Is this example program meant to build only on i386?

On x86_64 I get:

eclone-syscall-test.c: In function 'do_clone':
eclone-syscall-test.c:166: warning: assignment makes pointer from integer without a cast
/tmp/cc0OrhU3.o: In function `do_clone':
eclone-syscall-test.c:(.text+0x173): undefined reference to `setup_stack'
eclone-syscall-test.c:(.text+0x1e2): undefined reference to `eclone'


> +/*
> + * Example eclone() usage - Create a child process with pid CHILD_TID1 in
> + * the current pid namespace. The child gets the usual "random" pid in any
> + * ancestor pid namespaces.
> + */
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <signal.h>
> +#include <errno.h>
> +#include <unistd.h>
> +#include <wait.h>
> +#include <sys/syscall.h>
> +
> +#define __NR_eclone 337
> +#define CLONE_NEWPID 0x20000000
> +#define CLONE_CHILD_SETTID 0x01000000
> +#define CLONE_PARENT_SETTID 0x00100000
> +#define CLONE_UNUSED 0x00001000
> +
> +#define STACKSIZE 8192
> +
> +typedef unsigned long long u64;
> +typedef unsigned int u32;
> +typedef int pid_t;
> +struct clone_args {
> + u64 clone_flags_high;
> + u64 child_stack;
> + u64 child_stack_size;
> +
> + u64 parent_tid_ptr;
> + u64 child_tid_ptr;
> +
> + u32 nr_pids;
> +
> + u32 reserved0;
> +};
> +
> +#define exit _exit
> +
> +/*
> + * Following eclone() is based on code posted by Oren Laadan at:
> + * https://lists.linux-foundation.org/pipermail/containers/2009-June/018463.html
> + */
> +#if defined(__i386__) && defined(__NR_eclone)
> +
> +int eclone(u32 flags_low, struct clone_args *clone_args, int args_size,
> + int *pids)
> +{
> + long retval;
> +
> + __asm__ __volatile__(
> + "movl %3, %%ebx\n\t" /* flags_low -> 1st (ebx) */
> + "movl %4, %%ecx\n\t" /* clone_args -> 2nd (ecx)*/
> + "movl %5, %%edx\n\t" /* args_size -> 3rd (edx) */
> + "movl %6, %%edi\n\t" /* pids -> 4th (edi)*/
> +
> + "pushl %%ebp\n\t" /* save value of ebp */
> + "int $0x80\n\t" /* Linux/i386 system call */
> + "testl %0,%0\n\t" /* check return value */
> + "jne 1f\n\t" /* jump if parent */
> +
> + "popl %%esi\n\t" /* get subthread function */
> + "call *%%esi\n\t" /* start subthread function */
> + "movl %2,%0\n\t"
> + "int $0x80\n" /* exit system call: exit subthread */
> + "1:\n\t"
> + "popl %%ebp\t" /* restore parent's ebp */
> +
> + :"=a" (retval)
> +
> + :"0" (__NR_eclone),
> + "i" (__NR_exit),
> + "m" (flags_low),
> + "m" (clone_args),
> + "m" (args_size),
> + "m" (pids)
> + );
> +
> + if (retval < 0) {
> + errno = -retval;
> + retval = -1;
> + }
> + return retval;
> +}
> +
> +/*
> + * Allocate a stack for the clone-child and arrange to have the child
> + * execute @child_fn with @child_arg as the argument.
> + */
> +void *setup_stack(int (*child_fn)(void *), void *child_arg, int size)
> +{
> + void *stack_base;
> + void **stack_top;
> +
> + stack_base = malloc(size + size);
> + if (!stack_base) {
> + perror("malloc()");
> + exit(1);
> + }
> +
> + stack_top = (void **)((char *)stack_base + (size - 4));
> + *--stack_top = child_arg;
> + *--stack_top = child_fn;
> +
> + return stack_top;
> +}
> +#endif
> +
> +/* gettid() is a bit more useful than getpid() when messing with clone() */
> +int gettid()
> +{
> + int rc;
> +
> + rc = syscall(__NR_gettid, 0, 0, 0);
> + if (rc < 0) {
> + printf("rc %d, errno %d\n", rc, errno);
> + exit(1);
> + }
> + return rc;
> +}
> +
> +#define CHILD_TID1 377
> +#define CHILD_TID2 1177
> +#define CHILD_TID3 2799
> +
> +struct clone_args clone_args;
> +void *child_arg = &clone_args;
> +int child_tid;
> +
> +int do_child(void *arg)
> +{
> + struct clone_args *cs = (struct clone_args *)arg;
> + int ctid;
> +
> + /* Verify we pushed the arguments correctly on the stack... */
> + if (arg != child_arg) {
> + printf("Child: Incorrect child arg pointer, expected %p, "
> + "actual %p\n", child_arg, arg);
> + exit(1);
> + }
> +
> + /* ... and that we got the thread-id we expected */
> + ctid = *((int *)(unsigned long)cs->child_tid_ptr);
> + if (ctid != CHILD_TID1) {
> + printf("Child: Incorrect child tid, expected %d, actual %d\n",
> + CHILD_TID1, ctid);
> + exit(1);
> + } else {
> + printf("Child got the expected tid, %d\n", gettid());
> + }
> + sleep(2);
> +
> + printf("[%d, %d]: Child exiting\n", getpid(), ctid);
> + exit(0);
> +}
> +
> +static int do_clone(int (*child_fn)(void *), void *child_arg,
> + unsigned int flags_low, int nr_pids, pid_t *pids_list)
> +{
> + int rc;
> + void *stack;
> + struct clone_args *ca = &clone_args;
> + int args_size;
> +
> + stack = setup_stack(child_fn, child_arg, STACKSIZE);
> +
> + memset(ca, 0, sizeof(*ca));
> +
> + ca->child_stack = (u64)(unsigned long)stack;
> + ca->child_stack_size = (u64)0;
> + ca->child_tid_ptr = (u64)(unsigned long)&child_tid;
> + ca->nr_pids = nr_pids;
> +
> + args_size = sizeof(struct clone_args);
> + rc = eclone(flags_low, ca, args_size, pids_list);
> +
> + printf("[%d, %d]: eclone() returned %d, error %d\n", getpid(), gettid(),
> + rc, errno);
> + return rc;
> +}
> +
> +/*
> + * Multiple pid_t values in pids_list[] here are just for illustration.
> + * The test case creates a child in the current pid namespace and uses only
> + * the first value, CHILD_TID1.
> + */
> +pid_t pids_list[] = { CHILD_TID1, CHILD_TID2, CHILD_TID3 };
> +int main()
> +{
> + int rc, pid, status;
> + unsigned long flags;
> + int nr_pids = 1;
> +
> + flags = SIGCHLD|CLONE_CHILD_SETTID;
> +
> + pid = do_clone(do_child, &clone_args, flags, nr_pids, pids_list);
> +
> + printf("[%d, %d]: Parent waiting for %d\n", getpid(), gettid(), pid);
> +
> + rc = waitpid(pid, &status, __WALL);
> + if (rc < 0) {
> + printf("waitpid(): rc %d, error %d\n", rc, errno);
> + } else {
> + printf("[%d, %d]: child %d:\n\t wait-status 0x%x\n", getpid(),
> + gettid(), rc, status);
> +
> + if (WIFEXITED(status)) {
> + printf("\t EXITED, %d\n", WEXITSTATUS(status));
> + } else if (WIFSIGNALED(status)) {
> + printf("\t SIGNALED, %d\n", WTERMSIG(status));
> + }
> + }
> + return 0;
> +}
> --


---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Sukadev Bhattiprolu on
Randy Dunlap [randy.dunlap(a)oracle.com] wrote:
| > + base of the region allocated for stack. These architectures
| > + must pass in the size of the stack-region in ->child_stack_size.
|
| stack region
|
| Seems unfortunate that different architectures use the fields differently.

Yes and no. The field still has a single purpose, just that some architectures
may not need it. We enforce that if unused on an architecture, the field must
be 0. It looked like the easiest way to keep the API common across
architectures.

|
| Is this example program meant to build only on i386?

Yes. Will add a pointer to the clone*.[chS] and libeclone.a files in

git://git.ncl.cs.columbia.edu/pub/git/user-cr.git

for other architectures (currently x86_64, ppc, s390).

Thanks for the review. Will fix the errors and repost.

Sukadev
From: Albert Cahalan on
Sukadev Bhattiprolu writes:

> Randy Dunlap [randy.dunlap at oracle.com] wrote:
>>> base of the region allocated for stack. These architectures
>>> must pass in the size of the stack-region in ->child_stack_size.
>>
>> stack region
>>
>> Seems unfortunate that different architectures use
>> the fields differently.
>
> Yes and no. The field still has a single purpose, just that
> some architectures may not need it. We enforce that if unused
> on an architecture, the field must be 0. It looked like
> the easiest way to keep the API common across architectures.

Yuck. You're forcing userspace to have #ifdef messes or,
more likely, just not work on all architectures. There is
no reason to have field usage vary by architecture. The
original clone syscall was not designed with ia64 and hppa
in mind, and has been causing trouble ever since. Let's not
perpetuate the problem.

Given code like this: stack_base = malloc(stack_size);
stack_base and stack_size are what the kernel needs.
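
As a rough sketch of what I mean: with base and size in hand, an
architecture whose stack grows down can compute the initial stack pointer
itself, so userspace never has to. The helper name and the 16-byte
alignment below are my assumptions for illustration, not anything in the
patch.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: derive the initial stack pointer from a
 * (base, size) pair for a stack that grows toward lower addresses.
 * The 16-byte alignment is an assumption; the real requirement is
 * architecture-specific.
 */
static uintptr_t stack_top_from_region(void *stack_base, size_t stack_size)
{
	uintptr_t top = (uintptr_t)stack_base + stack_size;

	return top & ~(uintptr_t)15;	/* round down to a 16-byte boundary */
}
```

An architecture that wants the region itself, like ia64, would simply use
the same two values unmodified.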

I suspect that you chose the defective method for some reason
related to restarting processes that were created with the
older system calls. I can't say most of us even care, but in
that broken-already case your process restarter can make up
some numbers that will work. (for i386, the base could be the
lowest address in the vma in which %esp lies, or even address 0)

A related issue is that stack allocation and deallocation can
be quite painful: it is difficult (some assembly required) to
free one's own stack, and impossible if one is already dead.
We could use a flag to let the kernel handle allocation, with
the stack getting freed just after any ptracer gets a last look.
This issue is especially troublesome for me because the syscall
essentially requires per-thread memory to work; it is currently
extremely difficult to use the syscall in code which lacks that.
From: Sukadev Bhattiprolu on
Albert Cahalan [acahalan(a)gmail.com] wrote:
| Sukadev Bhattiprolu writes:
|
| > Randy Dunlap [randy.dunlap at oracle.com] wrote:
| >>> base of the region allocated for stack. These architectures
| >>> must pass in the size of the stack-region in ->child_stack_size.
| >>
| >> stack region
| >>
| >> Seems unfortunate that different architectures use
| >> the fields differently.
| >
| > Yes and no. The field still has a single purpose, just that
| > some architectures may not need it. We enforce that if unused
| > on an architecture, the field must be 0. It looked like
| > the easiest way to keep the API common across architectures.
|
| Yuck. You're forcing userspace to have #ifdef messes or,
| more likely, just not work on all architectures.

There is going to be #ifdef code in the library interface to eclone().
But applications should not need any #ifdefs. Please see the test cases
for eclone in

git://git.sr71.net/~hallyn/cr_tests.git

There is no #ifdef and the tests work on x86, x86_64, ppc, s390.

These use the libeclone.a built from the following git tree, which has the
arch-dependent user-space code.

git://git.ncl.cs.columbia.edu/pub/git/user-cr.git

Is that the #ifdef mess you are talking about? I don't see that as a
consequence of the API. So maybe you can elaborate.

| There is no reason to have field usage vary by architecture. The

The field usage does not vary by architecture. Some architectures
don't use some fields and those fields must be 0. A simple

memset(&clone_args, 0, sizeof(clone_args))

before initializing fields is all that is required.
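
As a minimal sketch of the caller side (init_clone_args() is a made-up
name for illustration, and the struct is copied from the patch with
fixed-width types standing in for u64/u32):

```c
#include <stdint.h>
#include <string.h>

struct clone_args {
	uint64_t clone_flags_high;
	uint64_t child_stack;
	uint64_t child_stack_size;
	uint64_t parent_tid_ptr;
	uint64_t child_tid_ptr;
	uint32_t nr_pids;
	uint32_t reserved0;
};

/*
 * Hypothetical helper: zero the whole structure first, then set only the
 * fields this caller needs. Anything the architecture does not use
 * (child_stack_size here) and the reserved fields stay 0.
 */
static void init_clone_args(struct clone_args *ca, void *stack,
			    int *child_tid, uint32_t nr_pids)
{
	memset(ca, 0, sizeof(*ca));

	ca->child_stack = (uint64_t)(uintptr_t)stack;
	ca->child_tid_ptr = (uint64_t)(uintptr_t)child_tid;
	ca->nr_pids = nr_pids;
}
```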

| original clone syscall was not designed with ia64 and hppa
| in mind, and has been causing trouble ever since. Let's not
| perpetuate the problem.

and a lot of folks contributed to this new API to try to make sure
it is portable and meets the foreseeable requirements.

|
| Given code like this: stack_base = malloc(stack_size);
| stack_base and stack_size are what the kernel needs.
|
| I suspect that you chose the defective method for some reason
| related to restarting processes that were created with the
| older system calls. I can't say most of us even care, but in
| that broken-already case your process restarter can make up
| some numbers that will work. (for i386, the base could be the
| lowest address in the vma in which %esp lies, or even address 0)

I don't understand how "making up some numbers (pids) that will work"
is more portable/cleaner than the proposed eclone().

Sukadev
From: Albert Cahalan on
On Tue, Jun 1, 2010 at 3:32 PM, Sukadev Bhattiprolu
<sukadev(a)linux.vnet.ibm.com> wrote:
> Albert Cahalan [acahalan(a)gmail.com] wrote:
> | Sukadev Bhattiprolu writes:
> | > Randy Dunlap [randy.dunlap at oracle.com] wrote:

> | >>> base of the region allocated for stack. These architectures
> | >>> must pass in the size of the stack-region in ->child_stack_size.
> | >>
> | >> stack region
> | >>
> | >> Seems unfortunate that different architectures use
> | >> the fields differently.
> | >
> | > Yes and no. The field still has a single purpose, just that
> | > some architectures may not need it. We enforce that if unused
> | > on an architecture, the field must be 0. It looked like
> | > the easiest way to keep the API common across architectures.
> |
> | Yuck. You're forcing userspace to have #ifdef messes or,
> | more likely, just not work on all architectures.
>
> There is going to be #ifdef code in the library interface to eclone().
> But applications should not need any #ifdefs. Please see the test cases
> for eclone in
>
> git://git.sr71.net/~hallyn/cr_tests.git
>
> There is no #ifdef and the tests work on x86, x86_64, ppc, s390.

Come on, seriously, you know it's ia64 and hppa that
have issues. Maybe the nommu ports also have issues.

The only portable way to specify the stack is base and offset,
with flags or magic values for "share" and "kernel managed".

> | There is no reason to have field usage vary by architecture. The
>
> The field usage does not vary by architecture. Some architectures
> don't use some fields and those fields must be 0.

It looks like you contradict yourself. Please explain how
those two sentences are compatible.

> | original clone syscall was not designed with ia64 and hppa
> | in mind, and has been causing trouble ever since. Let's not
> | perpetuate the problem.
>
> and a lot of folks contributed to this new API to try to make sure
> it is portable and meets the foreseeable requirements.

Right, and some folks were ignored.

> | Given code like this: stack_base = malloc(stack_size);
> | stack_base and stack_size are what the kernel needs.
> |
> | I suspect that you chose the defective method for some reason
> | related to restarting processes that were created with the
> | older system calls. I can't say most of us even care, but in
> | that broken-already case your process restarter can make up
> | some numbers that will work. (for i386, the base could be the
> | lowest address in the vma in which %esp lies, or even address 0)
>
> I don't understand how "making up some numbers (pids) that will work"
> is more portable/cleaner than the proposed eclone().

It isolates the cross-platform problems to an obscure tool
instead of polluting the kernel interface that everybody uses.