From: Divyesh Shah
On Sat, Feb 27, 2010 at 3:43 AM, Peter Zijlstra <peterz(a)infradead.org> wrote:
>
> On Fri, 2010-02-26 at 18:03 -0800, Divyesh Shah wrote:
> > This can be used by applications to get finer granularity cputime usage on
> > platforms that use timestamp counters or HPET.
>
> I guess the patch looks good, I'm just not sure what HPET has to do with
> anything.. the scheduler certainly doesn't use HPET for timekeeping, it's
> terribly slow to read.

Yes you're right. Please ignore the HPET comment.

>
> Also, it would be good to get some more justification than 'some
> applications can use this', which is basically a truism for any patch
> that adds a user interface.

1) This should be useful for a shared or virtualized environment where
the shared management infrastructure needs to charge each application
as accurately as possible for cpu usage.
2) An application that services different users and does some
cpu-intensive work may want to measure cpu time spent for each user by
the serving thread.

I think applications like web servers, remote database servers, etc.
fit into the second category.

For units of work smaller than a jiffy this really helps, since a thread
that always yields before the tick fires can hide from jiffy-based
accounting almost entirely.
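
To make the second use case concrete, here is a rough userspace sketch
(illustration only, not part of the patch; the helper name is made up) of
how a serving thread could charge sub-jiffy work to a user by sampling its
own per-thread cputime_ns file around each request:

#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Read this thread's cumulative cpu time in nanoseconds from the
 * per-thread file added by this patch. Returns 0 on any error. */
static unsigned long long thread_cputime_ns(void)
{
	char path[64];
	unsigned long long ns = 0;
	FILE *f;

	snprintf(path, sizeof(path), "/proc/self/task/%ld/cputime_ns",
		 (long)syscall(SYS_gettid));
	f = fopen(path, "r");
	if (!f)
		return 0;
	if (fscanf(f, "%llu", &ns) != 1)
		ns = 0;
	fclose(f);
	return ns;
}

int main(void)
{
	unsigned long long before, after;

	before = thread_cputime_ns();
	/* ... serve one request on behalf of a user ... */
	after = thread_cputime_ns();
	printf("charged %llu ns to this user\n", after - before);
	return 0;
}

A thread that handles many short requests per tick can still be charged
this way, whereas the tick-granularity utime/stime counters in
/proc/<pid>/stat would miss most of that work.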

Please let me know if you want me to send the patch again with the
corrected description and added justification.

>
> > Signed-off-by: Divyesh Shah <dpshah(a)google.com>
> > ---
> >
> >  fs/proc/array.c    |   40 ++++++++++++++++++++++++++++++++++++++++
> >  fs/proc/base.c     |    2 ++
> >  fs/proc/internal.h |    2 ++
> >  3 files changed, 44 insertions(+), 0 deletions(-)
> >
> > diff --git a/fs/proc/array.c b/fs/proc/array.c
> > index 13b5d07..54604b8 100644
> > --- a/fs/proc/array.c
> > +++ b/fs/proc/array.c
> > @@ -547,3 +547,43 @@ int proc_pid_statm(struct seq_file *m, struct pid_namespace *ns,
> >
> >  	return 0;
> >  }
> > +
> > +static int do_task_cputime(struct task_struct *task, char * buffer, int whole)
> > +{
> > +	int res;
> > +	unsigned long flags;
> > +	unsigned long long sum_exec_runtime = 0;
> > +	struct task_struct *t;
> > +
> > +	if (lock_task_sighand(task, &flags)) {
> > +		if (whole) {
> > +			t = task;
> > +			/*
> > +			 * Add up live thread sum_exec_runtime at the group
> > +			 * level.
> > +			 */
> > +			do {
> > +				sum_exec_runtime += t->se.sum_exec_runtime;
> > +				t = next_thread(t);
> > +			} while (t != task);
> > +			sum_exec_runtime += task->signal->sum_sched_runtime;
> > +		}
> > +		unlock_task_sighand(task, &flags);
> > +	}
> > +
> > +	if (!whole)
> > +		sum_exec_runtime = task->se.sum_exec_runtime;
> > +
> > +	res = sprintf(buffer,"%llu\n", sum_exec_runtime);
> > +	return res;
> > +}
> > +
> > +int proc_tid_cputime(struct task_struct *task, char * buffer)
> > +{
> > +	return do_task_cputime(task, buffer, 0);
> > +}
> > +
> > +int proc_tgid_cputime(struct task_struct *task, char * buffer)
> > +{
> > +	return do_task_cputime(task, buffer, 1);
> > +}
> > diff --git a/fs/proc/base.c b/fs/proc/base.c
> > index 58324c2..8fbc785 100644
> > --- a/fs/proc/base.c
> > +++ b/fs/proc/base.c
> > @@ -2595,6 +2595,7 @@ static const struct pid_entry tgid_base_stuff[] = {
> >  	REG("numa_maps",  S_IRUGO, proc_numa_maps_operations),
> >  #endif
> >  	REG("mem",        S_IRUSR|S_IWUSR, proc_mem_operations),
> > +	INF("cputime_ns", S_IRUGO, proc_tgid_cputime),
> >  	LNK("cwd",        proc_cwd_link),
> >  	LNK("root",       proc_root_link),
> >  	LNK("exe",        proc_exe_link),
> > @@ -2930,6 +2931,7 @@ static const struct pid_entry tid_base_stuff[] = {
> >  	REG("numa_maps", S_IRUGO, proc_numa_maps_operations),
> >  #endif
> >  	REG("mem",       S_IRUSR|S_IWUSR, proc_mem_operations),
> > +	INF("cputime_ns", S_IRUGO, proc_tid_cputime),
> >  	LNK("cwd",       proc_cwd_link),
> >  	LNK("root",      proc_root_link),
> >  	LNK("exe",       proc_exe_link),
> > diff --git a/fs/proc/internal.h b/fs/proc/internal.h
> > index 1f24a3e..f9e9799 100644
> > --- a/fs/proc/internal.h
> > +++ b/fs/proc/internal.h
> > @@ -51,6 +51,8 @@ extern int proc_pid_status(struct seq_file *m, struct pid_namespace *ns,
> >  				struct pid *pid, struct task_struct *task);
> >  extern int proc_pid_statm(struct seq_file *m, struct pid_namespace *ns,
> >  				struct pid *pid, struct task_struct *task);
> > +extern int proc_tid_cputime(struct task_struct *, char *);
> > +extern int proc_tgid_cputime(struct task_struct *, char *);
> >  extern loff_t mem_lseek(struct file *file, loff_t offset, int orig);
> >
> >  extern const struct file_operations proc_maps_operations;
> >
>
>
From: Divyesh Shah
On Mon, Mar 1, 2010 at 10:05 AM, Peter Zijlstra <peterz(a)infradead.org> wrote:
> On Mon, 2010-03-01 at 07:44 -0800, Divyesh Shah wrote:
>> On Sat, Feb 27, 2010 at 3:43 AM, Peter Zijlstra <peterz(a)infradead.org> wrote:
>> >
>> > On Fri, 2010-02-26 at 18:03 -0800, Divyesh Shah wrote:
>> > > This can be used by applications to get finer granularity cputime usage on
>> > > platforms that use timestamp counters or HPET.
>> >
>> > I guess the patch looks good, I'm just not sure what HPET has to do with
>> > anything.. the scheduler certainly doesn't use HPET for timekeeping, it's
>> > terribly slow to read.
>>
>> Yes you're right. Please ignore the HPET comment.
>>
>> >
>> > Also, it would be good to get some more justification than 'some
>> > applications can use this', which is basically a truism for any patch
>> > that adds a user interface.
>>
>> 1) This should be useful for a shared or virtualized environment where
>> the shared management infrastructure needs to charge each application
>> as accurately as possible for cpu usage.
>> 2) An application that services different users and does some
>> cpu-intensive work may want to measure cpu time spent for each user by
>> the serving thread.
>>
>> I think applications like web servers, remote database servers, etc.
>> fit into the second category.
>>
>> For units of work smaller than a jiffy this really helps, since a thread
>> that always yields before the tick fires can hide from jiffy-based
>> accounting almost entirely.
>>
>> Please let me know if you want me to send the patch again with the
>> corrected description and added justification.
>
>
> It's all should and may, anything concrete?

Both use cases above are concrete: at Google we are already using this
interface for the shared-environment case and plan to use it for the
second as well. It should be equally useful to other users in the Linux
community with similar workloads.

>
> Also, doesn't muck like schedstat, task_delay_accounting, cpuacct-cgroup
> or some other stuff expose this number already?

I took a look at /proc/<pid>/schedstat. It does export this number along
with some other info, but it doesn't aggregate across threads when the
pid is a thread-group leader, and it is also less accurate than
cputime_ns because it does not include the cputime accumulated since the
last tick. I have a patch that adds both of these to schedstat, but I am
not sure whether that would break userspace tools that depend on the
semantics of the existing interface. Please advise whether it makes more
sense to change the schedstat interface or to add this new one.
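
For reference, this is roughly all userspace can do today (a sketch,
assuming CONFIG_SCHEDSTATS is enabled; the first schedstat field is
sum_exec_runtime in nanoseconds). It reads a single task's counter, so it
has neither of the two properties mentioned above:

#include <stdio.h>

/* Approximation via the existing /proc/<pid>/schedstat: returns only
 * the first field (sum_exec_runtime, in ns) for that one task. */
static unsigned long long schedstat_runtime_ns(int pid)
{
	char path[64];
	unsigned long long exec_ns = 0;
	FILE *f;

	snprintf(path, sizeof(path), "/proc/%d/schedstat", pid);
	f = fopen(path, "r");
	if (!f)
		return 0;
	if (fscanf(f, "%llu", &exec_ns) != 1)
		exec_ns = 0;
	fclose(f);
	return exec_ns;
}

Aggregating a thread group with this means walking /proc/<pid>/task/ and
summing the per-tid values from userspace, which is racy against threads
exiting; that is the gap an aggregated cputime_ns (or an extended
schedstat) would close.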

>
>