From: Steven Rostedt on
On Thu, 2010-05-20 at 10:38 -0400, Steven Rostedt wrote:
> On Thu, 2010-05-20 at 11:31 +0200, Ingo Molnar wrote:

> # modprobe ring_buffer_benchmark producer_fifo=10
> # sleep 30
> # cat /debug/tracing/trace
>
>
> <...>-2636 [003] 761.969850: ring_buffer_producer_thread: Starting ring buffer hammer
> <...>-2636 [003] 771.910121: ring_buffer_producer_thread: End ring buffer hammer
> <...>-2636 [003] 771.910172: ring_buffer_producer_thread: Running Consumer at nice: 19
> <...>-2636 [003] 771.910172: ring_buffer_producer_thread: Running Producer at SCHED_FIFO 10
> <...>-2636 [003] 771.910173: ring_buffer_producer_thread: Time: 9965321 (usecs)
> <...>-2636 [003] 771.910173: ring_buffer_producer_thread: Overruns: 0
> <...>-2636 [003] 771.910174: ring_buffer_producer_thread: Read: 76580325 (by pages)
> <...>-2636 [003] 771.910175: ring_buffer_producer_thread: Entries: 75
> <...>-2636 [003] 771.910175: ring_buffer_producer_thread: Total: 76580400
> <...>-2636 [003] 771.910175: ring_buffer_producer_thread: Missed: 0
> <...>-2636 [003] 771.910176: ring_buffer_producer_thread: Hit: 76580400
> <...>-2636 [003] 771.910176: ring_buffer_producer_thread: Entries per millisec: 7684
> <...>-2636 [003] 771.910176: ring_buffer_producer_thread: 130 ns per entry
> <...>-2636 [003] 771.910177: ring_buffer_producer_thread: Sleeping for 10 secs
>
> <...>-2636 [003] 781.884686: ring_buffer_producer_thread: Starting ring buffer hammer
> <...>-2636 [003] 790.862392: ring_buffer_producer_thread: End ring buffer hammer
> <...>-2636 [003] 790.897399: ring_buffer_producer_thread: Running Consumer at nice: 19
> <...>-2636 [003] 790.897400: ring_buffer_producer_thread: Running Producer at SCHED_FIFO 10
> <...>-2636 [003] 790.897401: ring_buffer_producer_thread: Time: 9000321 (usecs)
> <...>-2636 [003] 790.897401: ring_buffer_producer_thread: Overruns: 26581200
> <...>-2636 [003] 790.897402: ring_buffer_producer_thread: Read: 6767200 (by events)
> <...>-2636 [003] 790.897403: ring_buffer_producer_thread: Entries: 0
> <...>-2636 [003] 790.897403: ring_buffer_producer_thread: Total: 33348400
> <...>-2636 [003] 790.897403: ring_buffer_producer_thread: Missed: 0
> <...>-2636 [003] 790.897404: ring_buffer_producer_thread: Hit: 33348400
> <...>-2636 [003] 790.897404: ring_buffer_producer_thread: Entries per millisec: 3705
> <...>-2636 [003] 790.897405: ring_buffer_producer_thread: 269 ns per entry
>
> While having the reader read by pages (splice mode), a write takes 130ns
> per entry. Not that bad.
>
> But when the reader is reading each event (the path is still lockless,
> but I assume we get some cache line bouncing here), a write takes 269ns.
>
> Note, the reader is an aggressive reader. That is, it sits in a tight
> loop and reads as fast as it can, which would cause cache line
> bouncing.
>
> When I disable the reader and just measure what a write takes:
>
> # rmmod ring_buffer_benchmark
> # modprobe ring_buffer_benchmark disable_reader=1 producer_fifo=10
> # sleep 30
> # cat /debug/tracing/trace
>
> <...>-2649 [003] 1141.981639: ring_buffer_producer_thread: Starting ring buffer hammer
> <...>-2649 [003] 1151.950840: ring_buffer_producer_thread: End ring buffer hammer
> <...>-2649 [003] 1151.950841: ring_buffer_producer_thread: Running Producer at SCHED_FIFO 10
> <...>-2649 [003] 1151.950842: ring_buffer_producer_thread: Time: 9994238 (usecs)
> <...>-2649 [003] 1151.950842: ring_buffer_producer_thread: Overruns: 90635670
> <...>-2649 [003] 1151.950843: ring_buffer_producer_thread: Read: (reader disabled)
> <...>-2649 [003] 1151.950843: ring_buffer_producer_thread: Entries: 62480
> <...>-2649 [003] 1151.950844: ring_buffer_producer_thread: Total: 90698150
> <...>-2649 [003] 1151.950844: ring_buffer_producer_thread: Missed: 0
> <...>-2649 [003] 1151.950844: ring_buffer_producer_thread: Hit: 90698150
> <...>-2649 [003] 1151.950845: ring_buffer_producer_thread: Entries per millisec: 9075
> <...>-2649 [003] 1151.950845: ring_buffer_producer_thread: 110 ns per entry
> <...>-2649 [003] 1151.950846: ring_buffer_producer_thread: Sleeping for 10 secs
>
> It takes just 110ns.

I checked out the latest tip/perf/core with commit:

commit dfacc4d6c98b89609250269f518c1f54c30454ef
Merge: f869097 85cb68b
Author: Ingo Molnar <mingo@elte.hu>
Date: Thu May 20 14:38:55 2010 +0200

Merge branch 'perf/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing into perf/core

This seems to include the latest optimizations from Peter.

I decided to modify kernel/trace/ring_buffer_benchmark.c to
benchmark the perf ring buffer. I spent an hour or so trying to get it
to work, but since the perf ring buffer is so tightly coupled with perf,
I gave up on trying to test it directly. It requires a reader, and there
is no good way to set up a buffer and read it from the kernel.

So instead, I figured I would test perf and ftrace with TRACE_EVENT().

Here's what I did, and my changes are at the bottom of this email as a
patch too.

I created a function within the kernel proper (since the benchmark is a
module) called ring_buffer_benchmark_caller(). All this does is call:

trace_rb_benchmark(raw_smp_processor_id());

Where trace_rb_benchmark is created with TRACE_EVENT() (as shown in the
patch below).

I modified the ring buffer benchmark to take a "trace_event" parameter,
where instead of testing the ring buffer itself, it just times the calls
to "ring_buffer_benchmark_caller()", which include the trace event.
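
Condensed from the patch below, the timed inner loop boils down to:

	for (i = 0; i < write_iteration; i++) {
		/*
		 * With trace_event=1, ring_buffer_write_event() calls
		 * ring_buffer_benchmark_caller(), which fires the
		 * TRACE_EVENT(); otherwise it does a raw
		 * ring_buffer_lock_reserve()/unlock_commit() pair.
		 */
		if (ring_buffer_write_event() < 0)
			missed++;
		else
			hit++;
	}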

This is a nice way to see the difference between perf and ftrace, since
this is the code path that they both use.

Note, I noticed that the ring buffer has slowed down since 2.6.34,
probably due to the "lost events" recording that I added. It went from
110ns to 153ns per entry in the normal benchmark (not this one).

Anyway, the benchmark pauses until the tracepoint is activated, since
the tracepoint is what sets rb_start_bench. The first time the
tracepoint is actually called, the benchmark begins.
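
The handshake is just a global flag: TP_fast_assign() sets
rb_start_bench the first time the event actually fires (i.e. once it
has been enabled from perf or ftrace), and the producer polls that flag
while repeatedly calling the event. Excerpted from the patch below:

	TP_fast_assign(
		rb_start_bench = 1;	/* first enabled call starts the run */
		__entry->mydata = mydata;
	),

	/* producer side: poll the flag, firing the (possibly disabled) event */
	while (!rb_start_bench) {
		set_current_state(TASK_INTERRUPTIBLE);
		if (kthread_should_stop())
			break;
		schedule_timeout(HZ/10);
		ring_buffer_benchmark_caller();
	}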

Here's what I did:

# modprobe ring_buffer_benchmark producer_fifo=10 trace_event=1
# perf record -e rb_bench:rb_benchmark -c1 -a sleep 30
# cat /debug/tracing/trace

Here are the results:

# tracer: nop
#
#           TASK-PID    CPU#    TIMESTAMP  FUNCTION
#              | |       |          |         |
<...>-22836 [001] 2804.849264: : wait to start
<...>-22836 [001] 2816.419909: : starting
<...>-22836 [001] 2816.419911: : Starting ring buffer hammer
<...>-22836 [002] 2826.056105: : End ring buffer hammer
<...>-22836 [002] 2826.056108: : Running Producer at SCHED_FIFO 10
<...>-22836 [002] 2826.056109: : Time: 9660458 (usecs)
<...>-22836 [002] 2826.056110: : Overruns: 0
<...>-22836 [002] 2826.056110: : Read: (reader disabled)
<...>-22836 [002] 2826.056111: : Entries: 0
<...>-22836 [002] 2826.056111: : Total: 0
<...>-22836 [002] 2826.056112: : Missed: 0
<...>-22836 [002] 2826.056112: : Hit: 14637150
<...>-22836 [002] 2826.056113: : Entries per millisec: 1515
<...>-22836 [002] 2826.056113: : 660 ns per entry
<...>-22836 [002] 2826.056114: : Sleeping for 10 secs
<...>-22836 [002] 2836.030464: : wait to start
<...>-22836 [002] 2836.030465: : starting
<...>-22836 [002] 2836.030465: : Starting ring buffer hammer
<...>-22836 [001] 2845.008379: : End ring buffer hammer
<...>-22836 [001] 2845.008383: : Running Producer at SCHED_FIFO 10
<...>-22836 [001] 2845.008384: : Time: 9000518 (usecs)
<...>-22836 [001] 2845.008384: : Overruns: 0
<...>-22836 [001] 2845.008384: : Read: (reader disabled)
<...>-22836 [001] 2845.008385: : Entries: 0
<...>-22836 [001] 2845.008385: : Total: 0
<...>-22836 [001] 2845.008386: : Missed: 0
<...>-22836 [001] 2845.008387: : Hit: 13642200
<...>-22836 [001] 2845.008387: : Entries per millisec: 1515
<...>-22836 [001] 2845.008388: : 660 ns per entry

Writing an event in perf took 660ns each (9,660,458 usecs / 14,637,150
hits works out to the reported 660ns).


To do the same for ftrace, I first had to set a filter. This is because
the ftrace events write to the same buffer that the ring buffer benchmark
writes its own output to, and I would otherwise have to search through
thousands of rb_benchmark events to find that output. The filter
(mydata == 10) never matches on this box, since mydata is just the CPU
number, so each event goes through the filter check and is then discarded
rather than recorded. But since filtering is itself fairly expensive, this
should still give us a basic idea.

# modprobe ring_buffer_benchmark producer_fifo=10 trace_event=1
# echo mydata == 10 > /debug/tracing/events/rb_bench/rb_benchmark/filter
# echo 1 > /debug/tracing/events/rb_bench/rb_benchmark/enable
# cat /debug/tracing/trace

# tracer: nop
#
#           TASK-PID    CPU#    TIMESTAMP  FUNCTION
#              | |       |          |         |
rb_producer-22849 [000] 3015.236994: ring_buffer_producer_thread: wait to start
rb_producer-22849 [000] 3032.193995: ring_buffer_producer_thread: starting
rb_producer-22849 [000] 3032.193996: ring_buffer_producer_thread: Starting ring buffer hammer
rb_producer-22849 [000] 3041.513072: ring_buffer_producer_thread: End ring buffer hammer
rb_producer-22849 [000] 3041.513073: ring_buffer_producer_thread: Running Producer at SCHED_FIFO 10
rb_producer-22849 [000] 3041.513074: ring_buffer_producer_thread: Time: 9342579 (usecs)
rb_producer-22849 [000] 3041.513074: ring_buffer_producer_thread: Overruns: 0
rb_producer-22849 [000] 3041.513075: ring_buffer_producer_thread: Read: (reader disabled)
rb_producer-22849 [000] 3041.513075: ring_buffer_producer_thread: Entries: 0
rb_producer-22849 [000] 3041.513076: ring_buffer_producer_thread: Total: 0
rb_producer-22849 [000] 3041.513076: ring_buffer_producer_thread: Missed: 0
rb_producer-22849 [000] 3041.513077: ring_buffer_producer_thread: Hit: 51588000
rb_producer-22849 [000] 3041.513077: ring_buffer_producer_thread: Entries per millisec: 5522
rb_producer-22849 [000] 3041.513078: ring_buffer_producer_thread: 181 ns per entry
rb_producer-22849 [000] 3041.513078: ring_buffer_producer_thread: Sleeping for 10 secs
rb_producer-22849 [000] 3051.487356: ring_buffer_producer_thread: wait to start
rb_producer-22849 [000] 3051.487356: ring_buffer_producer_thread: starting
rb_producer-22849 [000] 3051.487357: ring_buffer_producer_thread: Starting ring buffer hammer
rb_producer-22849 [000] 3060.465270: ring_buffer_producer_thread: End ring buffer hammer
rb_producer-22849 [000] 3060.465271: ring_buffer_producer_thread: Running Producer at SCHED_FIFO 10
rb_producer-22849 [000] 3060.465272: ring_buffer_producer_thread: Time: 9000562 (usecs)
rb_producer-22849 [000] 3060.465272: ring_buffer_producer_thread: Overruns: 0
rb_producer-22849 [000] 3060.465272: ring_buffer_producer_thread: Read: (reader disabled)
rb_producer-22849 [000] 3060.465273: ring_buffer_producer_thread: Entries: 0
rb_producer-22849 [000] 3060.465273: ring_buffer_producer_thread: Total: 0
rb_producer-22849 [000] 3060.465274: ring_buffer_producer_thread: Missed: 0
rb_producer-22849 [000] 3060.465274: ring_buffer_producer_thread: Hit: 49675400
rb_producer-22849 [000] 3060.465275: ring_buffer_producer_thread: Entries per millisec: 5519
rb_producer-22849 [000] 3060.465275: ring_buffer_producer_thread: 181 ns per entry

Note, enabling an event in ftrace also enables recording of the task
comm, which is why "rb_producer" shows up above instead of "<...>".


Here we see each event took 181ns when called via TRACE_EVENT() and then
discarded by the filter.

But since I'm discarding the events, this may be giving ftrace an unfair
advantage. Thus, I modified the ring buffer benchmark code as follows:

#undef trace_printk
#define trace_printk(fmt, args...) printk(fmt, ##args)

This makes the ring buffer benchmark write to the console instead of the
ring buffer, and lets me run without filtering.

# modprobe ring_buffer_benchmark producer_fifo=10 trace_event=1
# echo 0 > /debug/tracing/events/rb_bench/rb_benchmark/filter
# echo 1 > /debug/tracing/events/rb_bench/rb_benchmark/enable

I got this:

wait to start
starting
Starting ring buffer hammer
End ring buffer hammer
Running Producer at SCHED_FIFO 10
Time: 9972749 (usecs)
Overruns: 0
Read: (reader disabled)
Entries: 0
Total: 0
Missed: 0
Hit: 57079200
Entries per millisec: 5723
174 ns per entry
Sleeping for 10 secs
wait to start
starting
Starting ring buffer hammer
End ring buffer hammer
Running Producer at SCHED_FIFO 10
Time: 9972786 (usecs)
Overruns: 0
Read: (reader disabled)
Entries: 0
Total: 0
Missed: 0
Hit: 57086350
Entries per millisec: 5724
174 ns per entry
Sleeping for 10 secs
wait to start
starting
Starting ring buffer hammer
End ring buffer hammer
Running Producer at SCHED_FIFO 10
Time: 9972772 (usecs)
Overruns: 0
Read: (reader disabled)
Entries: 0
Total: 0
Missed: 0
Hit: 57058250
Entries per millisec: 5721
174 ns per entry
Sleeping for 10 secs

The non-filtering case (the one that actually records) was 174ns per event.
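
To put the numbers from this mail and the quoted one side by side:

	ring buffer, reader by pages (splice):  130 ns per entry
	ring buffer, reader by events:          269 ns per entry
	ring buffer, reader disabled:           110 ns per entry (153 ns on current tip)
	TRACE_EVENT() -> perf:                  660 ns per entry
	TRACE_EVENT() -> ftrace (filtered out): 181 ns per entry
	TRACE_EVENT() -> ftrace (recording):    174 ns per entry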


Here's the patch:

-- Steve

diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
index ffb1a5b..3b559d3 100644
--- a/kernel/trace/Makefile
+++ b/kernel/trace/Makefile
@@ -21,10 +21,12 @@ endif
#
obj-y += trace_clock.o

+CFLAGS_ring_buffer_bm_event.o += -I$(src)
+
obj-$(CONFIG_FUNCTION_TRACER) += libftrace.o
obj-$(CONFIG_RING_BUFFER) += ring_buffer.o
obj-$(CONFIG_RING_BUFFER_BENCHMARK) += ring_buffer_benchmark.o
-
+obj-y += ring_buffer_bm_event.o
obj-$(CONFIG_TRACING) += trace.o
obj-$(CONFIG_TRACING) += trace_output.o
obj-$(CONFIG_TRACING) += trace_stat.o
diff --git a/kernel/trace/ring_buffer_benchmark.c b/kernel/trace/ring_buffer_benchmark.c
index 302f8a6..9d5cdb1 100644
--- a/kernel/trace/ring_buffer_benchmark.c
+++ b/kernel/trace/ring_buffer_benchmark.c
@@ -4,12 +4,23 @@
  * Copyright (C) 2009 Steven Rostedt <srostedt@redhat.com>
  */
 #include <linux/ring_buffer.h>
+#include <linux/ftrace_event.h>
 #include <linux/completion.h>
 #include <linux/kthread.h>
 #include <linux/module.h>
 #include <linux/time.h>
 #include <asm/local.h>
 
+#include "ring_buffer_bm_event.h"
+
+/* Write the strings directly into the buffers. */
+#undef trace_printk
+#define trace_printk(fmt, args...) \
+do { \
+	__trace_printk_check_format(fmt, ##args); \
+	__trace_printk(_THIS_IP_, fmt, ##args); \
+} while (0)
+
 struct rb_page {
 	u64 ts;
 	local_t commit;
@@ -32,6 +43,10 @@ static struct task_struct *producer;
static struct task_struct *consumer;
static unsigned long read;

+static int trace_event;
+module_param(trace_event, uint, 0644);
+MODULE_PARM_DESC(trace_event, "record a trace event");
+
static int disable_reader;
module_param(disable_reader, uint, 0644);
MODULE_PARM_DESC(disable_reader, "only run producer");
@@ -211,6 +226,50 @@ static void ring_buffer_consumer(void)
 	complete(&read_done);
 }
 
+static inline int ring_buffer_write_trace_event(void)
+{
+	ring_buffer_benchmark_caller();
+	return 0;
+}
+
+static inline int ring_buffer_write_rb_event(void)
+{
+	struct ring_buffer_event *event;
+	int *entry;
+
+	event = ring_buffer_lock_reserve(buffer, 10);
+	if (!event)
+		return -1;
+
+	entry = ring_buffer_event_data(event);
+	*entry = smp_processor_id();
+	ring_buffer_unlock_commit(buffer, event);
+
+	return 0;
+}
+
+static inline int ring_buffer_write_event(void)
+{
+	if (trace_event)
+		return ring_buffer_write_trace_event();
+	else
+		return ring_buffer_write_rb_event();
+}
+
+static int wait_to_start(void)
+{
+	while (!rb_start_bench) {
+		set_current_state(TASK_INTERRUPTIBLE);
+		if (kthread_should_stop())
+			break;
+		schedule_timeout(HZ/10);
+		ring_buffer_benchmark_caller();
+	}
+	__set_current_state(TASK_RUNNING);
+
+	return kthread_should_stop();
+}
+
 static void ring_buffer_producer(void)
 {
 	struct timeval start_tv;
@@ -223,6 +282,13 @@ static void ring_buffer_producer(void)
 	unsigned long avg;
 	int cnt = 0;
 
+	if (trace_event) {
+		trace_printk("wait to start\n");
+		if (wait_to_start())
+			return;
+		trace_printk("starting\n");
+	}
+
 	/*
 	 * Hammer the buffer for 10 secs (this may
 	 * make the system stall)
@@ -230,20 +296,13 @@ static void ring_buffer_producer(void)
trace_printk("Starting ring buffer hammer\n");
do_gettimeofday(&start_tv);
do {
- struct ring_buffer_event *event;
- int *entry;
int i;

for (i = 0; i < write_iteration; i++) {
- event = ring_buffer_lock_reserve(buffer, 10);
- if (!event) {
+ if (ring_buffer_write_event() < 0)
missed++;
- } else {
+ else
hit++;
- entry = ring_buffer_event_data(event);
- *entry = smp_processor_id();
- ring_buffer_unlock_commit(buffer, event);
- }
}
do_gettimeofday(&end_tv);

@@ -425,6 +484,12 @@ static int __init ring_buffer_benchmark_init(void)
 	if (!buffer)
 		return -ENOMEM;
 
+	/* The reader is userspace */
+	if (trace_event) {
+		disable_reader = 1;
+		rb_start_bench = 0;
+	}
+
 	if (!disable_reader) {
 		consumer = kthread_create(ring_buffer_consumer_thread,
 					  NULL, "rb_consumer");
diff --git a/kernel/trace/ring_buffer_bm_event.c b/kernel/trace/ring_buffer_bm_event.c
new file mode 100644
index 0000000..862e80c
--- /dev/null
+++ b/kernel/trace/ring_buffer_bm_event.c
@@ -0,0 +1,13 @@
+#include <linux/module.h>
+
+#define CREATE_TRACE_POINTS
+#include "ring_buffer_bm_event.h"
+
+int rb_start_bench;
+EXPORT_SYMBOL(rb_start_bench);
+
+void ring_buffer_benchmark_caller(void)
+{
+	trace_rb_benchmark(raw_smp_processor_id());
+}
+EXPORT_SYMBOL(ring_buffer_benchmark_caller);
diff --git a/kernel/trace/ring_buffer_bm_event.h b/kernel/trace/ring_buffer_bm_event.h
new file mode 100644
index 0000000..a167c6d
--- /dev/null
+++ b/kernel/trace/ring_buffer_bm_event.h
@@ -0,0 +1,38 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM rb_bench
+
+#if !defined(_TRACE_RB_BENCHMARK_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_RB_BENCHMARK_H
+
+#include <linux/tracepoint.h>
+
+extern int rb_start_bench;
+void ring_buffer_benchmark_caller(void);
+
+TRACE_EVENT(rb_benchmark,
+
+	TP_PROTO(int mydata),
+
+	TP_ARGS(mydata),
+
+	TP_STRUCT__entry(
+		__field(	int,	mydata	)
+	),
+
+	TP_fast_assign(
+		rb_start_bench = 1;
+		__entry->mydata = mydata;
+	),
+
+	TP_printk("data=%d", __entry->mydata)
+);
+
+#endif /* _TRACE_RB_BENCHMARK_H */
+
+#undef TRACE_INCLUDE_FILE
+#undef TRACE_INCLUDE_PATH
+#define TRACE_INCLUDE_PATH .
+#define TRACE_INCLUDE_FILE ring_buffer_bm_event
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>

From: Li Zefan on
Ingo Molnar wrote:
> * Steven Rostedt <rostedt@goodmis.org> wrote:
>
>> More than a year and a half ago (September 2008), at
>> Linux Plumbers, we had a meeting with several kernel
>> developers to come up with a unified ring buffer. A
>> generic ring buffer in the kernel that any subsystem
>> could use. After coming up with a set of requirements, I
>> worked on implementing it. One of the requirements was
>> to start off simple and work to become a more complete
>> buffering system.
>>
>> [...]
>
> The thing is, in tracing land and more broadly in
> instrumentation land we have _much_ more earthly problems
> these days:
>
> - Let's face it, performance of the ring-buffer sucks, in
> a big way. I've recently benchmarked it and it takes
> hundreds of instructions to trace a single event.
> Puh-lease ...
>

We ran some benchmarks with all the trace events enabled except
lock and kmem events, and the results showed the overhead was
quite small and acceptable.

But that was on 2.6.31; we haven't benchmarked newer kernels, which have
more tracepoints.