From: Ingo Molnar on

* Peter Zijlstra <peterz(a)infradead.org> wrote:

> > What I am proposing does not even involve a copy: when we want to take a
> > snapshot, we just have to force a sub-buffer switch on the ring buffer.
> > The "returns" happening at the beginning of the next (empty) sub-buffer
> > would clearly fail to discard records (expecting non-existing entry
> > records). We would then have to save a small record saying that a function
> > return occurred. The current stack frame at the end of the next sub-buffer
> > could be deduced from the complete collection of stack frame samples.
>
> And suppose the stack-trace was all of 16 entries (not uncommon for a kernel
> stack), then you waste a whole page for 128 bytes (assuming your sub-buffer
> is page sized). I'll take the memcopy, thank you.

To throw some hard numbers into the discussion, I found two random callgraph
perf.data files on my boxes (both created prior to the start of this discussion)
and here is the distribution of their call-chain lengths:

aldebaran:~> perf report -D | grep 'chain: nr:' | cut -d: -f3- | sort -n | uniq -c
2 4
21 6
23 8
13 9
20 10
29 11
21 12
25 13
54 14
112 15
72 16
77 17
35 18
38 19
48 20
29 21
10 22
97 23
3 24
1 25
2 26
2 28
2 29
1 30
2 31

So the peak/average here is around 15 entries (at 8 bytes per entry, roughly
120 bytes per call-chain).

The other one:

phoenix:~> perf report -D | grep 'chain: nr:' | cut -d: -f3- | sort -n | uniq -c
1 2
70 3
222 4
112 5
116 6
329 7
241 8
163 9
203 10
287 11
159 12
4 13
6 14
22 15
2 16
11 17
5 18

Here the average is even lower - around 8 entries.

Thanks,

Ingo
From: Ingo Molnar on

* Dave Chinner <david(a)fromorbit.com> wrote:

> On Tue, Aug 03, 2010 at 11:56:11AM -0700, Linus Torvalds wrote:
> > On Tue, Aug 3, 2010 at 10:18 AM, Peter Zijlstra <peterz(a)infradead.org> wrote:
> > >
> > > FWIW I really utterly detest the whole concept of sub-buffers.
> >
> > I'm not quite sure why. Is it something fundamental, or just an
> > implementation issue?
> >
> > One thing that I think could easily make sense in a _lot_ of buffering
> > areas is the notion of a "continuation" buffer. We know we have cases
> > where we want to attach a lot of data to one particular event, but the
> > buffering itself is inevitably always going to have some limits on
> > atomicity etc. And quite often, the event that _generates_ the data is
> > not necessarily going to have all that data in one contiguous region,
> > and doing a scatter-gather memcpy to get it that way is not good
> > either.
> >
> > At the same time, I do _not_ believe that the kernel ring-buffer code
> > should handle pointers to sub-buffers etc, or worry about iovec-like
> > arrays of smaller ranges. So if _that_ is what you mean by "concept of
> > sub-buffers", then I agree with you.
> >
> > But what I do think might make a lot of sense is to allow buffer
> > fragments, and just teach user space to do de-fragmentation. Where it
> > would be important that the de-fragmentation really is all in user
> > space, and not really ever visible to the ring-buffer implementation
> > itself (and there would not, for example, be any guarantees that the
> > fragments would be contiguous - there could be other events in the
> > buffer in between fragments). Maybe we could even say that fragments
> > might be across different CPU ring-buffers, and user-space needs to
> > sort it out if it wants to (where "sort it out" literally would mean
> > having to sort and re-attach them in the right order, since there
> > wouldn't be any ordering between them).
> >
> > From a kernel perspective, the only thing you need for fragment
> > handling would be to have a buffer entry that just says "I'm fragment
> > number X of event ID Y". Nothing more. Everything else would be up to
> > the parser in user space to work out.
>
> Heh. For a moment there I thought you were describing the way XFS writes
> transactions into its log. Replace "CPU ring-buffers" with "in-core log
> buffers", "userspace parsing" with "log recovery" and "event ID" with
> "transaction ID", and the concept you describe is eerily similar. That
> includes the fact that transactions are not contiguous in the log, can
> interleave fragments between concurrent transaction commits and they can
> span multiple log buffers, too. It works pretty well for scaling concurrent
> writers....

That's certainly a good model when you have to stream into a
persistent-storage transaction log space with multiple writers.

The difference is that with instrumentation we are generally able to make
things per-task or per-CPU, so there's no real multi-CPU 'concurrent writers'
concurrency to deal with.

You don't have that luxury/simplicity when logging to storage, of course!

Thanks,

Ingo
From: Mathieu Desnoyers on
* Peter Zijlstra (peterz(a)infradead.org) wrote:
> On Tue, 2010-08-03 at 11:56 -0700, Linus Torvalds wrote:
> > On Tue, Aug 3, 2010 at 10:18 AM, Peter Zijlstra <peterz(a)infradead.org> wrote:
> > >
> > > FWIW I really utterly detest the whole concept of sub-buffers.
> >
> > I'm not quite sure why. Is it something fundamental, or just an
> > implementation issue?
>
> The sub-buffer thing that both ftrace and lttng have is creating a large
> buffer from a lot of small buffers; I simply don't see the point of
> doing that. It adds complexity and limitations for very little gain.

The first major gain is the ability to implement flight recorder tracing
(overwrite mode), which Perf still lacks.

A second major gain: having these sub-buffers lets the trace analyzer seek in
the trace very efficiently by performing a binary search over time to find the
appropriate sub-buffer. This becomes immensely useful with large traces.
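
To illustrate the idea, a minimal userspace sketch (the sub-buffer header
layout and all names here are hypothetical, not the actual trace format): each
sub-buffer starts with the timestamp of its first event, the analyzer bisects
on those headers, and then only one sub-buffer needs to be scanned linearly.

/*
 * Sketch only; 'struct subbuf_hdr' and 'first_tsc' are hypothetical.
 */
#include <stddef.h>
#include <stdint.h>

struct subbuf_hdr {
	uint64_t first_tsc;	/* timestamp of the first event in this sub-buffer */
};

/*
 * Bisect over the sub-buffer headers of a trace mapped at 'base', made of
 * 'nr_subbuf' sub-buffers of 'subbuf_size' bytes each.  Returns the index
 * of the last sub-buffer starting at or before 'tsc'.
 */
static size_t seek_subbuf(const unsigned char *base, size_t nr_subbuf,
			  size_t subbuf_size, uint64_t tsc)
{
	size_t low = 0, high = nr_subbuf;

	while (high - low > 1) {
		size_t mid = low + (high - low) / 2;
		const struct subbuf_hdr *hdr =
			(const struct subbuf_hdr *)(base + mid * subbuf_size);

		if (hdr->first_tsc <= tsc)
			low = mid;
		else
			high = mid;
	}
	return low;	/* linear event scan continues inside this sub-buffer */
}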

The third major gain: for live streaming of traces, having sub-buffers lets you
"package" the event data you send over the network into sub-buffers. So the
trace analyzer, receiving this information live while the trace is being
recorded, can start using the information as soon as a full sub-buffer is
received. It does not have to play games with the last event (or event header)
perhaps being incompletely sent, which implies that you absolutely _need_ to
save the event size along with each event header (you cannot simply let the
analyzer parse the event payload to determine the size). Here again, space is
wasted. Furthermore, this deals with information loss: a trace is still
readable even if a sub-buffer must be discarded.
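
As a rough illustration of that packaging (the helpers below are assumed
names, nothing more than a sketch of the idea): the sender only ever ships
completed sub-buffers, so the receiver never has to guess at a half-written
event, and losing one sub-buffer in transit costs exactly that sub-buffer.

/* Sketch only: trace_stream, get_full_subbuf() and put_subbuf() are
 * hypothetical helpers standing in for whatever the tracer provides. */
#include <stddef.h>
#include <unistd.h>

struct trace_stream;					/* opaque, assumed */
int get_full_subbuf(struct trace_stream *ts, void **subbuf, size_t *len);
void put_subbuf(struct trace_stream *ts, void *subbuf);

static int stream_ready_subbufs(int sock, struct trace_stream *ts)
{
	void *subbuf;
	size_t len;

	/* Only completed sub-buffers are handed out, so every chunk the
	 * analyzer receives is self-contained and immediately parsable. */
	while (get_full_subbuf(ts, &subbuf, &len)) {
		if (write(sock, subbuf, len) < 0)
			return -1;	/* a lost sub-buffer, not a corrupt trace */
		put_subbuf(ts, subbuf);
	}
	return 0;
}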

Making sure events don't cross sub-buffer boundaries simplifies a lot of things,
starting with dealing with "overwritten" sub-buffers in flight recorder mode.
Trying to deal with a partially overwritten event is just insane.

>
> Their benefit is known synchronization points into the stream, you can
> parse each sub-buffer independently, but you can always break up a
> continuous stream into smaller parts or use a transport that includes
> index points or whatever.

I understand that you could perform amortized synchronization without
sub-buffers. However, I don't see how flight recorder mode, efficient seeking in
multi-GB traces (without reading the whole event stream), and live streaming can
be achieved.

> Their down side is that you can never have individual events larger than
> the sub-buffer,

True. But with configurable sub-buffer size (can be from 4kB to many MB), I
don't see the problem.

> you need to be aware of the sub-buffer when reserving
> space

Only the ring buffer needs to be aware of that. It returns an error if the event
is larger than the sub-buffer size.
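
Roughly, the check in the reserve path amounts to something like this (a
sketch with made-up names, not the actual ring buffer library code): reject
anything larger than a sub-buffer, and pad to the next sub-buffer boundary
when an event would otherwise straddle one.

/* Sketch only; rb_config and check_reserve() are illustrative names. */
#include <errno.h>
#include <stddef.h>

struct rb_config {
	size_t subbuf_size;		/* power of two, e.g. 4kB .. many MB */
};

static int check_reserve(const struct rb_config *cfg, size_t write_offset,
			 size_t event_size, size_t *padding)
{
	size_t remaining = cfg->subbuf_size -
			   (write_offset & (cfg->subbuf_size - 1));

	if (event_size > cfg->subbuf_size)
		return -EINVAL;		/* can never fit: reservation fails */

	/* Pad out the current sub-buffer if the event would cross into the
	 * next one, so no event ever spans a sub-buffer boundary. */
	*padding = (event_size > remaining) ? remaining : 0;
	return 0;
}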

Thanks,

Mathieu

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com
From: Mathieu Desnoyers on
* Peter Zijlstra (peterz(a)infradead.org) wrote:
> On Tue, 2010-08-03 at 14:25 -0400, Mathieu Desnoyers wrote:
> > * Peter Zijlstra (peterz(a)infradead.org) wrote:
> > > On Thu, 2010-07-15 at 12:26 -0400, Mathieu Desnoyers wrote:
> > >
> > > > I was more thinking along the lines of making sure a ring buffer has the proper
> > > > support for your use-case. It shares a lot of requirements with a standard ring
> > > > buffer:
> > > >
> > > > - Need to be lock-less
> > > > - Need to reserve space, write data in a buffer
> > > >
> > > > By configuring a ring buffer with 4k sub-buffer size (that's configurable
> > > > dynamically),
> > >
> > > FWIW I really utterly detest the whole concept of sub-buffers.
> >
> > This reluctance to split a buffer into sub-buffers might help explain
> > the poor performance experienced with the Perf ring buffer.
>
> That's just unsubstantiated FUD.

Extracted from:
http://lkml.org/lkml/2010/7/9/368

(executive summary)

* Throughput

* Flight recorder mode

Ring Buffer Library:  83 ns/entry (512kB sub-buffers, no reader)
                      89 ns/entry (512kB sub-buffers: read 0.3M entries/s)

Ftrace Ring Buffer:  103 ns/entry (no reader)
                     187 ns/entry (read by event: read 0.4M entries/s)

Perf record:         (flight recorder mode unavailable)


* Discard mode

Ring Buffer Library:  96 ns/entry discarded
                     257 ns/entry written (read: 2.8M entries/s)

Perf Ring Buffer:    423 ns/entry written (read: 2.3M entries/s)
(Note that this number is based on the perf event approximation output (using a
24 bytes/entry estimation) rather than on the benchmark module count, due to the
latter's inaccuracy, which is caused by perf not letting the benchmark module
know about discarded events.)

It is really hard to get a clear picture of the data write overhead with perf,
because you _need_ to consume data. Making perf support flight recorder mode
would really help getting benchmarks that are easier to compare.

>
> > These
> > "sub-buffers" are really nothing new: these are called "periods" in the audio
> > world. They help lowering the ring buffer performance overhead because:
> >
> > 1) They allow writing into the ring buffer without SMP-safe synchronization
> > primitives and memory barriers for each record. Synchronization is only needed
> > across sub-buffer boundaries, which amortizes the cost over a large number of
> > events.
>
> The only SMP barrier we (should) have is when we update the user visible
> head pointer. The buffer code itself uses local{,64}_t for all other
> atomic ops.
>
> If you want to amortize that barrier, simply hold off the head update
> for a while, no need to introduce sub-buffers.

I understand your point about amortized synchronization. However, I still don't
see how you can achieve flight recorder mode, efficient seeking in multi-GB
traces without reading the whole event stream, and live streaming without
sub-buffers (and, ideally, without too many headaches involved). ;)
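
For reference, the amortization you describe might look roughly like the
sketch below (the structure and field names are mine and purely illustrative,
not perf's actual output code, and wrap-around handling is omitted): the
writer keeps a private position and only publishes the user-visible head, with
its barrier, once per batch.

/* Kernel-context sketch only; 'struct out_buf' is illustrative. */
#include <linux/string.h>
#include <linux/types.h>

#define PUBLISH_BATCH	4096		/* publish the head once per 4kB written */

struct out_buf {
	char		*data;		/* data area, size == mask + 1 */
	unsigned long	mask;		/* size - 1, size a power of two */
	unsigned long	local_head;	/* writer-private write position */
	unsigned long	published_head;	/* last position made visible */
	u64		*user_head;	/* user-visible head (e.g. data_head) */
};

static void write_event(struct out_buf *b, const void *ev, size_t len)
{
	memcpy(b->data + (b->local_head & b->mask), ev, len);
	b->local_head += len;

	if (b->local_head - b->published_head >= PUBLISH_BATCH) {
		smp_wmb();			/* order the data before the head update */
		*b->user_head = b->local_head;	/* one user-visible store per batch */
		b->published_head = b->local_head;
	}
}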

>
> > 2) They are much more splice (and, in general, page-exchange) friendly, because
> > records written after a synchronization point start at the beginning of a page.
> > This removes the need for extra copies.
>
> This just doesn't make any sense at all, I could splice full pages just
> fine, splice keeps page order so these synchronization points aren't
> critical in any way.

If you need to read non-filled pages, then you need to splice pages piece-wise.
This does not fit well with flight recorder tracing, for which the solution
Steven and I have found is to atomically exchange pages (for Ftrace) or
sub-buffers (for the generic ring buffer library) between the reader and writer.
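
The hand-off itself can be as simple as the following sketch (illustrative
only, not the actual Ftrace or ring buffer library code, which juggle page
lists and flags): the reader gives the writer a spare sub-buffer and
atomically takes ownership of the one the writer has finished with, so nothing
is ever copied.

/* Kernel-context sketch of the atomic exchange at the core of the idea. */
struct rb_reader_slot {
	void *subbuf;		/* sub-buffer currently owned by the writer side */
};

static void *reader_swap_subbuf(struct rb_reader_slot *slot, void *spare)
{
	/*
	 * xchg() gives the writer the spare sub-buffer and returns the old
	 * one to the reader in a single atomic step, so the writer never
	 * observes a moment without a sub-buffer to write into.
	 */
	return xchg(&slot->subbuf, spare);
}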

>
> The only problem I have with splice atm is that we don't have a buffer
> interface without mmap() and we cannot splice pages out from under
> mmap() on all architectures in a sane manner.

The problem Perf has is probably more with flight recorder (overwrite) tracing
support than with splice() per se; on this point you are right.

>
> > So I have to ask: do you detest the sub-buffer concept only because you are tied
> > to the current Perf userspace ABI which cannot support this without an ABI
> > change ?
>
> No because I don't see the point.

OK, good to know you are open to ABI changes if I present convincing arguments.

>
> > I'm trying to help out here, but it does not make the task easy if we have both
> > hands tied behind our backs because we have to keep backward ABI compatibility for a
> > tool (perf) forever, even considering its sources are shipped with the kernel.
>
> Dude, it's a published user<->kernel ABI; also, you're not saying why you
> would want to break it. In your other email you allude to things like
> flight recorder mode, that could be done with the current set-up, no
> need to break the ABI at all. All you need to do is track the tail
> pointer and publish it.

How do you plan to read the data, without corruption, while the writer is
concurrently overwriting it?

>
> > Nope. I'm thinking that we can use a buffer just to save the stack as we call
> > functions and return, e.g.
>
> We don't have a callback on function entry, and I'm not going to use
> mcount for that, that's simply insane.

OK, now I get a clearer picture of what Frederic is trying to do.

>
> > call X -> reserve space to save "X" and arguments.
> > call Y -> same for Y.
> > call Z -> same for Z.
> > return -> discard event for Z.
> > return -> discard event for Y.
> >
> > if we grab the buffer content at that point, then we have X and its arguments,
> > which is the function currently executed. That would require the ability to
> > uncommit and unreserve an event, which is not a problem as long as we have not
> > committed a full sub-buffer.
>
> Again, I'm not really seeing the point of using sub-buffers at all.

This part of the email is unrelated to sub-buffers.

>
> Also, what happens when we write an event after Y? Then the discard must
> fail or turn Y into a NOP, leaving a hole in the buffer.

Given that this buffer is simply used to dump the stack unwind result, I think
my scenario above was simply misguided.

>
> > I thought that this buffer was chasing the function entry/exits rather than
> > doing a stack unwind, but I might be wrong. Perhaps Frederic could tell us more
> > about his use-case ?
>
> No, it's a pure stack unwind from NMI context. When we get an event (PMI,
> tracepoint, whatever) we write out the event; if the consumer asked for a
> stacktrace with each event, we unwind the stack for him.

So why the copy? Frederic seems to put the stack unwind in a special temporary
buffer. Why is it not saved directly into the trace buffers?

> > > Additionally, if you have multiple consumers you can simply copy the
> > > stacktrace again, avoiding the whole pointer chase exercise. While you
> > > could conceivably copy from one ringbuffer into another that will result
> > > in very nasty serialization issues.
> >
> > Assuming Frederic is saving information to this stack-like ring buffer at each
> > function entry and discarding at each function return, then we don't have the
> > pointer chase.
> >
> > What I am proposing does not even involve a copy: when we want to take a
> > snapshot, we just have to force a sub-buffer switch on the ring buffer. The
> > "returns" happening at the beginning of the next (empty) sub-buffer would
> > clearly fail to discard records (expecting non-existing entry records). We would
> > then have to save a small record saying that a function return occurred. The
> > current stack frame at the end of the next sub-buffer could be deduced from the
> > complete collection of stack frame samples.
>
> And suppose the stack-trace was all of 16 entries (not uncommon for a
> kernel stack), then you waste a whole page for 128 bytes (assuming your
> sub-buffer is page sized). I'll take the memcopy, thank you.

Well, now that I understand what you are trying to achieve, I retract my
proposal of using a stack-like ring buffer for this. I think that the stack dump
should simply be saved directly to the ring buffer, without a copy. The
dump_stack() functions might have to be extended so they don't just save text
dumbly, but can also be used to save events into the trace in binary format,
perhaps with the continuation cookie Linus was proposing.
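
A very rough sketch of what that could look like, assuming a reserve/commit
style API (ring_buffer_reserve() and ring_buffer_commit() below are
hypothetical stand-ins, not existing kernel functions); it also makes the
trade-off visible: you have to reserve for the maximum depth up front and
either shrink the reservation afterwards or waste the tail.

/* Sketch only: the reserve/commit API below is hypothetical. */
#include <linux/stacktrace.h>
#include <linux/types.h>

struct ring_buffer;
void *ring_buffer_reserve(struct ring_buffer *rb, size_t len);
void ring_buffer_commit(struct ring_buffer *rb, void *slot);

#define MAX_STACK_DEPTH	64

static void trace_stack_inline(struct ring_buffer *rb)
{
	struct stack_trace trace = {
		.max_entries	= MAX_STACK_DEPTH,
		.skip		= 2,		/* skip the tracing internals */
	};
	unsigned long *slot;

	slot = ring_buffer_reserve(rb, MAX_STACK_DEPTH * sizeof(*slot));
	if (!slot)
		return;

	trace.entries = slot;
	save_stack_trace(&trace);	/* unwind straight into the reserved slot */

	/*
	 * trace.nr_entries is usually much smaller than MAX_STACK_DEPTH;
	 * the remainder is wasted unless the reservation can be truncated
	 * afterwards.
	 */
	ring_buffer_commit(rb, slot);
}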

Thanks,

Mathieu

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com
From: Peter Zijlstra on
On Wed, 2010-08-04 at 10:45 -0400, Mathieu Desnoyers wrote:

> How do you plan to read the data, without corruption, while the writer is
> concurrently overwriting it?

I don't consider reading while writing (in overwrite mode) a valid case.

If you want to use overwrite, stop the writer before reading it.

> I think that the stack dump
> should simply be saved directly to the ring buffer, without a copy. The
> dump_stack() functions might have to be extended so they don't just save text
> dumbly, but can also be used to save events into the trace in binary format,
> perhaps with the continuation cookie Linus was proposing.

Because I don't want to support truncating reservations (that leads to large
NOPs for nested events), and when the event needs to go to multiple buffers you
can re-use the stack dump without having to do the unwind again.

The problem with the continuation thing Linus suggested is that it would
bloat the output threefold. A stack entry is a single u64. If you want to
wrap that in a continuation event you need a header (u64), a cookie
(u64) and the entry itself (u64).
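
Spelled out as illustrative layouts (not an existing format), the arithmetic
is:

#include <linux/types.h>

/* Illustrative layouts only. */
struct plain_stack_entry {
	u64 ip;			/*  8 bytes per stack frame             */
};

struct wrapped_stack_entry {
	u64 header;		/* event header                         */
	u64 cookie;		/* ID of the event being continued      */
	u64 ip;			/* the actual stack entry               */
};				/* 24 bytes per frame: a 3x blow-up     */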

Continuation events might make heaps of sense for larger data pieces,
but I don't see them being practical for such small pieces.