From: Peter Zijlstra on
On Fri, 2010-08-06 at 12:11 +0200, Peter Zijlstra wrote:
> > You need to read the whole trace to find these cookies (even if it is just once
> > at the beginning if you create an index).

Even if you want to index all sync points, you can quickly skip through
the file using the sync-distance, after which you'll have, on average,
only 1/2 avg-event-size to read before you find your next sync point.

So suppose you have a 1M sync-distance and an effective average event
size of 128 bytes; then for a 4G file you can find all sync points by
reading only ~262144 bytes (not accounting for the fact that the pagecache
will bring in full pages, which would result in something like 16M being
read in total or somesuch -- and which, again, assumes read-ahead isn't
going to play tricks on you).
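
Roughly, the indexing pass I have in mind looks like the sketch below (the
magic value, helper names and the raw 8-byte cookie layout are made up for
illustration, not the actual perf format):

/*
 * Rough sketch of the indexing pass (the magic value, names and the
 * assumption of a raw 8-byte cookie at the start of each sync event
 * are made up for illustration; this is not the actual perf format).
 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define SYNC_MAGIC    0xDEADBEEFCAFEF00DULL
#define SYNC_DISTANCE (1024 * 1024)        /* at most 1M between sync events */

/* Scan forward from 'pos' until the sync cookie is found. */
static long find_next_sync(const uint8_t *buf, size_t len, size_t pos)
{
        uint64_t magic;

        for (; pos + sizeof(magic) <= len; pos++) {
                memcpy(&magic, buf + pos, sizeof(magic));
                if (magic == SYNC_MAGIC)
                        return (long)pos;
        }
        return -1;
}

/* Index every sync point by hopping sync-distance at a time. */
static void index_sync_points(const uint8_t *buf, size_t len)
{
        size_t pos = 0;

        while (pos < len) {
                long sync = find_next_sync(buf, len, pos);

                if (sync < 0)
                        break;
                printf("sync point at offset %ld\n", sync);
                /* skip ahead; only a short scan is needed after each hop */
                pos = (size_t)sync + SYNC_DISTANCE;
        }
}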

From: Mathieu Desnoyers on
* Peter Zijlstra (peterz(a)infradead.org) wrote:
> On Fri, 2010-08-06 at 15:18 +0900, Masami Hiramatsu wrote:
> > Peter Zijlstra wrote:
> > > On Wed, 2010-08-04 at 10:45 -0400, Mathieu Desnoyers wrote:
> > >
> > >> How do you plan to read the data concurrently with the writer overwriting the
> > >> data while you are reading it without corruption ?
> > >
> > > I don't consider reading while writing (in overwrite mode) a valid case.
> > >
> > > If you want to use overwrite, stop the writer before reading it.
> >
> > For example, would you always want to read the system audit log only
> > after stopping auditing?
> >
> > NO. This is a most important requirement for tracers, especially for
> > system admins (they're the most important users of Linux), who need to
> > check system health and catch problems.
> >
> > For performance measurement and hotspot analysis, one-shot tracing
> > is enough, but that is just for developers. In real-world computing,
> > Linux is just an OS; users want to run their systems, middleware and
> > applications without trouble. But when they do hit a problem, they
> > want to shoot it down ASAP.
> > The flight recorder mode is mainly for those users.
>
> You cannot over-write and consistently read the buffer, that's plain
> impossible.

If you think it is impossible, then you should really go have a look at the
generic ring buffer library, at LTTng and at Ftrace. It looks like we're all
doing the "impossible".

> With sub-buffers you can swivel a sub-buffer and
> consistently read that, but there is no guarantee the next sub-buffer
> you steal was indeed adjacent to the previous buffer you stole as that
> might have gotten over-written by the active writer while you were
> stealing the previous one.

We don't care about taking the next adjacent sub-buffer. We care about always
grabbing the oldest sub-buffer that has been written, up to the most current
one.

>
> If you want to snapshot buffers, do that, simply swivel the whole trace
> buffer, and continue tracing in a new one, then consume the old trace in
> a consistent manner.

So you need to allocate many trace buffers to accomplish the same thing, plus
an extra layer on top that does this buffer exchange; I don't call that
"simple". Note that only two trace buffers might not be enough if you have
repeated failures in a short time window; the consumer might take some time to
extract all of them.

Compared to that, the sub-buffer scheme only needs a single buffer with 2 (or
more) sub-buffers, plus an extra sub-buffer owned by the reader that we exchange
with the sub-buffer we want to grab for reading. The reader always grabs the
sub-buffer containing the oldest data. The number of sub-buffers used is the
limit on the number of snapshots that can be taken in a relatively short time
window (the time it takes the reader to consume the data).
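
To illustrate the exchange (a very rough sketch with hypothetical structures;
all writer synchronization is omitted and this is not the actual LTTng or
generic ring buffer code):

#include <stddef.h>

#define NR_SUBBUF 4

struct subbuf {
        char   *data;
        size_t  len;                    /* bytes of valid data */
};

struct ring {
        struct subbuf *slot[NR_SUBBUF]; /* sub-buffers the writer cycles through */
        unsigned int   oldest;          /* slot currently holding the oldest data */
};

/*
 * The reader owns one spare sub-buffer.  To read, it exchanges that spare
 * with the slot holding the oldest data; the writer keeps cycling through
 * the remaining slots and is never blocked by the reader.
 */
static struct subbuf *reader_get_oldest(struct ring *r, struct subbuf *spare)
{
        unsigned int idx = r->oldest;
        struct subbuf *grabbed = r->slot[idx];

        r->slot[idx] = spare;                   /* writer now writes into the spare */
        r->oldest = (idx + 1) % NR_SUBBUF;      /* next-oldest becomes the oldest */
        return grabbed;                         /* reader consumes this at leisure */
}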

>
> I really see no value in being able to read unrelated bits and pieces of
> a buffer.

Within a sub-buffer, events are adjacent, and between sub-buffers, events are
guaranteed to be in order (oldest to newest). It is only when buffers are
relatively small compared to the data throughput that the writer can overwrite
information that would have been useful for a snapshot (e.g. overwriting
relatively recent information while the reader reads the oldest sub-buffer),
but in that case users simply have to tune their buffer size appropriately to
match the trace data throughput.
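
(Purely as an illustration, with made-up numbers: at a sustained 20 MB/s of
trace data, a 16 MiB per-CPU buffer holds less than a second of history, so a
reader that takes a second to drain the oldest sub-buffer risks the writer
catching up with it; doubling or quadrupling the buffer size restores the
margin.)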

>
> So no, I will _not_ support reading an over-write buffer while there is
> an active reader.

(I guess you mean active writer)

Here you argue that you don't need to support this feature at the ring buffer
level because you can have a group of ring buffers that do it instead.
How is your multiple-buffer scheme any simpler than sub-buffers ? Either you
have to allocate many of them up front, or, if you want to do it on demand, you
have to perform memory allocation in NMI context. I don't see either of these
two solutions as particularly appealing.

Thanks,

Mathieu

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com
From: Mathieu Desnoyers on
* Peter Zijlstra (peterz(a)infradead.org) wrote:
> On Thu, 2010-08-05 at 21:49 -0400, Mathieu Desnoyers wrote:
> > * Peter Zijlstra (peterz(a)infradead.org) wrote:
> > > On Wed, 2010-08-04 at 10:45 -0400, Mathieu Desnoyers wrote:
> > >
> > > > How do you plan to read the data concurrently with the writer overwriting the
> > > > data while you are reading it without corruption ?
> > >
> > > I don't consider reading while writing (in overwrite mode) a valid case.
> > >
> > > If you want to use overwrite, stop the writer before reading it.
> >
> > How inconvenient. It happens that the relatively large group of users I am
> > working for does care about this use-case. They cannot afford to stop tracing as
> > soon as they hit "one bug". This "bug" could be a simple odd scenario that they
> > want to snapshot, but in all cases they want tracing to continue.
>
> Snapshot is fine, just swivel the whole buffer.

There is a very important trade-off between the amount of information that can
be kept around in memory for a snapshot and the amount of system memory
reserved for buffers. The sub-buffer scheme is pretty good at that: the whole
reserved memory (except the extra reader-owned sub-buffer) is available to save
the flight recorder trace.

With the multiple-buffer scheme you propose, only one of the buffers can be used
to save data. This is very limiting, especially for embedded systems in telecom
switches that do not have that much memory: all the memory reserved for the
buffer that is currently inactive is simply wasted. It does not even allow the
user to gather a longer snapshot.
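
(A concrete example, with made-up numbers: reserve 16 MiB per CPU. Split into
seven 2 MiB sub-buffers plus the reader-owned one, roughly 14 MiB of flight
recorder history remains available at any time. With two alternating 8 MiB
buffers, at most 8 MiB of history ever exists, and the inactive half of the
memory sits idle.)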

Thanks,

Mathieu

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com
From: Mathieu Desnoyers on
* Peter Zijlstra (peterz(a)infradead.org) wrote:
> On Fri, 2010-08-06 at 12:11 +0200, Peter Zijlstra wrote:
> > > You need to read the whole trace to find these cookies (even if it is just once
> > > at the beginning if you create an index).
>
> Even if you want to index all sync points, you can quickly skip through
> the file using the sync-distance, after which you'll have, on average,
> only 1/2 avg-event-size to read before you find your next sync point.
>
> So suppose you have a 1M sync-distance and an effective average event
> size of 128 bytes; then for a 4G file you can find all sync points by
> reading only ~262144 bytes (not accounting for the fact that the pagecache
> will bring in full pages, which would result in something like 16M being
> read in total or somesuch -- and which, again, assumes read-ahead isn't
> going to play tricks on you).

How do you distinguish between sync events and random payload data ?

Mathieu


--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com
From: Mathieu Desnoyers on
* Peter Zijlstra (peterz(a)infradead.org) wrote:
> On Thu, 2010-08-05 at 21:42 -0400, Mathieu Desnoyers wrote:
> > * Peter Zijlstra (peterz(a)infradead.org) wrote:
> > > On Wed, 2010-08-04 at 10:06 -0400, Mathieu Desnoyers wrote:
[...]
> > > > A second major gain: having these sub-buffers lets the trace analyzer seek in
> > > > the trace very efficiently by allowing it to perform a binary search for time to
> > > > find the appropriate sub-buffer. It becomes immensely useful with large traces.
> > >
> > > You can add sync events with a specific magic cookie in. Once you find
> > > the cookie you can sync and start reading it reliably
> >
> > You need to read the whole trace to find these cookies (even if it is just once
> > at the beginning if you create an index).
>
> Depends on what you want to do, you can start reading at any point in
> the stream and be guaranteed to find a sync point within sync-distance
> +max-event-size.

At _any_ point in the stream ?

So if I take, let's say, a few kB of Perf ring buffer data and choose to
encode that as an event in another buffer (e.g. we're tracing part of the
network traffic), then we end up in a situation where the event payload will
contain your "so special" sync point data. Basically, you have no guarantee that
you won't mix up standard event data and your synchronization event headers.

Your sync point solution just kills all good encapsulation practices in one go.
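
To make the problem concrete (hypothetical cookie value; the point is only
that a raw scan cannot tell framing from payload):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define SYNC_MAGIC 0xDEADBEEFCAFEF00DULL        /* hypothetical cookie value */

int main(void)
{
        uint8_t payload[64] = { 0 };
        uint64_t copied_trace_word = SYNC_MAGIC; /* e.g. re-encoded buffer data */
        uint64_t word;
        size_t i;

        /* An event payload that embeds a chunk of another trace buffer. */
        memcpy(payload + 13, &copied_trace_word, sizeof(copied_trace_word));

        /* A raw cookie scan happily "finds" a sync point inside the payload. */
        for (i = 0; i + sizeof(word) <= sizeof(payload); i++) {
                memcpy(&word, payload + i, sizeof(word));
                if (word == SYNC_MAGIC)
                        printf("false sync point at payload offset %zu\n", i);
        }
        return 0;
}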

> > My experience with users has shown me
> > that the delay between stopping trace gathering and having the data shown to the
> > user is very important, because this is repeatedly done while debugging a
> > problem, and this is time the user is sitting in front of his screen, waiting.
>
> Yeah, because after having had to wait for 36h for the problem to
> trigger that extra minute really kills.
>
> All I can say is that in my experience brain throughput is the limiting
> factor in debugging. Not some ability to draw fancy pictures.

Here I have to bring up the fact that Linux kernel developers are not the only
tracer users.

Multi-GB traces can easily be generated within a few seconds or minutes on many
workloads, so we're not talking about many-hour traces here. But if we need to
read the whole trace before it can be shown, we're adding a significant delay
before the trace can be accessed.

In my experience, both brain and data-gathering throughput are limiting factors
in debugging. Drawing fancy pictures merely helps speed up the brain process
in some cases.


>
> > > -- the advantage
> > > is that sync events are very easy to have as an option and don't
> > > complicate the reserve path.
> >
> > Perf, on its reserve/commit fast paths:
> >
> > perf_output_begin: 543 bytes
> > (perf_output_get_handle is inlined)
> >
> > perf_output_put_handle: 201 bytes
> > perf_output_end: 77 bytes
> > calls perf_output_put_handle
> >
> > Total for perf: 821 bytes
> >
> > Generic Ring Buffer Library reserve/commit fast paths:
> >
> > Reserve: 511 bytes
> > Commit: 266 bytes
> > Total for Generic Ring Buffer: 777 bytes
> >
> > So the generic ring buffer is not only faster and supports sub-buffers (along
> > with all the nice features this brings); its reserve and commit hot paths
> > fit in fewer instructions: it is *less* complicated than Perf's.
>
> All I can say is that less code doesn't equal less complex (nor faster
> per-se).

Less code = less instruction cache overhead. I've also shown that the LTTng code
is at least twice as fast. In terms of complexity, it is not much more complex; I
also took the extra care of doing the formal proofs to make sure the
corner-cases were dealt with, which I reckon neither Steven nor yourself
have done.

> Nor have I spend all my time on writing the ring-buffer,
> there's more interesting things to do.

I must admit that I probably spent much more time working on the ring buffer
than you did. It looks like one's interest can only focus on so many areas at
once. So if you are not that interested in ring buffers, can you at least stop
opposing the people who care deeply about them ?

If we agree that we don't care about the same use-cases, there might be room for
many ring buffers in the kernel. It's just a shame that we have to multiply the
amount of code review. We should also note that this goes against Linus' request
for a shared and common ring buffer used by all tracers.


> And the last time I ran perf on perf, the buffer wasn't the thing that
> was taking most time.

Very interesting. I know the trace clock performance is terrible too, but let's
keep that for another discussion please.

>
> And unlike what you claim below, it most certainly can deal with events
> larger than a single page.

What I said below was: perf cannot write events larger than its buffer size. So
it already has to take that "test" branch for maximum event size. I said nothing
about page size in this context.

>
> > > If you worry about the cost of parsing the events, you can amortize that
> > > by things like keeping the offset of the first event in every page in
> > > the pageframe, or the offset of the next sync event or whatever scheme
> > > you want.
> >
> > Hrm ? AFAIK, the page-frame is an internal kernel-only data structure. That
> > won't be exported to user-space, so how is the parser supposed to see this
> > information exactly to help it speed up parsing ?
>
> It's about the kernel parsing the buffer to push the tail ahead of the
> reserve window, so that you have a reliable point to start reading the
> trace from -- or didn't you actually get the intent of that patch?

I got the intent of the patch; I just somehow missed that this paragraph
applied to the patch specifically.

Thanks,

Mathieu


--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com