x86_64 page fault NMI-safe [Kernel]

Prev: [PATCH] enable readback to get HPET working on ATI SB4x00, kernel 2.6.35_rc5
Next: input: Fix wrong dimensions check for synaptics

From: Mathieu Desnoyers on 3 Aug 2010 17:00

* Linus Torvalds (torvalds(a)linux-foundation.org) wrote:
> On Tue, Aug 3, 2010 at 12:45 PM, Mathieu Desnoyers
> <mathieu.desnoyers(a)efficios.com> wrote:
> >
> > The real issue here, IMHO, is that Perf has tied gory ring buffer implementation
> > details to the userspace perf ABI, and there is now strong unwillingness from
> > Perf developers to break this ABI.
>
> The thing is - I think my outlined buffer fragmentation model would
> work fine with the perf ABI too. Exactly because there is no deep
> structure, just the same "stream of small events" both from a kernel
> and a user model standpoint. Sure, the stream would now contain a new
> event type, but that's trivial. It would still be _entirely_
> reasonable to have the actual data in the exact same ring buffer,
> including the whole mmap'ed area.

Yes, indeed. Your scheme (using a "cookie" to identify multiple related events,
each of them being the "continuation" of the previous event with the same
cookie) would work on top of basically all ring buffers implementations. We
already use something similar to follow socket buffers and block device buffers
across the kernel in LTTng.

>
> Of course, when user space actually parses it, user space would have
> to eventually defragment the event by allocating a new area and
> copying the fragments together in the right order, but that's pretty
> trivial to do. It certainly doesn't affect the current mmap'ed
> interface in the least.
>
> Now, whether the perf people feel they want that kind of
> functionality, I don't know. It's possible that they simply do not
> want to handle events that are complex enough that they would have
> arbitrary size.

I agree. Although I think the scheme you propose can sit on top of the ring
buffer and does not necessarily need to be at the bottom layer. The sub-buffer
disagreement Peter and I have is related to the ring buffer core.

Thanks,

Mathieu

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Mathieu Desnoyers on 3 Aug 2010 17:20

* Ingo Molnar (mingo(a)elte.hu) wrote:
>
> * Ingo Molnar <mingo(a)elte.hu> wrote:
>
> >
> > * Linus Torvalds <torvalds(a)linux-foundation.org> wrote:
> >
> > > On Tue, Aug 3, 2010 at 12:45 PM, Mathieu Desnoyers
> > > <mathieu.desnoyers(a)efficios.com> wrote:
> > > >
> > > > The real issue here, IMHO, is that Perf has tied gory ring buffer
> > > > implementation details to the userspace perf ABI, and there is now strong
> > > > unwillingness from Perf developers to break this ABI.
> >
> > (Wrong.)

I am glad to hear this. So should I understand that if we show that the current
perf ABI imposes significant design constraints and results in poor performance
and inability to support flight recorder mode (which is needed to unify the ring
buffers), we can deprecate the ABI ?

[...]

> > We may want to add things like a NOP event to pad out the end of page

Or simply write the page (or sub-buffer) size information in a page (or
sub-buffer) header. The gain here is that by doing so we don't have to reserve
an event ID for the NOP event, which adds one extra ID reserved in _each_ event
header. You might be tempted to say "oh, it's just a single value, who cares ?",
but with the amount of data we're moving, being able to represent the event
header on a very small amount of bits really makes a difference. Bloat creeps in
one single bit at a time until we start not caring about adding whole integers,
and when we're there the game was over long ago: performance suffer deeply.

The huge size of the perf event headers is another factor that might explain its
poor performance by the way.

[...]

> [ The control structure of the mmap area is there for performance/wakeup
> optimizations

I am doubtful about an "optimization" that affects what should be a slow path:
user-space wakeup for delivering a multiple events at once. Have you checked if
this leads to actual noticeable performance increase at all ?

> (and to allow the kernel to lose information on producer
> overload, while still giving user-space an idea that we lost data and how
> much)

This can be performed with a standard system call rather than playing games
with a shared pages into which both the kernel and user-space write. The
advantage is that by letting user-space calling the kernel (rather than just
writing "I'm done" in that page by updating the consumer value), we can let the
kernel perform tasks that might enable us to implement flight recorder mode all
within the same ring buffer implementation.

> - it does not affect semantics and does not limit us. ]

Well, so far, the main limitation I can see is that it does not allow us to do
flight recorder tracing (a.k.a. overwrite mode).

>
> So there's no design limitation - Peter simply prefers one possible solution
> over another and outlined his reasons - we should hash that out based on the
> technical arguments.

Another argument I've seen from Peter is that he prefers the perf
kernel-userspace interaction to happen through this shared page to diminish the
number of traced events generated by perf activity. But I find this argument
unconvincing, because it really only applies to system call tracing: the rest of
tracing will be affected by the perf user-space process activity. So we might as
well just bite the bullet and accept that the trace is "polluted" by user-space
perf events. It _is_ using up CPU time anyway, so I think it's actually _better_
to know about it, rather than to try to hide the tracer activity. If one really
wants to filter out the tracer activity, it can be done at post-processing
without problem. But at least the information is there.

Thanks,

Mathieu

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Peter Zijlstra on 4 Aug 2010 02:30

On Tue, 2010-08-03 at 11:56 -0700, Linus Torvalds wrote:
> On Tue, Aug 3, 2010 at 10:18 AM, Peter Zijlstra <peterz(a)infradead.org> wrote:
> >
> > FWIW I really utterly detest the whole concept of sub-buffers.
>
> I'm not quite sure why. Is it something fundamental, or just an
> implementation issue?

The sub-buffer thing that both ftrace and lttng have is creating a large
buffer from a lot of small buffers, I simply don't see the point of
doing that. It adds complexity and limitations for very little gain.

Their benefit is known synchronization points into the stream, you can
parse each sub-buffer independently, but you can always break up a
continuous stream into smaller parts or use a transport that includes
index points or whatever.

Their down side is that you can never have individual events larger than
the sub-buffer, you need to be aware of the sub-buffer when reserving
space etc..

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Dave Chinner on 4 Aug 2010 02:50

On Tue, Aug 03, 2010 at 11:56:11AM -0700, Linus Torvalds wrote:
> On Tue, Aug 3, 2010 at 10:18 AM, Peter Zijlstra <peterz(a)infradead.org> wrote:
> >
> > FWIW I really utterly detest the whole concept of sub-buffers.
>
> I'm not quite sure why. Is it something fundamental, or just an
> implementation issue?
>
> One thing that I think could easily make sense in a _lot_ of buffering
> areas is the notion of a "continuation" buffer. We know we have cases
> where we want to attach a lot of data to one particular event, but the
> buffering itself is inevitably always going to have some limits on
> atomicity etc. And quite often, the event that _generates_ the data is
> not necessarily going to have all that data in one contiguous region,
> and doing a scatter-gather memcpy to get it that way is not good
> either.
>
> At the same time, I do _not_ believe that the kernel ring-buffer code
> should handle pointers to sub-buffers etc, or worry about iovec-like
> arrays of smaller ranges. So if _that_ is what you mean by "concept of
> sub-buffers", then I agree with you.
>
> But what I do think might make a lot of sense is to allow buffer
> fragments, and just teach user space to do de-fragmentation. Where it
> would be important that the de-fragmentation really is all in user
> space, and not really ever visible to the ring-buffer implementation
> itself (and there would not, for example, be any guarantees that the
> fragments would be contiguous - there could be other events in the
> buffer in between fragments). Maybe we could even say that fragments
> might be across different CPU ring-buffers, and user-space needs to
> sort it out if it wants to (where "sort it out" literally would mean
> having to sort and re-attach them in the right order, since there
> wouldn't be any ordering between them).
>
> From a kernel perspective, the only thing you need for fragment
> handling would be to have a buffer entry that just says "I'm fragment
> number X of event ID Y". Nothing more. Everything else would be up to
> the parser in user space to work out.

Heh. For a moment there I thought you were describing the the way
XFS writes transactions into it's log. Replace "CPU ring-buffers"
with "in-core log buffers", "userspace parsing" with "log recovery"
and "event ID" with "transaction ID", and the concept you describe
is eerily similar. That includes the fact that transactions are not
contiguous in the log, can interleave fragments between concurrent
transaction commits and they can span multiple log buffers, too. It
works pretty well for scaling concurrent writers....

I'll get back in my box now ;)

Cheers,

Dave.
--
Dave Chinner
david(a)fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Peter Zijlstra on 4 Aug 2010 02:50

On Tue, 2010-08-03 at 14:25 -0400, Mathieu Desnoyers wrote:
> * Peter Zijlstra (peterz(a)infradead.org) wrote:
> > On Thu, 2010-07-15 at 12:26 -0400, Mathieu Desnoyers wrote:
> >
> > > I was more thinking along the lines of making sure a ring buffer has the proper
> > > support for your use-case. It shares a lot of requirements with a standard ring
> > > buffer:
> > >
> > > - Need to be lock-less
> > > - Need to reserve space, write data in a buffer
> > >
> > > By configuring a ring buffer with 4k sub-buffer size (that's configurable
> > > dynamically),
> >
> > FWIW I really utterly detest the whole concept of sub-buffers.
>
> This reluctance against splitting a buffer into sub-buffers might contribute to
> explain the poor performance experienced with the Perf ring buffer.

That's just unsubstantiated FUD.

> These
> "sub-buffers" are really nothing new: these are called "periods" in the audio
> world. They help lowering the ring buffer performance overhead because:
>
> 1) They allow writing into the ring buffer without SMP-safe synchronization
> primitives and memory barriers for each record. Synchronization is only needed
> across sub-buffer boundaries, which amortizes the cost over a large number of
> events.

The only SMP barrier we (should) have is when we update the user visible
head pointer. The buffer code itself uses local{,64}_t for all other
atomic ops.

If you want to amortize that barrier, simply hold off the head update
for a while, no need to introduce sub-buffers.

> 2) They are much more splice (and, in general, page-exchange) friendly, because
> records written after a synchronization point start at the beginning of a page.
> This removes the need for extra copies.

This just doesn't make any sense at all, I could splice full pages just
fine, splice keeps page order so these synchronization points aren't
critical in any way.

The only problem I have with splice atm is that we don't have a buffer
interface without mmap() and we cannot splice pages out from under
mmap() on all architectures in a sane manner.

> So I have to ask: do you detest the sub-buffer concept only because you are tied
> to the current Perf userspace ABI which cannot support this without an ABI
> change ?

No because I don't see the point.

> I'm trying to help out here, but it does not make the task easy if we have both
> hands tied in our back because we have to keep backward ABI compatibility for a
> tool (perf) forever, even considering its sources are shipped with the kernel.

Dude, its a published user<->kernel ABI, also you're not saying why you
would want to break it. In your other email you allude to things like
flight recorder mode, that could be done with the current set-up, no
need to break the ABI at all. All you need to do is track the tail
pointer and publish it.

> Nope. I'm thinking that we can use a buffer just to save the stack as we call
> functions and return, e.g.

We don't have a callback on function entry, and I'm not going to use
mcount for that, that's simply insane.

> call X -> reserve space to save "X" and arguments.
> call Y -> same for Y.
> call Z -> same for Z.
> return -> discard event for Z.
> return -> discard event for Y.
>
> if we grab the buffer content at that point, then we have X and its arguments,
> which is the function currently executed. That would require the ability to
> uncommit and unreserve an event, which is not a problem as long as we have not
> committed a full sub-buffer.

Again, I'm not really seeing the point of using sub-buffers at all.

Also, what happens when we write an event after Y? Then the discard must
fail or turn Y into a NOP, leaving a hole in the buffer.

> I thought that this buffer was chasing the function entry/exits rather than
> doing a stack unwind, but I might be wrong. Perhaps Frederic could tell us more
> about his use-case ?

No, its a pure stack unwind from NMI context. When we get an event (PMI,
tracepoint, whatever) we write out event, if the consumer asked for a
stacktrace with each event, we unwind the stack for him.

> > Additionally, if you have multiple consumers you can simply copy the
> > stacktrace again, avoiding the whole pointer chase exercise. While you
> > could conceivably copy from one ringbuffer into another that will result
> > in very nasty serialization issues.
>
> Assuming Frederic is saving information to this stack-like ring buffer at each
> function entry and discarding at each function return, then we don't have the
> pointer chase.
>
> What I am proposing does not even involve a copy: when we want to take a
> snapshot, we just have to force a sub-buffer switch on the ring buffer. The
> "returns" happening at the beginning of the next (empty) sub-buffer would
> clearly fail to discard records (expecting non-existing entry records). We would
> then have to save a small record saying that a function return occurred. The
> current stack frame at the end of the next sub-buffer could be deduced from the
> complete collection of stack frame samples.

And suppose the stack-trace was all of 16 entries (not uncommon for a
kernel stack), then you waste a whole page for 128 bytes (assuming your
sub-buffer is page sized). I'll take the memcopy, thank you.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

First | Prev | Next | Last
Pages: 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
Prev: [PATCH] enable readback to get HPET working on ATI SB4x00, kernel 2.6.35_rc5
Next: input: Fix wrong dimensions check for synaptics