From: Linus Torvalds on
On Sun, Jul 18, 2010 at 11:04 AM, Avi Kivity <avi(a)redhat.com> wrote:
>
> Too simple - an MCE will switch to its own stack, failing the test. Now
> that we have correctable MCEs, that's not a good idea.

Ahh, true. And I think we do DEBUG traps with IST too.

So we do need the explicit flag over the region. Too bad. I was hoping
to handle the nested case without having to set up the percpu segment
(that whole conditional swapgs thing, which is extra painful in NMI).

And at that point, if you require the separate flag anyway, the %rsp
range test is equivalent to the %rip range test.
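
Something like this, as a rough C model (all names made up; the real
thing is x86-64 entry assembly, and this glosses over the window at the
very end that the assembly has to close):

#include <stdbool.h>

static bool nmi_executing;	/* per-CPU in the real thing */
static bool nmi_pending;	/* a nested NMI arrived meanwhile */

static void handle_nmi(void)
{
	/* the actual NMI work */
}

static void nmi_entry(void)
{
	if (nmi_executing) {
		/* Nested NMI: flag it and return immediately. */
		nmi_pending = true;
		return;
	}
	nmi_executing = true;
	do {
		nmi_pending = false;
		handle_nmi();
		/* Replay any NMI that nested while we were running. */
	} while (nmi_pending);
	nmi_executing = false;
}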

Linus
From: Steven Rostedt on
On Sun, 2010-07-18 at 11:17 -0700, Linus Torvalds wrote:

> Oh, and the non-nested case obviously needs all the regular "make the
> kernel state look right" code. Like the swapgs stuff etc if required.
> My example code was really meant to just document the nesting
> handling, not the existing stuff we already need to do with
> save_paranoid etc.
>
> And I really think it should work, but I'd again like to stress that
> it's just a RFD code sequence with no testing what-so-ever etc.
>

Are you sure you don't want to use Mathieu's 2/2 patch? We are fixing
the x86 problem that iret re-enables NMIs, and you don't want to touch
anything but the NMI code. But it may be saner to just fix the places
that call iret. We can perhaps encapsulate those into a single macro
that we can get right and that will be correct everywhere it is used.
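
Not Mathieu's actual patch, but as a toy C model of what such a macro
would have to do (all names hypothetical; the real thing would be an
assembly macro):

#include <stdbool.h>

/*
 * Toy model only: every IRET site goes through one wrapper that avoids
 * re-enabling NMIs when we are inside an NMI handler.
 */
static bool in_nmi_context;		/* per-CPU in the real thing */

extern void real_iret(void);		/* the plain IRET path */
extern void iret_without_unmasking_nmis(void);	/* e.g. popf + far return */

static void safe_iret(void)
{
	if (in_nmi_context)
		iret_without_unmasking_nmis();
	else
		real_iret();
}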

The ugliest part of Mathieu's code is dealing with paravirt, but
paravirt is ugly to begin with.

Doing this prevents nested NMIs, as well as all the unknowns that come
with dealing with nested NMIs. Whereas handling all the iret sites
should be straightforward, although a bit more intrusive than we would
like.

Just saying,

-- Steve


From: Avi Kivity on
On 07/18/2010 09:22 PM, Linus Torvalds wrote:
> On Sun, Jul 18, 2010 at 11:04 AM, Avi Kivity<avi(a)redhat.com> wrote:
>
>> Too simple - an MCE will switch to its own stack, failing the test. Now
>> that we have correctable MCEs, that's not a good idea.
>>
> Ahh, true. And I think we do DEBUG traps with IST too.
>
> So we do need the explicit flag over the region. Too bad. I was hoping
> to handle the nested case without having to set up the percpu segment
> (that whole conditional swapgs thing, which is extra painful in NMI).
>

Well, we have to do that anyway for the non-nested case. So we just do
it before checking whether we're nested or not, and undo it on the popf;
retf path.

--
error compiling committee.c: too many arguments to function

From: Peter Zijlstra on
On Thu, 2010-07-15 at 12:26 -0400, Mathieu Desnoyers wrote:

> I was more thinking along the lines of making sure a ring buffer has the proper
> support for your use-case. It shares a lot of requirements with a standard ring
> buffer:
>
> - Need to be lock-less
> - Need to reserve space, write data in a buffer
>
> By configuring a ring buffer with 4k sub-buffer size (that's configurable
> dynamically),

FWIW I really utterly detest the whole concept of sub-buffers.

> all we need to add is the ability to squash a previously saved
> record from the buffer. I am confident we can provide a clean API for this that
> would allow discard of previously committed entry as long as we are still on the
> same non-fully-committed sub-buffer. This fits your use-case exactly, so that's
> fine.

squash? truncate you mean? So we can allocate/reserve the largest
possible event size and write the actual event and then truncate to the
actually used size?
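
Something like this, with made-up API names just to make the pattern
concrete:

#include <stddef.h>

struct rb;
struct rb_event;
extern struct rb_event *rb_reserve(struct rb *rb, size_t max);
extern void *rb_event_data(struct rb_event *ev);
extern void rb_truncate(struct rb_event *ev, size_t used);
extern void rb_commit(struct rb_event *ev);
extern size_t unwind_stack(void *dst, size_t max);

static void record_stacktrace(struct rb *rb)
{
	/* Reserve the worst case, then shrink to what was written. */
	struct rb_event *ev = rb_reserve(rb, 16 * 1024);
	size_t used;

	if (!ev)
		return;
	used = unwind_stack(rb_event_data(ev), 16 * 1024);
	rb_truncate(ev, used);
	rb_commit(ev);
}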

I really dislike how that will end up with huge holes in the buffer when
you get nested events.

Also, I think you're forgetting that doing the stack unwind is a very
costly pointer chase; adding a simple linear copy really doesn't seem
like a problem.

Additionally, if you have multiple consumers you can simply copy the
stacktrace again, avoiding the whole pointer chase exercise. While you
could conceivably copy from one ringbuffer into another, that will
result in very nasty serialization issues.

> You could have one 4k ring buffer per cpu per execution context.

Why?

> I wonder if
> each Linux architecture has support for separate thread vs softirq vs irq vs
> nmi stacks?

Why would that be relevant? We can have NMI inside IRQ inside soft-IRQ
inside task context in general (dismissing the nested IRQ mess). You
don't need to have a separate stack for each context in order to nest
them.

> Even then, given you have only one stack for all shared irqs, you
> need something that is concurrency-aware at the ring buffer level.

I'm failing to see your point.

> These small stack-like ring buffers could be used to save your temporary stack
> copy. When you really need to save it to a larger ring buffer along with a
> trace, then you just take a snapshot of the stack ring buffers.

OK, why? Your proposal includes the exact same extra copy but introduces
a ton of extra code to achieve the same effect, not a win.

> So you get to use one single ring buffer synchronization and memory allocation
> mechanism that everyone has reviewed. The advantage is that we would not be
> having this nmi race discussion in the first place: the generic ring buffer uses
> "get page" directly rather than relying on vmalloc, because these bugs have
> already been identified and dealt with years ago.

That's like saying don't use percpu_alloc() but open-code the thing
using kmalloc()/get_pages(). I really don't see any merit in that.


From: Mathieu Desnoyers on
* Peter Zijlstra (peterz(a)infradead.org) wrote:
> On Thu, 2010-07-15 at 12:26 -0400, Mathieu Desnoyers wrote:
>
> > I was more thinking along the lines of making sure a ring buffer has the proper
> > support for your use-case. It shares a lot of requirements with a standard ring
> > buffer:
> >
> > - Need to be lock-less
> > - Need to reserve space, write data in a buffer
> >
> > By configuring a ring buffer with 4k sub-buffer size (that's configurable
> > dynamically),
>
> FWIW I really utterly detest the whole concept of sub-buffers.

This reluctance against splitting a buffer into sub-buffers might help
explain the poor performance experienced with the Perf ring buffer. These
"sub-buffers" are really nothing new: they are called "periods" in the audio
world. They help lower the ring buffer performance overhead because:

1) They allow writing into the ring buffer without SMP-safe synchronization
primitives and memory barriers for each record. Synchronization is only needed
across sub-buffer boundaries, which amortizes the cost over a large number of
events (see the sketch below).

2) They are much more splice (and, in general, page-exchange) friendly, because
records written after a synchronization point start at the beginning of a page.
This removes the need for extra copies.
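
A simplified single-writer sketch of point 1 (hypothetical names; a
real lockless implementation would use cheap local atomic ops per
record and keep the full synchronization in the sub-buffer switch):

#include <stddef.h>

struct subbuf {
	char	*data;
	size_t	offset;		/* write position */
	size_t	size;
};

extern void *switch_subbuf(struct subbuf *sb, size_t len);	/* slow path */

static void *reserve(struct subbuf *sb, size_t len)
{
	size_t off = sb->offset;

	if (off + len <= sb->size) {
		/* Fast path, per record: plain arithmetic, no barriers. */
		sb->offset = off + len;
		return sb->data + off;
	}
	/* Once per sub-buffer: the expensive synchronization lives in
	 * switch_subbuf(), amortized over all records in the sub-buffer. */
	return switch_subbuf(sb, len);
}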

So I have to ask: do you detest the sub-buffer concept only because you are
tied to the current Perf userspace ABI, which cannot support this without an
ABI change?

I'm trying to help out here, but it does not make the task easy if we have
both hands tied behind our backs because we have to keep backward ABI
compatibility forever for a tool (perf) whose sources are shipped with the
kernel.

>
> > all we need to add is the ability to squash a previously saved
> > record from the buffer. I am confident we can provide a clean API for this that
> > would allow discarding a previously committed entry as long as we are still on the
> > same non-fully-committed sub-buffer. This fits your use-case exactly, so that's
> > fine.
>
> squash? truncate you mean? So we can allocate/reserve the largest
> possible event size and write the actual event and then truncate to the
> actually used size?

Nope. I'm thinking that we can use a buffer just to save the stack as we call
functions and return, e.g.

call X -> reserve space to save "X" and arguments.
call Y -> same for Y.
call Z -> same for Z.
return -> discard event for Z.
return -> discard event for Y.

If we grab the buffer content at that point, then we have X and its arguments,
which correspond to the function currently executing. That would require the
ability to uncommit and unreserve an event, which is not a problem as long as
we have not committed a full sub-buffer.
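
With hypothetical API names, the discipline would look like:

#include <stddef.h>

struct stack_rb;
extern void *stack_rb_reserve(struct stack_rb *rb, size_t len);
extern void stack_rb_discard_last(struct stack_rb *rb);	/* uncommit + unreserve */

struct entry_rec {
	unsigned long ip;
	unsigned long args[6];
};

static void on_function_entry(struct stack_rb *rb, unsigned long ip)
{
	struct entry_rec *rec = stack_rb_reserve(rb, sizeof(*rec));

	if (rec)
		rec->ip = ip;	/* plus the arguments in the real thing */
}

static void on_function_return(struct stack_rb *rb)
{
	/* Pop the matching entry record; only possible while the current
	 * sub-buffer has not been fully committed. */
	stack_rb_discard_last(rb);
}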

>
> I really dislike how that will end up with huge holes in the buffer when
> you get nested events.

This buffer only works like a stack, so I don't think your comment applies.

>
> Also, I think you're forgetting that doing the stack unwind is a very
> costly pointer chase, adding a simple linear copy really doesn't seem
> like a problem.

I thought this buffer was tracking function entries/exits rather than doing a
stack unwind, but I might be wrong. Perhaps Frederic could tell us more
about his use-case?

>
> Additionally, if you have multiple consumers you can simply copy the
> stacktrace again, avoiding the whole pointer chase exercise. While you
> could conceivably copy from one ringbuffer into another that will result
> in very nasty serialization issues.

Assuming Frederic is saving information to this stack-like ring buffer at each
function entry and discarding at each function return, then we don't have the
pointer chase.

What I am proposing does not even involve a copy: when we want to take a
snapshot, we just have to force a sub-buffer switch on the ring buffer. The
"returns" happening at the beginning of the next (empty) sub-buffer would
clearly fail to discard records (expecting non-existent entry records). We would
then have to save a small record saying that a function return occurred. The
current stack frame at the end of the next sub-buffer could be deduced from the
complete collection of stack frame samples.

>
> > You could have one 4k ring buffer per cpu per execution context.
>
> Why?

This seems to fit what Frederic said he needed: he uses one separate buffer
per cpu per execution context at the moment. But we could arguably save
all this stack-shaped information in per-cpu buffers.

>
> > I wonder if
> > each Linux architecture has support for separate thread vs softirq vs irq vs
> > nmi stacks?
>
> Why would that be relevant? We can have NMI inside IRQ inside soft-IRQ
> inside task context in general (dismissing the nested IRQ mess). You
> don't need to have a separate stack for each context in order to nest
> them.

I was asking this because Frederic seems to rely on having separate buffers per
cpu and per execution context to deal with concurrency (so not expecting
concurrency from interrupts or NMIs when writing into the softirq per-cpu stack
buffer).

>
> > Even then, given you have only one stack for all shared irqs, you
> > need something that is concurrency-aware at the ring buffer level.
>
> I'm failing to see your point.

My point is that we might need to expect concurrency from local execution
contexts (e.g. interrupts nested over other interrupt handlers) in the design of
this stack-like ring buffer. I'm not sure Frederic's approach of using one
buffer per execution context per cpu makes sense for all cases. The memory vs
context isolation trade-off seems rather odd if we have to create e.g. one
buffer per IRQ number.

>
> > These small stack-like ring buffers could be used to save your temporary stack
> > copy. When you really need to save it to a larger ring buffer along with a
> > trace, then you just take a snapshot of the stack ring buffers.
>
> OK, why? Your proposal includes the exact same extra copy but introduces
> a ton of extra code to effect the same, not a win.

Please refer to the "no extra copy" solution I explain above in this reply. I
did not want to go into too much detail regarding performance optimization
in the initial mail to Frederic, as these things can be done incrementally.
But given that you insist... :)

>
> > So you get to use one single ring buffer synchronization and memory allocation
> > mechanism that everyone has reviewed. The advantage is that we would not be
> > having this nmi race discussion in the first place: the generic ring buffer uses
> > "get page" directly rather than relying on vmalloc, because these bugs have
> > already been identified and dealt with years ago.
>
> That's like saying don't use percpu_alloc() but open-code the thing
> using kmalloc()/get_pages(). I really don't see any merit in that.

I'm not saying "open-code this". I'm saying "use a specialized library that does
this get_pages() allocation and execution context synchronization for you", so
we stop the code duplication madness.

Thanks,

Mathieu

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com