From: Masami Hiramatsu on
Peter Zijlstra wrote:
> On Fri, 2010-08-06 at 15:18 +0900, Masami Hiramatsu wrote:
>> Peter Zijlstra wrote:
>>> On Wed, 2010-08-04 at 10:45 -0400, Mathieu Desnoyers wrote:
>>>
>>>> How do you plan to read the data concurrently with the writer overwriting the
>>>> data while you are reading it without corruption?
>>> I don't consider reading while writing (in overwrite mode) a valid case.
>>>
>>> If you want to use overwrite, stop the writer before reading it.
>> For example, would you always want to have to stop auditing before you
>> could read the system audit log?
>>
>> No, that is one of the most important requirements for tracers,
>> especially for system admins (the most important users of Linux), so
>> they can check system health and catch problems.
>>
>> For performance measurement and hotspot analysis, one-shot tracing is
>> enough, but that is only for developers. In real-world computing, Linux
>> is just an OS; users want to run their systems, middleware, and
>> applications without trouble, and when they do hit trouble they want to
>> shoot it ASAP.
>> The flight-recorder mode is mainly for those users.
>
> You cannot over-write and consistently read the buffer, that's plain
> impossible. With sub-buffers you can swivel a sub-buffer and
> consistently read that, but there is no guarantee the next sub-buffer
> you steal was indeed adjacent to the previous buffer you stole as that
> might have gotten over-written by the active writer while you were
> stealing the previous one.

Right, we cannot ensure that. In overwrite mode the reader can lose some
data because the writer overwrites it (and in non-overwrite mode the
writer may instead fail to write new data into the buffer).
However, I don't think that makes this mode useless. If we can know when
(and where) data was lost, the rest of the data is still useful in some
cases.
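
Just to illustrate what I mean, here is a minimal userspace sketch; the
structures and names are made up for the example, not an existing API.
The idea is that the writer bumps a sequence number every time it fills a
sub-buffer, so a gap in the sequence tells the reader exactly where data
was lost and how much:

#include <stdio.h>
#include <stdint.h>

struct subbuf {
	uint64_t seq;	/* bumped by the writer for each sub-buffer it fills */
	/* ... event payload would follow ... */
};

/* Walk the sub-buffers the reader managed to grab and report any holes. */
static void report_holes(struct subbuf *subs, int nr)
{
	uint64_t expected = subs[0].seq;
	int i;

	for (i = 0; i < nr; i++) {
		if (subs[i].seq != expected)
			printf("lost %llu sub-buffer(s) before slot %d\n",
			       (unsigned long long)(subs[i].seq - expected), i);
		/* ... process the events in subs[i] ... */
		expected = subs[i].seq + 1;
	}
}

int main(void)
{
	/* The writer wrapped while we were reading: sub-buffers 3-4 lost. */
	struct subbuf grabbed[] = { { 1 }, { 2 }, { 5 }, { 6 } };

	report_holes(grabbed, 4);
	return 0;
}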

> If you want to snapshot buffers, do that, simply swivel the whole trace
> buffer, and continue tracing in a new one, then consume the old trace in
> a consistent manner.

Hmm, wouldn't that consume much more memory than the sub-buffer ring
buffer, since we would need to keep spare buffers around?
Or, if the spare buffer is allocated only when the reader opens the
buffer, won't that slow down the reader?

> I really see no value in being able to read unrelated bits and pieces of
> a buffer.

I think there is a trade-off between a perfect snapshot and memory
consumption, and the right balance depends on the use case.

>
> So no, I will _not_ support reading an over-write buffer while there is
> an active writer.
>

I hope you will reconsider how useful an overwrite buffer can be, even if
it is far from perfect.

Thank you,

--
Masami HIRAMATSU
2nd Research Dept.
Hitachi, Ltd., Systems Development Laboratory
E-mail: masami.hiramatsu.pt(a)hitachi.com
From: Frederic Weisbecker on
On Fri, Aug 06, 2010 at 11:50:40AM +0200, Peter Zijlstra wrote:
> On Fri, 2010-08-06 at 15:18 +0900, Masami Hiramatsu wrote:
> > Peter Zijlstra wrote:
> > > On Wed, 2010-08-04 at 10:45 -0400, Mathieu Desnoyers wrote:
> > >
> > >> How do you plan to read the data concurrently with the writer overwriting the
> > >> data while you are reading it without corruption?
> > >
> > > I don't consider reading while writing (in overwrite mode) a valid case.
> > >
> > > If you want to use overwrite, stop the writer before reading it.
> >
> > For example, would you always want to have to stop auditing before you
> > could read the system audit log?
> >
> > No, that is one of the most important requirements for tracers,
> > especially for system admins (the most important users of Linux), so
> > they can check system health and catch problems.
> >
> > For performance measurement and hotspot analysis, one-shot tracing is
> > enough, but that is only for developers. In real-world computing, Linux
> > is just an OS; users want to run their systems, middleware, and
> > applications without trouble, and when they do hit trouble they want to
> > shoot it ASAP.
> > The flight-recorder mode is mainly for those users.
>
> You cannot over-write and consistently read the buffer, that's plain
> impossible. With sub-buffers you can swivel a sub-buffer and
> consistently read that, but there is no guarantee the next sub-buffer
> you steal was indeed adjacent to the previous buffer you stole as that
> might have gotten over-written by the active writer while you were
> stealing the previous one.
>
> If you want to snapshot buffers, do that, simply swivel the whole trace
> buffer, and continue tracing in a new one, then consume the old trace in
> a consistent manner.
>
> I really see no value in being able to read unrelated bits and pieces of
> a buffer.



It all depends on the frequency of your events and on the amount of memory
used for the buffer.

If you are tracing syscalls on a semi-idle box with a ring buffer of 500 MB
per cpu, you really don't care about the writer catching up with the
reader: it will simply not happen.
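
(Back-of-envelope, with numbers that are purely illustrative: say ~1,000
syscall events per second per cpu at ~64 bytes each.)

    1,000 events/s * 64 bytes ~= 64 KB/s of trace data
    500 MB / 64 KB/s          ~= 8,000 s, i.e. roughly two hours to wrap

Any reader that is not completely stuck will have drained the buffer long
before the writer comes around again.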

OTOH, if you are tracing function graphs, no buffer size will ever be
enough: the writer will always be faster and will catch up with the reader.

Using the sub-buffer scheme, though, and allowing a concurrent writer and
reader in overwrite mode, we can easily tell the user that the writer was
faster and that some content has been lost. With that information, the
user can choose what to do: try again with a larger buffer, for example.

See? It's not our role to refuse the whole mode just because the results
might be unreliable when the user picks silly settings (not enough memory,
a reader that is too slow for whatever reason, events at too high a
frequency, and so on). Let the user deal with that and just inform them
when the results are unreliable. This is what ftrace currently does.

Also, the snapshot approach doesn't look like a replacement. If you are
tracing on a low-memory embedded system, keeping a snapshot alive consumes
a lot of memory, which means the live buffer may have to be critically
shrunk, and you might in turn lose traces there.
That said, it's an interesting feature that may fit other kinds of
environments or other needs.


Off-topic: it's sad that, when it comes to tracing, we often have to
figure out the needs of the embedded world second-hand, or learn about
them from indirect sources. We rarely hear from those users directly,
except maybe at conferences...

From: Steven Rostedt on
Egad! Go on vacation and the world falls apart.

On Wed, 2010-08-04 at 08:27 +0200, Peter Zijlstra wrote:
> On Tue, 2010-08-03 at 11:56 -0700, Linus Torvalds wrote:
> > On Tue, Aug 3, 2010 at 10:18 AM, Peter Zijlstra <peterz(a)infradead.org> wrote:
> > >
> > > FWIW I really utterly detest the whole concept of sub-buffers.
> >
> > I'm not quite sure why. Is it something fundamental, or just an
> > implementation issue?
>
> The sub-buffer thing that both ftrace and lttng have is creating a large
> buffer from a lot of small buffers; I simply don't see the point of
> doing that. It adds complexity and limitations for very little gain.

So, I want to allocate a 10 MB buffer. I need to make sure the kernel has
10 MB of contiguous memory available. If memory is quite fragmented, then
too bad, I lose out.

Oh wait, I could also use vmalloc. But then again, now I'm blasting
valuable TLB entries for a tracing utility, making the tracer have an even
bigger impact on the entire system.

BAH!

I originally wanted to go with a contiguous buffer, but after trying to
implement it I was convinced it was a bad choice, specifically because it
either 1) needs a large amount of contiguous memory, or 2) eats up TLB
entries and makes the system perform worse.

I chose page-size "sub-buffers" to solve the above. That also made
implementing splice trivial. OK, I admit, I never thought about mmapping
the buffers, just because I figured splice was faster. But I do have
patches that let a user mmap the entire ring buffer, though only in a
"producer/consumer" mode.

Note, I use page-size sub-buffers, but the design could work with
sub-buffers of any size. I just never implemented that (even though, when
I wrote the code, it was secretly on my todo list).
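
Roughly the idea, much simplified (this is not the actual ring_buffer
code, just a sketch of the allocation scheme): grab the buffer one page at
a time and chain the pages together, so no high-order contiguous
allocation and no vmalloc mapping is ever needed.

#include <linux/errno.h>
#include <linux/gfp.h>
#include <linux/list.h>
#include <linux/slab.h>

struct subbuf {
	struct list_head list;
	void *page;		/* one PAGE_SIZE sub-buffer */
};

/* Allocate nr page-size sub-buffers onto a list (error unwinding omitted). */
static int alloc_subbufs(struct list_head *pages, int nr)
{
	int i;

	for (i = 0; i < nr; i++) {
		struct subbuf *s = kzalloc(sizeof(*s), GFP_KERNEL);

		if (!s)
			return -ENOMEM;
		s->page = (void *)__get_free_page(GFP_KERNEL);
		if (!s->page) {
			kfree(s);
			return -ENOMEM;
		}
		list_add_tail(&s->list, pages);
	}
	return 0;
}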


>
> Their benefit is known synchronization points in the stream: you can
> parse each sub-buffer independently. But you can always break a
> continuous stream up into smaller parts, or use a transport that
> includes index points or whatever.
>
> Their downside is that you can never have individual events larger than
> the sub-buffer, and you need to be aware of the sub-buffer when
> reserving space, etc.

The answer to that is to make a macro (or helper) that does the assignment
into the event, and add a new API:

event = ring_buffer_reserve_unlimited();

ring_buffer_assign(event, data1);
ring_buffer_assign(event, data2);

ring_buffer_commit(event);

ring_buffer_reserve_unlimited() could reserve a chunk of space larger than
a single sub-buffer, reserving it in fragments. ring_buffer_assign() could
then either copy directly into the event (if the event fits within one
sub-buffer) or do a fragment-by-fragment copy if the space was fragmented.

Of course, userspace would need to know how to read it. And it can get
complex when interrupts come in and reserve space between fragments, or
when a partial fragment is overwritten. But none of that is impossible to
solve.
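
Purely illustrative, with made-up structures, but one way
ring_buffer_assign() could cope with an event whose reserved space is
split across sub-buffers: keep a small fragment table in the event
descriptor and spill the copy into the next fragment as needed.

#include <linux/kernel.h>
#include <linux/string.h>

struct rb_fragment {
	void	*addr;		/* piece of reserved space in one sub-buffer */
	size_t	len;
};

struct rb_event_desc {
	struct rb_fragment	*frags;
	int			nr_frags;
	int			cur;		/* fragment currently being filled */
	size_t			cur_off;	/* write offset into that fragment */
};

/* Copy len bytes of payload; assumes the reserve covered at least len bytes. */
static void ring_buffer_assign(struct rb_event_desc *ev, const void *data,
			       size_t len)
{
	const char *src = data;

	while (len) {
		struct rb_fragment *f = &ev->frags[ev->cur];
		size_t room = f->len - ev->cur_off;
		size_t n = min(len, room);

		memcpy((char *)f->addr + ev->cur_off, src, n);
		src += n;
		len -= n;
		ev->cur_off += n;
		if (ev->cur_off == f->len) {	/* fragment full, move on */
			ev->cur++;
			ev->cur_off = 0;
		}
	}
}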

-- Steve



From: Steven Rostedt on
On Fri, 2010-08-06 at 10:13 -0400, Mathieu Desnoyers wrote:
> * Peter Zijlstra (peterz(a)infradead.org) wrote:

> Less code = less instruction cache overhead. I've also shown that the LTTng
> code is at least twice as fast. In terms of complexity, it is not much more
> complex; I also took the extra care of doing formal proofs to make sure the
> corner cases were dealt with, which I don't reckon either Steven or you
> have done.

Yes Mathieu, you did a formal proof. Good for you. But honestly, it is
starting to get very annoying to hear you constantly state that, because,
to most kernel developers, it is meaningless. Any slight modification of
your algorithm renders the proof invalid.

You are not the only one who has done a proof of an algorithm in the
kernel, but you are definitely the only one who constantly reminds people
of it. Congrats on your PhD; in academia, proofs are important.

But this is a ring buffer, not a critical part of the workings of the
kernel. There are much more critical and fragile parts of the kernel
that work fine without a formal proof.

Paul McKenney did a proof for RCU not for us, but just to give himself a
warm fuzzy about it. RCU is much more complex than the ftrace ring buffer,
and it is also much more critical. If Paul gets it wrong, a machine will
crash. He's right to worry. And even Paul told me that no formal proof
makes up for large-scale testing, which, by the way, the ftrace ring
buffer has gone through.

Someday I may go ahead and do that proof, but I did work out a very
thorough state diagram, and I'm quite confident that the design works. It
has been deployed for quite a while now, and the design has yet to be a
factor in any bug report against the ring buffer.

-- Steve

