From: Mathieu Desnoyers on
* Peter Zijlstra (peterz(a)infradead.org) wrote:
> On Wed, 2010-08-04 at 10:45 -0400, Mathieu Desnoyers wrote:
>
> > How do you plan to read the data concurrently with the writer overwriting the
> > data while you are reading it without corruption ?
>
> I don't consider reading while writing (in overwrite mode) a valid case.
>
> If you want to use overwrite, stop the writer before reading it.

How inconvenient. It happens that the relatively large group of users I am
working for does care about this use-case. They cannot afford to stop tracing
as soon as they hit one bug. The "bug" could be a simple odd scenario that they
want to snapshot, but in all cases they want tracing to continue.

>
> > I think that the stack dump
> > should simply be saved directly to the ring buffer, without copy. The
> > dump_stack() functions might have to be extended so they don't just save text
> > dumbly, but can also be used to save events into the trace in binary format,
> > perhaps with the continuation cookie Linus was proposing.
>
> Because I don't want to support truncating reservations (because that
> leads to large nops for nested events)

Agreed in this case. Truncating reservations might make sense for filtering, but
even there I have a strong preference for filtering directly on the information
received as parameters, before performing buffer space reservation, whenever
possible.

> and when the event needs to go to
> multiple buffers you can re-use the stack-dump without having to do the
> unwind again.
>
> The problem with the continuation thing Linus suggested is that it would
> bloat the output 3 fold. A stack entry is a single u64. If you want to
> wrap that in a continuation event you need: a header (u64), a cookie
> (u64) and the entry (u64).

Agreed, it's probably not such a good fit for these small pieces of information.

>
> Continuation events might make heaps of sense for larger data pieces,
> but I don't see them being practical for such small pieces.

Yep.

What I did in a past life, in earlier LTTng versions, was to use a 2-pass unwind.
The first pass is the most costly because it brings all the data into the L1
cache; it is used to compute the array size needed to save the whole call stack,
but it copies nothing. The second pass performs the copy. This was surprisingly
efficient.
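
A rough sketch of the idea (the unwind and ring buffer helpers here are purely
illustrative, not an actual LTTng or kernel API):

	/* Illustrative only: unwind_*() and ring_buffer_*() are hypothetical. */
	static void trace_stack_two_pass(struct ring_buffer *rb, struct pt_regs *regs)
	{
		struct unwind_state state;
		unsigned long *dst;
		unsigned int depth = 0;

		/* Pass 1: walk the stack just to count frames (and warm the cache). */
		for (unwind_start(&state, regs); !unwind_done(&state);
		     unwind_next_frame(&state))
			depth++;

		/* Reserve exactly the space needed for 'depth' return addresses. */
		dst = ring_buffer_reserve(rb, depth * sizeof(unsigned long));
		if (!dst)
			return;

		/* Pass 2: walk again, copying the now cache-hot frames. */
		for (unwind_start(&state, regs); !unwind_done(&state);
		     unwind_next_frame(&state))
			*dst++ = unwind_get_return_address(&state);

		ring_buffer_commit(rb);
	}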

Thanks,

Mathieu

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com
From: Mathieu Desnoyers on
* Peter Zijlstra (peterz(a)infradead.org) wrote:
> On Wed, 2010-08-04 at 10:06 -0400, Mathieu Desnoyers wrote:
>
> > The first major gain is the ability to implement flight recorder tracing
> > (overwrite mode), which Perf still lacks.
>
> http://lkml.org/lkml/2009/7/6/178
>
> I've sent out something like that several times, but nobody took it
> (that is, tested it and provided a user). Note how it doesn't require
> anything like sub-buffers.

+static void perf_output_tail(struct perf_mmap_data *data, unsigned int head)
....
+ unsigned long tail, new;
....
+ unsigned long size;

+ while (tail + size - head < 0) {
.....
+ }

How is the while condition ever supposed to be true ? I guess nobody took it
because it simply was not ready for testing.
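
To spell out why: tail, size and head are all unsigned long, so
"tail + size - head" is evaluated in unsigned arithmetic and can never be
negative. A trivial user-space illustration:

	#include <stdio.h>

	int main(void)
	{
		unsigned long tail = 0, size = 64, head = 4096;

		/*
		 * Unsigned subtraction wraps around rather than going negative,
		 * so this condition is always false (compilers typically warn
		 * "comparison of unsigned expression < 0 is always false").
		 */
		if (tail + size - head < 0)
			printf("never reached\n");
		else
			printf("tail + size - head = %lu\n", tail + size - head);

		return 0;
	}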

>
> > A second major gain: having these sub-buffers lets the trace analyzer seek in
> > the trace very efficiently by allowing it to perform a binary search for time to
> > find the appropriate sub-buffer. It becomes immensely useful with large traces.
>
> You can add sync events with a specific magic cookie in. Once you find
> the cookie you can sync and start reading it reliably

You need to read the whole trace to find these cookies (even if it is just once
at the beginning, if you create an index). My experience with users has shown me
that the delay between stopping trace gathering and having the data shown to the
user is very important, because this is done repeatedly while debugging a
problem, and it is time the user spends sitting in front of the screen, waiting.

> -- the advantage
> is that sync events are very easy to have as an option and don't
> complicate the reserve path.

Perf, on its reserve/commit fast paths:

perf_output_begin: 543 bytes
(perf_output_get_handle is inlined)

perf_output_put_handle: 201 bytes
perf_output_end: 77 bytes
calls perf_output_put_handle

Total for perf: 821 bytes

Generic Ring Buffer Library reserve/commit fast paths:

Reserve: 511 bytes
Commit: 266 bytes
Total for Generic Ring Buffer: 777 bytes

So the generic ring buffer is not only faster; it also supports sub-buffers
(along with all the nice features they bring), and its reserve and commit hot
paths fit in fewer instructions: it is *less* complicated than Perf's.


>
> > The third major gain: for live streaming of traces, having sub-buffer lets you
> > "package" the event data you send over the network into sub-buffers.
>
> See the sync events.

I am guessing you plan to rely on these sync events to know which data "blocks"
are fully received. This could possibly be made to work.

> Also, a transport can rewrite the stream any which
> way it pretty well wants to, as long as the kernel<->user interface is
> reliable, an unreliable user<->user transport can repackage it to suit
> its needs.

repackage = copy = poor performance. No thanks.

>
> > Making sure events don't cross sub-buffer boundaries simplify a lot of things,
> > starting with dealing with "overwritten" sub-buffers in flight recorder mode.
> > Trying to deal with a partially overwritten event is just insane.
>
> See the above patch, simply parse the events and push the tail pointer
> ahead of the reservation before you trample on it.

I'm not sure that patch is ready for prime time yet. As you point out in your
follow-up email, you need to stop tracing to consume the data, which does not
fit my users' use-cases.

>
> If you worry about the cost of parsing the events, you can amortize that
> by things like keeping the offset of the first event in every page in
> the pageframe, or the offset of the next sync event or whatever scheme
> you want.

Hrm? AFAIK, the page frame is an internal kernel-only data structure. It won't
be exported to user-space, so how exactly is the parser supposed to see this
information to help speed up parsing?

>
> Again, no need for sub-buffers.

I don't see this claim as satisfactorily supported here, sorry.

>
> Also, not having sub-buffers makes reservation easier since you don't
> need to worry about those empty tails.

So far I've shown that your sub-buffer-less implementation is not even simpler
than an implementation using sub-buffers.

By the way, even with your sub-buffer-free scheme, you cannot write an event
bigger than your buffer size. So you have a similar limitation in terms of
maximum event size (and you already have to test for this on your fast path).
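
To illustrate (a sketch with made-up names; not the actual perf or ring buffer
library code):

	/* Sketch only: 'struct rb' and its fields are made up for illustration. */
	static void *rb_reserve(struct rb *rb, size_t size)
	{
		/*
		 * With or without sub-buffers, an event larger than the buffer
		 * (or sub-buffer) can never fit, so the reserve fast path has to
		 * check the size up front either way.
		 */
		if (unlikely(size > rb->max_event_size))
			return NULL;	/* reject oversized events up front */

		/* ... actual space reservation elided ... */
		return NULL;
	}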

Thanks,

Mathieu

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com
From: Masami Hiramatsu on
Peter Zijlstra wrote:
> On Wed, 2010-08-04 at 10:45 -0400, Mathieu Desnoyers wrote:
>
>> How do you plan to read the data concurrently with the writer overwriting the
>> data while you are reading it without corruption ?
>
> I don't consider reading while writing (in overwrite mode) a valid case.
>
> If you want to use overwrite, stop the writer before reading it.

For example, would you want to have to stop the audit daemon every time
you read the system audit log?

No. Reading while tracing continues is one of the most important requirements
for tracers, especially for system admins (they're the most important users
of Linux), who need to check system health and catch system troubles.

For performance measurement and finding hotspots, one-shot tracing is enough.
But that is just for developers. In real-world computing, Linux is just an OS:
users want to run their system, middleware and applications without trouble.
But when they do hit a problem, they want to shoot it ASAP.
The flight recorder mode is mainly for those users.

Thank you,

--
Masami HIRAMATSU
2nd Research Dept.
Hitachi, Ltd., Systems Development Laboratory
E-mail: masami.hiramatsu.pt(a)hitachi.com
From: Peter Zijlstra on
On Fri, 2010-08-06 at 15:18 +0900, Masami Hiramatsu wrote:
> Peter Zijlstra wrote:
> > On Wed, 2010-08-04 at 10:45 -0400, Mathieu Desnoyers wrote:
> >
> >> How do you plan to read the data concurrently with the writer overwriting the
> >> data while you are reading it without corruption ?
> >
> > I don't consider reading while writing (in overwrite mode) a valid case.
> >
> > If you want to use overwrite, stop the writer before reading it.
>
> For example, would you want to have to stop the audit daemon every time
> you read the system audit log?
>
> No. Reading while tracing continues is one of the most important requirements
> for tracers, especially for system admins (they're the most important users
> of Linux), who need to check system health and catch system troubles.
>
> For performance measurement and finding hotspots, one-shot tracing is enough.
> But that is just for developers. In real-world computing, Linux is just an OS:
> users want to run their system, middleware and applications without trouble.
> But when they do hit a problem, they want to shoot it ASAP.
> The flight recorder mode is mainly for those users.

You cannot over-write and consistently read the buffer; that is plain
impossible. With sub-buffers you can swivel a sub-buffer and
consistently read that, but there is no guarantee the next sub-buffer
you steal was indeed adjacent to the previous one you stole, as that
might have gotten over-written by the active writer while you were
stealing the previous one.

If you want to snapshot buffers, do that, simply swivel the whole trace
buffer, and continue tracing in a new one, then consume the old trace in
a consistent manner.
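
Something along these lines (made-up structures and helpers; it assumes
writers access ctx->active under rcu_read_lock()):

	/*
	 * Sketch only: 'struct trace_ctx' and 'struct trace_buffer' are made up.
	 */
	struct trace_buffer *trace_snapshot(struct trace_ctx *ctx,
					    struct trace_buffer *spare)
	{
		struct trace_buffer *old;

		/* Swivel: publish the spare buffer, new events go there. */
		old = xchg(&ctx->active, spare);

		/* Wait for writers still writing into the old buffer to finish. */
		synchronize_rcu();

		/* 'old' is now quiescent and can be consumed consistently. */
		return old;
	}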

I really see no value in being able to read unrelated bits and pieces of
a buffer.

So no, I will _not_ support reading an over-write buffer while there is
an active writer.
From: Peter Zijlstra on
On Thu, 2010-08-05 at 21:42 -0400, Mathieu Desnoyers wrote:
> * Peter Zijlstra (peterz(a)infradead.org) wrote:
> > On Wed, 2010-08-04 at 10:06 -0400, Mathieu Desnoyers wrote:
> >
> > > The first major gain is the ability to implement flight recorder tracing
> > > (overwrite mode), which Perf still lacks.
> >
> > http://lkml.org/lkml/2009/7/6/178
> >
> > I've sent out something like that several times, but nobody took it
> > (that is, tested it and provided a user). Note how it doesn't require
> > anything like sub-buffers.

> How is the while condition ever supposed to be true ? I guess nobody took it
> because it simply was not ready for testing.

I know; I never claimed it was. It was always an illustration of how to
accomplish it. But then, nobody found it important enough to finish.

> > > A second major gain: having these sub-buffers lets the trace analyzer seek in
> > > the trace very efficiently by allowing it to perform a binary search for time to
> > > find the appropriate sub-buffer. It becomes immensely useful with large traces.
> >
> > You can add sync events with a specific magic cookie in. Once you find
> > the cookie you can sync and start reading it reliably
>
> You need to read the whole trace to find these cookies (even if it is just once
> at the beginning if you create an index).

Depends on what you want to do; you can start reading at any point in
the stream and be guaranteed to find a sync point within
sync-distance + max-event-size.
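
Roughly (the cookie value and layout below are made up, and it assumes sync
cookies are emitted 8-byte aligned):

	#include <string.h>
	#include <stdint.h>
	#include <sys/types.h>

	#define SYNC_MAGIC	0x53594e43aa55aa55ULL	/* arbitrary example cookie */

	/*
	 * Scan forward from an arbitrary offset for the next sync cookie; the
	 * writer emits one at least every 'sync-distance' bytes, so the scan
	 * is bounded by sync-distance + max-event-size.
	 */
	static ssize_t find_sync(const unsigned char *buf, size_t len, size_t start)
	{
		uint64_t v;
		size_t off;

		for (off = start & ~(size_t)7; off + sizeof(v) <= len; off += 8) {
			memcpy(&v, buf + off, sizeof(v));
			if (v == SYNC_MAGIC)
				return off;	/* resume parsing from here */
		}
		return -1;			/* no sync point in this window */
	}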

> My experience with users has shown me that the delay between stopping trace
> gathering and having the data shown to the user is very important, because
> this is done repeatedly while debugging a problem, and it is time the user
> spends sitting in front of the screen, waiting.

Yeah, because after having had to wait 36h for the problem to trigger,
that extra minute really kills.

All I can say is that in my experience brain throughput is the limiting
factor in debugging. Not some ability to draw fancy pictures.

> > -- the advantage
> > is that sync events are very easy to have as an option and don't
> > complicate the reserve path.
>
> Perf, on its reserve/commit fast paths:
>
> perf_output_begin: 543 bytes
> (perf_output_get_handle is inlined)
>
> perf_output_put_handle: 201 bytes
> perf_output_end: 77 bytes
> calls perf_output_put_handle
>
> Total for perf: 821 bytes
>
> Generic Ring Buffer Library reserve/commit fast paths:
>
> Reserve: 511 bytes
> Commit: 266 bytes
> Total for Generic Ring Buffer: 777 bytes
>
> So the generic ring buffer is not only faster; it also supports sub-buffers
> (along with all the nice features they bring), and its reserve and commit hot
> paths fit in fewer instructions: it is *less* complicated than Perf's.

All I can say is that less code doesn't equal less complex (nor faster
per se). Nor have I spent all my time on writing the ring-buffer;
there are more interesting things to do.

And the last time I ran perf on perf, the buffer wasn't the thing that
was taking most time.

And unlike what you claim below, it most certainly can deal with events
larger than a single page.

> > If you worry about the cost of parsing the events, you can amortize that
> > by things like keeping the offset of the first event in every page in
> > the pageframe, or the offset of the next sync event or whatever scheme
> > you want.
>
> Hrm? AFAIK, the page frame is an internal kernel-only data structure. It
> won't be exported to user-space, so how exactly is the parser supposed to see
> this information to help speed up parsing?

It's about the kernel parsing the buffer to push the tail ahead of the
reserve window, so that you have a reliable point to start reading the
trace from -- or didn't you actually get the intent of that patch?

