From: Linus Torvalds on 3 Aug 2010 15:10

On Tue, Aug 3, 2010 at 10:18 AM, Peter Zijlstra <peterz(a)infradead.org> wrote:
>
> FWIW I really utterly detest the whole concept of sub-buffers.

I'm not quite sure why. Is it something fundamental, or just an
implementation issue?

One thing that I think could easily make sense in a _lot_ of buffering
areas is the notion of a "continuation" buffer. We know we have cases
where we want to attach a lot of data to one particular event, but the
buffering itself is inevitably always going to have some limits on
atomicity etc. And quite often, the event that _generates_ the data is
not necessarily going to have all that data in one contiguous region,
and doing a scatter-gather memcpy to get it that way is not good
either.

At the same time, I do _not_ believe that the kernel ring-buffer code
should handle pointers to sub-buffers etc, or worry about iovec-like
arrays of smaller ranges. So if _that_ is what you mean by "concept of
sub-buffers", then I agree with you.

But what I do think might make a lot of sense is to allow buffer
fragments, and just teach user space to do de-fragmentation. Where it
would be important that the de-fragmentation really is all in user
space, and not really ever visible to the ring-buffer implementation
itself (and there would not, for example, be any guarantees that the
fragments would be contiguous - there could be other events in the
buffer in between fragments).

Maybe we could even say that fragments might be across different CPU
ring-buffers, and user-space needs to sort it out if it wants to
(where "sort it out" literally would mean having to sort and re-attach
them in the right order, since there wouldn't be any ordering between
them).

From a kernel perspective, the only thing you need for fragment
handling would be to have a buffer entry that just says "I'm fragment
number X of event ID Y". Nothing more. Everything else would be up to
the parser in user space to work out.

In other words - if you have something like the current situation,
where you want to save a whole back-trace, INSTEAD of allocating a
large max-sized buffer for it and "linearizing" the back-trace in
order to then create a backtrace ring event, maybe we could just fill
the ring buffer with lots of small fragments, and do the whole
linearizing in the code that reads it in user space. No temporary
allocations in kernel space at all, no memcpy, let user space sort it
out. Each stack level would just add its own event, and increment the
fragment count it uses.

It's going to be a fairly rare case, so some user space parsers might
just decide to ignore fragmented packets, because they know they
aren't interested in such "complex" events.

I dunno. This thread has kind of devolved into many different details,
and I reacted to just one very small fragment of it. Maybe not even a
very interesting fragment.

		Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
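
The user-space half of the fragment idea in the message above can be
sketched roughly as follows. This is only an illustration under stated
assumptions: the `Fragment` record layout and the `defragment` helper
are hypothetical names, not part of any existing perf or ftrace
interface. The only thing taken from the mail is that each buffer
entry says "I'm fragment number X of event ID Y" and that fragments
may arrive in any order, possibly from different CPU buffers.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class Fragment(NamedTuple):
    """Hypothetical ring-buffer record: 'fragment number X of event ID Y'."""
    event_id: int   # Y: the logical event this piece belongs to
    frag_no: int    # X: position of this piece within the event
    payload: bytes  # e.g. one stack level of a back-trace


def defragment(records: Iterable[Fragment]) -> Dict[int, bytes]:
    """Reassemble events from fragments that arrive in arbitrary order.

    Fragments of different events may be interleaved, and fragments of
    one event may come from different per-CPU buffers, so the only
    ordering information used is (event_id, frag_no).
    """
    by_event: Dict[int, List[Fragment]] = defaultdict(list)
    for rec in records:
        by_event[rec.event_id].append(rec)

    # "Sort it out" literally: order each event's pieces, then join them.
    return {
        event_id: b"".join(
            f.payload for f in sorted(frags, key=lambda f: f.frag_no)
        )
        for event_id, frags in by_event.items()
    }


# Fragments of event 7 are out of order and interleaved with event 9.
records = [
    Fragment(7, 1, b"frame1;"),
    Fragment(9, 0, b"other"),
    Fragment(7, 0, b"frame0;"),
    Fragment(7, 2, b"frame2;"),
]
events = defragment(records)
print(events[7])
```

The mail does not say how a reader would learn the total number of
fragments in an event, so this sketch makes no attempt to detect a
missing fragment; a real parser would need some convention for that.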
From: Mathieu Desnoyers on 3 Aug 2010 15:50

* Linus Torvalds (torvalds(a)linux-foundation.org) wrote:
> On Tue, Aug 3, 2010 at 10:18 AM, Peter Zijlstra <peterz(a)infradead.org> wrote:
> >
> > FWIW I really utterly detest the whole concept of sub-buffers.
>
> I'm not quite sure why. Is it something fundamental, or just an
> implementation issue?

The real issue here, IMHO, is that Perf has tied gory ring buffer
implementation details to the userspace perf ABI, and there is now
strong unwillingness from Perf developers to break this ABI.

About the sub-buffer definition: it only means that a buffer is split
into many regions. Their boundaries are synchronization points between
the data producer and consumer. This involves padding the end of
regions when records do not fit in the remaining space.

I think that the problem lies in that Peter wants all his ring-buffer
data to be side by side, without padding. He needs this because the
perf ABI, presented to the user-space perf program, requires this:
every implementation detail is exposed to user-space through a mmap'd
memory region (yeah, even the control data is touched by both the
kernel and userland through that shared page).

When Perf was initially proposed, I thought that because the perf
user-space tool is shipped along with the kernel sources, we could
change the ABI easily afterward, but Peter seems to disagree and wants
it to stay as it is for backward compatibility and not offending
contributors. If I had known this when the ABI first came in, I would
have surely nack'd it.

Now we are stuck with this ABI which exposes every tiny ring buffer
implementation detail to userspace, which simply kills any future
enhancement.

Thanks,

Mathieu

P.S.: I'm holding back reply to the rest of your email to increase
focus on the fundamental perf ABI problem.

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com
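
The padding rule in the sub-buffer definition above (pad out the end
of a region when a record does not fit in the remaining space) comes
down to a little offset arithmetic. A minimal sketch, assuming
fixed-size sub-buffers and a single logical write offset; `reserve`
and its parameters are hypothetical names chosen for illustration, not
taken from LTTng, ftrace or perf:

```python
from typing import Tuple


def reserve(offset: int, record_size: int, subbuf_size: int) -> Tuple[int, int]:
    """Return (start, padding) for a record reserved at `offset`.

    `padding` is the number of bytes left unused at the end of the
    current sub-buffer so that the record begins on the next
    sub-buffer boundary instead of straddling it.
    """
    if record_size > subbuf_size:
        raise ValueError("record larger than a sub-buffer")

    room = subbuf_size - (offset % subbuf_size)  # bytes left in this region
    padding = 0 if record_size <= room else room
    return offset + padding, padding


# 96 bytes remain in the first 4096-byte sub-buffer; a 100-byte record
# does not fit, so the 96 bytes are padded and the record starts at 4096.
print(reserve(4000, 100, 4096))
```

A layout with no padding at all, which is what the message says the
perf ABI requires, would instead let the record run across offset 4096.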
From: Ingo Molnar on 3 Aug 2010 16:20

* Linus Torvalds <torvalds(a)linux-foundation.org> wrote:

> On Tue, Aug 3, 2010 at 12:45 PM, Mathieu Desnoyers
> <mathieu.desnoyers(a)efficios.com> wrote:
> >
> > The real issue here, IMHO, is that Perf has tied gory ring buffer
> > implementation details to the userspace perf ABI, and there is now strong
> > unwillingness from Perf developers to break this ABI.

(Wrong.)

> The thing is - I think my outlined buffer fragmentation model would work
> fine with the perf ABI too. Exactly because there is no deep structure,
> just the same "stream of small events" both from a kernel and a user model
> standpoint. Sure, the stream would now contain a new event type, but that's
> trivial. It would still be _entirely_ reasonable to have the actual data in
> the exact same ring buffer, including the whole mmap'ed area.

Yeah.

> Of course, when user space actually parses it, user space would have to
> eventually defragment the event by allocating a new area and copying the
> fragments together in the right order, but that's pretty trivial to do. It
> certainly doesn't affect the current mmap'ed interface in the least.
>
> Now, whether the perf people feel they want that kind of functionality, I
> don't know. It's possible that they simply do not want to handle events that
> are complex enough that they would have arbitrary size.

Looks useful. There's a steady trickle of new events and we already use type
encapsulation for things like trace events - which are only made sense of
later on in user-space.

We may want to add things like a NOP event to pad out the end of page

Thanks,

	Ingo
From: Ingo Molnar on 3 Aug 2010 16:30

* Ingo Molnar <mingo(a)elte.hu> wrote:

>
> * Linus Torvalds <torvalds(a)linux-foundation.org> wrote:
>
> > On Tue, Aug 3, 2010 at 12:45 PM, Mathieu Desnoyers
> > <mathieu.desnoyers(a)efficios.com> wrote:
> > >
> > > The real issue here, IMHO, is that Perf has tied gory ring buffer
> > > implementation details to the userspace perf ABI, and there is now strong
> > > unwillingness from Perf developers to break this ABI.
>
> (Wrong.)
>
> > The thing is - I think my outlined buffer fragmentation model would work
> > fine with the perf ABI too. Exactly because there is no deep structure,
> > just the same "stream of small events" both from a kernel and a user model
> > standpoint. Sure, the stream would now contain a new event type, but that's
> > trivial. It would still be _entirely_ reasonable to have the actual data in
> > the exact same ring buffer, including the whole mmap'ed area.
>
> Yeah.
>
> > Of course, when user space actually parses it, user space would have to
> > eventually defragment the event by allocating a new area and copying the
> > fragments together in the right order, but that's pretty trivial to do. It
> > certainly doesn't affect the current mmap'ed interface in the least.
> >
> > Now, whether the perf people feel they want that kind of functionality, I
> > don't know. It's possible that they simply do not want to handle events that
> > are complex enough that they would have arbitrary size.
>
> Looks useful. There's a steady trickle of new events and we already use type
> encapsulation for things like trace events - which are only made sense of
> later on in user-space.
>
> We may want to add things like a NOP event to pad out the end of page

/me once again experiences the subtle difference between 'Y' and 'N' when
postponing a mail

So adding fragments would be possible as well. We've got the space for such
extensions in the ABI and the basic model of streaming information is not
affected.

[ The control structure of the mmap area is there for performance/wakeup
  optimizations (and to allow the kernel to lose information on producer
  overload, while still giving user-space an idea that we lost data and how
  much) - it does not affect semantics and does not limit us. ]

So there's no design limitation - Peter simply prefers one possible solution
over another and outlined his reasons - we should hash that out based on the
technical arguments.

Thanks,

	Ingo
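
The point that new record types (a fragment record, a NOP pad) fit
into a "stream of small events" without disturbing existing readers
can be illustrated with a small parser. This is a sketch under
assumptions: the 8-byte header below is loosely modeled on the shape
of `struct perf_event_header` (a 32-bit type, 16-bit misc, 16-bit size
that counts the header itself), the little-endian encoding is chosen
for the example, and the type numbers `EV_SAMPLE`, `EV_NOP` and
`EV_FRAGMENT` are hypothetical and do not correspond to real
`PERF_RECORD_*` values.

```python
import struct
from typing import Iterator, Tuple

# type (u32), misc (u16), size (u16); size includes the header itself.
HEADER = struct.Struct("<IHH")

EV_SAMPLE = 1      # hypothetical type numbers, for illustration only
EV_NOP = 100       # padding record, carries no information
EV_FRAGMENT = 101  # "fragment X of event Y" record

KNOWN = {EV_SAMPLE}  # what this particular (older) parser understands


def pack(ev_type: int, payload: bytes = b"") -> bytes:
    """Build one record: header followed by payload."""
    return HEADER.pack(ev_type, 0, HEADER.size + len(payload)) + payload


def parse(stream: bytes) -> Iterator[Tuple[int, bytes]]:
    """Yield (type, payload) for known records and step over the rest.

    Because every record carries its own size, a parser that has never
    heard of EV_NOP or EV_FRAGMENT can still skip them safely.
    """
    pos = 0
    while pos + HEADER.size <= len(stream):
        ev_type, _misc, size = HEADER.unpack_from(stream, pos)
        if size < HEADER.size or pos + size > len(stream):
            raise ValueError("corrupt or truncated record")
        if ev_type in KNOWN:
            yield ev_type, stream[pos + HEADER.size:pos + size]
        pos += size


stream = (
    pack(EV_SAMPLE, b"ab")
    + pack(EV_NOP, b"\0" * 8)    # e.g. padding out the end of a page
    + pack(EV_FRAGMENT, b"zz")   # ignored by parsers that don't care
    + pack(EV_SAMPLE, b"cd")
)
print(list(parse(stream)))
```

A parser that does want fragmented events would add `EV_FRAGMENT` to
its known set and hand those payloads to a reassembly step; one that
does not can ignore them, as suggested earlier in the thread.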
From: Linus Torvalds on 3 Aug 2010 16:40

On Tue, Aug 3, 2010 at 12:45 PM, Mathieu Desnoyers
<mathieu.desnoyers(a)efficios.com> wrote:
>
> The real issue here, IMHO, is that Perf has tied gory ring buffer implementation
> details to the userspace perf ABI, and there is now strong unwillingness from
> Perf developers to break this ABI.

The thing is - I think my outlined buffer fragmentation model would
work fine with the perf ABI too. Exactly because there is no deep
structure, just the same "stream of small events" both from a kernel
and a user model standpoint.

Sure, the stream would now contain a new event type, but that's
trivial. It would still be _entirely_ reasonable to have the actual
data in the exact same ring buffer, including the whole mmap'ed area.

Of course, when user space actually parses it, user space would have
to eventually defragment the event by allocating a new area and
copying the fragments together in the right order, but that's pretty
trivial to do. It certainly doesn't affect the current mmap'ed
interface in the least.

Now, whether the perf people feel they want that kind of
functionality, I don't know. It's possible that they simply do not
want to handle events that are complex enough that they would have
arbitrary size.

		Linus