From: Linus Torvalds


On Wed, 19 May 2010, Miklos Szeredi wrote:
>
> And predictability is good. The thing I don't like about the above is
> that it makes it totally unpredictable which pages will get moved, if
> any.

Tough.

Think of it this way: it is predictable. They get predictably moved when
moving is cheap and easy. It's about _performance_.

Do you know when TLB misses happen? They are unpredictable. Do you know
when the OS sends IPIs around? Do you know when scheduling happens?

No you don't. So stop whining.

Linus
From: Nick Piggin
On Wed, May 19, 2010 at 11:57:32AM -0400, Mathieu Desnoyers wrote:
> * Steven Rostedt (rostedt(a)goodmis.org) wrote:
> > On Wed, 2010-05-19 at 17:33 +0200, Miklos Szeredi wrote:
> > > On Wed, 19 May 2010, Linus Torvalds wrote:
> > > > Btw, since you apparently have a real case - is the "splice to file"
> > > > always just an append? IOW, if I'm not right in assuming that the only
> > > > sane thing people would reasonably care about is "append to a file", then
> > > > holler now.
> > >
> > > Virtual machines might reasonably need this for splicing to a disk
> > > image.
> >
> > This comes down to balancing speed and complexity. Perhaps a copy is
> > fine in this case.
> >
> > I'm concerned about high-speed tracing, where we are always just taking
> > pages from the trace ring buffer and appending them to a file or sending
> > them off to the network. The slower this is, the more likely you will
> > lose events.
> >
> > If the "move only on append to file" is easy to implement, I would
> > really like to see that happen. The speed of splicing a disk image for a
> > virtual machine only impacts the patience of the user. The speed of
> > splicing tracing output impacts how much you can trace without losing
> > events.
>
> I'm with Steven here. I only care about appending full pages to the end of a
> file. If possible, I'd also like to steal back the pages after waiting for the
> writeback I/O to complete so we can put them back in the ring buffer without
> stressing the page cache and the page allocator needlessly.

You've got to think about complexity and how much it's really worth
trying to speed up strange cases. The page allocator is the generic "pipe" in
the kernel to move pages between subsystems when they become unused :)

The page cache can be directed to be written out and discarded with
fadvise and such.
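
For the record, a minimal sketch of that pattern as seen from user space
(writeout_and_discard() is a made-up helper name; sync_file_range() and
posix_fadvise() are the standard Linux calls):

	#define _GNU_SOURCE
	#include <fcntl.h>

	/*
	 * Write out and drop one file range: start writeback, wait for
	 * it to complete, then tell the kernel the now-clean page cache
	 * pages can be discarded.
	 */
	static void writeout_and_discard(int fd, off_t offset, off_t len)
	{
		/* Starts writeout asynchronously, does not block. */
		sync_file_range(fd, offset, len, SYNC_FILE_RANGE_WRITE);
		/* Blocks until writeout of the range has completed. */
		sync_file_range(fd, offset, len,
				SYNC_FILE_RANGE_WAIT_BEFORE |
				SYNC_FILE_RANGE_WRITE |
				SYNC_FILE_RANGE_WAIT_AFTER);
		/* The pages are clean now, so this frees them cheaply. */
		posix_fadvise(fd, offset, len, POSIX_FADV_DONTNEED);
	}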

You might also consider using direct IO.
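
Direct IO meaning O_DIRECT, which bypasses the page cache entirely, so
there is nothing to write back or discard afterwards. A rough sketch
("out.img" and the 4096-byte alignment are placeholders; the real
alignment requirement is the logical block size of the underlying device):

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <stdlib.h>
	#include <string.h>
	#include <unistd.h>

	int main(void)
	{
		void *buf;
		int fd = open("out.img", O_WRONLY | O_CREAT | O_DIRECT, 0644);

		if (fd < 0)
			return 1;
		/* O_DIRECT requires the buffer, file offset and length
		 * to all be block aligned. */
		if (posix_memalign(&buf, 4096, 4096))
			return 1;
		memset(buf, 0, 4096);	/* ... a page's worth of data ... */
		if (write(fd, buf, 4096) != 4096)
			return 1;
		free(buf);
		close(fd);
		return 0;
	}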
From: Steven Rostedt
On Thu, 2010-05-20 at 01:55 +1000, Nick Piggin wrote:
> On Wed, May 19, 2010 at 11:45:42AM -0400, Steven Rostedt wrote:

> > If the "move only on append to file" is easy to implement, I would
> > really like to see that happen. The speed of splicing a disk image for a
> > virtual machine only impacts the patience of the user. The speed of
> > splicing tracing output impacts how much you can trace without losing
> > events.
>
It's not "easy" to implement :) What does your ring buffer look like?
Is it a normal user address range that the kernel does copy_to_user()-ish
things into? Or an mmapped special driver?

Neither ;-)

>
> If the latter, it gets even harder again. But either way, if the
> source pages just have to be regenerated anyway (e.g. via page fault
> on next access), then it might not even be worthwhile to do the
> splice move.

The ring buffer is written to by kernel events. To read it, user space can
either do a sys_read(), which copies the data, or use splice(). It does not
currently support mmap(), and if we were to add that, the same buffer would
then not support splice(). We have been talking about implementing both,
selected by flags at ring buffer allocation: one instance of the ring buffer
could support either mmap() or splice(), but not both.
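
For context, the splice() path on the consumer side looks roughly like
this (a sketch only: drain_to_file() is a made-up name, buf_fd stands for
whatever fd the tracer exposes, and splice() always needs a pipe as the
intermediate):

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <unistd.h>

	/* Move ring buffer pages into an output file without copying
	 * them through user space. */
	static int drain_to_file(int buf_fd, int out_fd)
	{
		int pipefd[2];
		ssize_t n;

		if (pipe(pipefd) < 0)
			return -1;
		/* ring buffer pages -> pipe */
		while ((n = splice(buf_fd, NULL, pipefd[1], NULL,
				   65536, SPLICE_F_MOVE)) > 0) {
			/* pipe -> file: this is the "splice to file"
			 * append case being discussed. */
			if (splice(pipefd[0], NULL, out_fd, NULL, n,
				   SPLICE_F_MOVE) < 0)
				break;
		}
		close(pipefd[0]);
		close(pipefd[1]);
		return n == 0 ? 0 : -1;
	}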

-- Steve
From: Mathieu Desnoyers
* Nick Piggin (npiggin(a)suse.de) wrote:
> On Wed, May 19, 2010 at 11:57:32AM -0400, Mathieu Desnoyers wrote:
> > * Steven Rostedt (rostedt(a)goodmis.org) wrote:
> > > On Wed, 2010-05-19 at 17:33 +0200, Miklos Szeredi wrote:
> > > > On Wed, 19 May 2010, Linus Torvalds wrote:
> > > > > Btw, since you apparently have a real case - is the "splice to file"
> > > > > always just an append? IOW, if I'm not right in assuming that the only
> > > > > sane thing people would reasonably care about is "append to a file", then
> > > > > holler now.
> > > >
> > > > Virtual machines might reasonably need this for splicing to a disk
> > > > image.
> > >
> > > This comes down to balancing speed and complexity. Perhaps a copy is
> > > fine in this case.
> > >
> > > I'm concerned about high-speed tracing, where we are always just taking
> > > pages from the trace ring buffer and appending them to a file or sending
> > > them off to the network. The slower this is, the more likely you will
> > > lose events.
> > >
> > > If the "move only on append to file" is easy to implement, I would
> > > really like to see that happen. The speed of splicing a disk image for a
> > > virtual machine only impacts the patience of the user. The speed of
> > > splicing tracing output impacts how much you can trace without losing
> > > events.
> >
> > I'm with Steven here. I only care about appending full pages to the end of a
> > file. If possible, I'd also like to steal back the pages after waiting for the
> > writeback I/O to complete so we can put them back in the ring buffer without
> > stressing the page cache and the page allocator needlessly.
>
> You've got to think about complexity and how much it's really worth
> trying to speed up strange cases. The page allocator is the generic "pipe" in
> the kernel to move pages between subsystems when they become unused :)
>
> The page cache can be directed to be written out and discarded with
> fadvise and such.

Good point. This discard flag might do the trick and let us keep things simple.
The major concern here is to keep the page cache disturbance relatively low.
Which of new page allocation or stealing back the pages has the lower overhead
would have to be determined with benchmarks.

So I would tend to simply use this discard fadvise with new page allocation for
now.

>
> You might also consider using direct IO.

Maybe. I'm not sure what it implies in the splice() context, though.

Thanks,

Mathieu

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com
From: Linus Torvalds


On Wed, 19 May 2010, Mathieu Desnoyers wrote:
>
> Good point. This discard flag might do the trick and let us keep things simple.
> The major concern here is to keep the page cache disturbance relatively low.
> Which of new page allocation or stealing back the pages has the lower overhead
> would have to be determined with benchmarks.

We could probably make it easier somehow to do the writeback-and-discard
thing, but I have had _very_ good experiences with even a rather trivial
file writer that basically used (iirc) 8MB windows, and the logic was very
simple:

- before writing a new 8M window, do "start writeback"
(SYNC_FILE_RANGE_WRITE) on the previous window, and do
a wait (SYNC_FILE_RANGE_WAIT_AFTER) on the window before that.

In fact, in its simplest form, you can do it like this (this is from my
"overwrite disk images" program that I use on old disks):

	for (index = 0; index < max_index; index++) {
		if (write(fd, buffer, BUFSIZE) != BUFSIZE)
			break;
		/* This won't block, but will start writeout asynchronously */
		sync_file_range(fd, index*BUFSIZE, BUFSIZE, SYNC_FILE_RANGE_WRITE);
		/* This does a blocking write-and-wait on any old ranges */
		if (index)
			sync_file_range(fd, (index-1)*BUFSIZE, BUFSIZE,
					SYNC_FILE_RANGE_WAIT_BEFORE |
					SYNC_FILE_RANGE_WRITE |
					SYNC_FILE_RANGE_WAIT_AFTER);
	}

And even if you don't actually do a discard (maybe we should add a
SYNC_FILE_RANGE_DISCARD bit; right now you'd need a separate
fadvise(FADV_DONTNEED) to throw the pages out), the system behavior is
pretty nice, because the heavy writer gets good IO performance _and_ leaves
only easy-to-free pages around after itself.
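
For what it's worth, that separate fadvise would slot in right after the
blocking sync_file_range() in the loop above; a sketch of the extra call
(POSIX_FADV_DONTNEED is the POSIX spelling of FADV_DONTNEED):

	/* The previous window is now clean on disk, so its page
	 * cache pages can be dropped immediately. */
	if (index)
		posix_fadvise(fd, (off_t)(index-1)*BUFSIZE, BUFSIZE,
			      POSIX_FADV_DONTNEED);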

Linus