Unexpected splice "always copy" behavior observed [Kernel]

Prev: [GIT Pull] hpet for 2.6.35
Next: [GIT PULL] percpu for v2.6.35-rc1

From: Mathieu Desnoyers on 18 May 2010 11:40

Hi,

I'm currently digging into the splice code to figure out why it's always in copy
mode even though I specified the SPLICE_F_MOVE flag and released the page
references from the LTTng ring buffer. I'm splicing to a pipe and then from the
pipe to an ext3 filesystem (2.6.33.4 kernel). I've got the feeling I'm missing
something and I don't like that.

My simple test case is to add a printk around the splice copy:

fs/splice.c: pipe_to_file()
if (buf->page != page) {
/*
* Careful, ->map() uses KM_USER0!
*/
char *src = buf->ops->map(pipe, buf, 1);
char *dst = kmap_atomic(page, KM_USER1);

printk(KERN_WARNING "SPLICE COPY!!!\n");
memcpy(dst + offset, src + buf->offset, this_len);
flush_dcache_page(page);
kunmap_atomic(dst, KM_USER1);
buf->ops->unmap(pipe, buf, src);
}

I'll start with a disclaimer that I only recently improved my splice
understanding, so AFAIU:

* pipe_to_file() allocates a struct page *page on its stack.

* It is passed, uninitialized, to

ret = pagecache_write_begin(file, mapping, sd->pos, this_len,
AOP_FLAG_UNINTERRUPTIBLE, &page, &fsdata);

that looks already odd to me, as I would expect pipe_to_file to populate
this page pointer with buf->page initially if the proper conditions are met.

* Looking at the ext2 and ext3 write_begin code, neither are using the pagep
parameter:

ext2:

static int
ext2_write_begin(struct file *file, struct address_space *mapping,
loff_t pos, unsigned len, unsigned flags,
struct page **pagep, void **fsdata)
{
*pagep = NULL;
return __ext2_write_begin(file, mapping, pos, len, flags, pagep,fsdata);
}

ext3:

static int ext3_write_begin(struct file *file, struct address_space *mapping,
loff_t pos, unsigned len, unsigned flags,
struct page **pagep, void **fsdata)
{
struct page *page;
....

retry:
page = grab_cache_page_write_begin(mapping, index, flags);
if (!page)
return -ENOMEM;
*pagep = page;

* So, considering the test to check if the page content must be copied:

if (buf->page != page) {

how is it ever possible that buf->page == page ?

Thanks,

Mathieu

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Steven Rostedt on 18 May 2010 12:00

Hehe, I just notice this this morning too, while investigating.

On Tue, 2010-05-18 at 11:34 -0400, Mathieu Desnoyers wrote:
> Hi,
>
> I'm currently digging into the splice code to figure out why it's always in copy
> mode even though I specified the SPLICE_F_MOVE flag and released the page
> references from the LTTng ring buffer. I'm splicing to a pipe and then from the
> pipe to an ext3 filesystem (2.6.33.4 kernel). I've got the feeling I'm missing
> something and I don't like that.
>
> My simple test case is to add a printk around the splice copy:
>
> fs/splice.c: pipe_to_file()
> if (buf->page != page) {
> /*
> * Careful, ->map() uses KM_USER0!
> */
> char *src = buf->ops->map(pipe, buf, 1);
> char *dst = kmap_atomic(page, KM_USER1);
>
> printk(KERN_WARNING "SPLICE COPY!!!\n");
> memcpy(dst + offset, src + buf->offset, this_len);
> flush_dcache_page(page);
> kunmap_atomic(dst, KM_USER1);
> buf->ops->unmap(pipe, buf, src);

I used trace_printk() since it is not as invasive.

> }
>
> I'll start with a disclaimer that I only recently improved my splice
> understanding, so AFAIU:

Same here ;-)

>
> * pipe_to_file() allocates a struct page *page on its stack.
>
> * It is passed, uninitialized, to
>
> ret = pagecache_write_begin(file, mapping, sd->pos, this_len,
> AOP_FLAG_UNINTERRUPTIBLE, &page, &fsdata);
>
> that looks already odd to me, as I would expect pipe_to_file to populate
> this page pointer with buf->page initially if the proper conditions are met.
>
> * Looking at the ext2 and ext3 write_begin code, neither are using the pagep
> parameter:
>
> ext2:
>
> static int
> ext2_write_begin(struct file *file, struct address_space *mapping,
> loff_t pos, unsigned len, unsigned flags,
> struct page **pagep, void **fsdata)
> {
> *pagep = NULL;
> return __ext2_write_begin(file, mapping, pos, len, flags, pagep,fsdata);
> }
>
>
> ext3:
>
> static int ext3_write_begin(struct file *file, struct address_space *mapping,
> loff_t pos, unsigned len, unsigned flags,
> struct page **pagep, void **fsdata)
> {
> struct page *page;
> ....
>
> retry:
> page = grab_cache_page_write_begin(mapping, index, flags);
> if (!page)
> return -ENOMEM;
> *pagep = page;
>
> * So, considering the test to check if the page content must be copied:
>
> if (buf->page != page) {
>
> how is it ever possible that buf->page == page ?

I'm currently looking at the network code to see if it is better.

-- Steve

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Christoph Lameter on 18 May 2010 12:00

On Wed, 19 May 2010, Nick Piggin wrote:

> What would be needed is to have filesystem maintainers go through and
> enable it on a case by case basis. It's trivial for tmpfs/ramfs type
> filesystems and I have a patch for those, but I never posted it on.yet.
> Even basic buffer head filesystems IIRC get a little more complex --
> but we may get some milage just out of invalidating the existing
> pagecache rather than getting fancy and trying to move buffers over
> to the new page.

There is a "migration" address space operation for moving pages. Page
migration requires that in order to be able to move dirty pages. Can
splice use that?

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Nick Piggin on 18 May 2010 12:00

Hi,

The basic problem is that the filesystem APIs were never designed with
this usage in mind, so we had to disable the SPLICE_F_MOVE support by
default.

So short answer is that this is expected.

What would be needed is to have filesystem maintainers go through and
enable it on a case by case basis. It's trivial for tmpfs/ramfs type
filesystems and I have a patch for those, but I never posted it on.yet.
Even basic buffer head filesystems IIRC get a little more complex --
but we may get some milage just out of invalidating the existing
pagecache rather than getting fancy and trying to move buffers over
to the new page.

Nick

On Tue, May 18, 2010 at 11:34:40AM -0400, Mathieu Desnoyers wrote:
> Hi,
>
> I'm currently digging into the splice code to figure out why it's always in copy
> mode even though I specified the SPLICE_F_MOVE flag and released the page
> references from the LTTng ring buffer. I'm splicing to a pipe and then from the
> pipe to an ext3 filesystem (2.6.33.4 kernel). I've got the feeling I'm missing
> something and I don't like that.
>
> My simple test case is to add a printk around the splice copy:
>
> fs/splice.c: pipe_to_file()
> if (buf->page != page) {
> /*
> * Careful, ->map() uses KM_USER0!
> */
> char *src = buf->ops->map(pipe, buf, 1);
> char *dst = kmap_atomic(page, KM_USER1);
>
> printk(KERN_WARNING "SPLICE COPY!!!\n");
> memcpy(dst + offset, src + buf->offset, this_len);
> flush_dcache_page(page);
> kunmap_atomic(dst, KM_USER1);
> buf->ops->unmap(pipe, buf, src);
> }
>
> I'll start with a disclaimer that I only recently improved my splice
> understanding, so AFAIU:
>
> * pipe_to_file() allocates a struct page *page on its stack.
>
> * It is passed, uninitialized, to
>
> ret = pagecache_write_begin(file, mapping, sd->pos, this_len,
> AOP_FLAG_UNINTERRUPTIBLE, &page, &fsdata);
>
> that looks already odd to me, as I would expect pipe_to_file to populate
> this page pointer with buf->page initially if the proper conditions are met.
>
> * Looking at the ext2 and ext3 write_begin code, neither are using the pagep
> parameter:
>
> ext2:
>
> static int
> ext2_write_begin(struct file *file, struct address_space *mapping,
> loff_t pos, unsigned len, unsigned flags,
> struct page **pagep, void **fsdata)
> {
> *pagep = NULL;
> return __ext2_write_begin(file, mapping, pos, len, flags, pagep,fsdata);
> }
>
>
> ext3:
>
> static int ext3_write_begin(struct file *file, struct address_space *mapping,
> loff_t pos, unsigned len, unsigned flags,
> struct page **pagep, void **fsdata)
> {
> struct page *page;
> ....
>
> retry:
> page = grab_cache_page_write_begin(mapping, index, flags);
> if (!page)
> return -ENOMEM;
> *pagep = page;
>
> * So, considering the test to check if the page content must be copied:
>
> if (buf->page != page) {
>
> how is it ever possible that buf->page == page ?
>
> Thanks,
>
> Mathieu
>
> --
> Mathieu Desnoyers
> Operating System Efficiency R&D Consultant
> EfficiOS Inc.
> http://www.efficios.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Nick Piggin on 18 May 2010 12:10

On Tue, May 18, 2010 at 10:56:24AM -0500, Christoph Lameter wrote:
> On Wed, 19 May 2010, Nick Piggin wrote:
>
> > What would be needed is to have filesystem maintainers go through and
> > enable it on a case by case basis. It's trivial for tmpfs/ramfs type
> > filesystems and I have a patch for those, but I never posted it on.yet.
> > Even basic buffer head filesystems IIRC get a little more complex --
> > but we may get some milage just out of invalidating the existing
> > pagecache rather than getting fancy and trying to move buffers over
> > to the new page.
>
> There is a "migration" address space operation for moving pages. Page
> migration requires that in order to be able to move dirty pages. Can
> splice use that?

Hmm yes I didn't think of that, it probably could.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

| Next | Last
Pages: 1 2 3 4 5 6 7
Prev: [GIT Pull] hpet for 2.6.35
Next: [GIT PULL] percpu for v2.6.35-rc1