From: Rick Sherm on
Hello,

I'm trying to measure the performance gain from using splice. For now I'm copying a 1G file using splice. (In the real scenario, the driver will DMA the data into a buffer that is mmap'd. The app will then write the newly DMA'd data to disk while some other thread crunches the same buffer. The buffer is guaranteed not to be modified. To avoid copying, I was thinking of: splice the mmap'd buffer IN to a pipe, then splice OUT from the pipe to the file.)
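
For reference, here is a rough sketch of the mmap'd-buffer path I have in mind: vmsplice the user pages into a pipe, then splice the pipe out to the file. The names (buf_to_file, fd_out, out_off) are placeholders and error handling is minimal; it's just to show the intended data flow.

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/uio.h>
#include <unistd.h>

#define CHUNK (64 * 1024)   /* default pipe capacity: 16 pages */

/* Push an existing user buffer to fd_out without read()/write() copies. */
static int buf_to_file(const char *buf, size_t len, int fd_out, loff_t *out_off)
{
    int pfd[2];
    ssize_t in, out;
    struct iovec iov;

    if (pipe(pfd) < 0)
        return -1;

    while (len) {
        iov.iov_base = (void *)buf;
        iov.iov_len  = len < CHUNK ? len : CHUNK; /* stay <= pipe capacity */

        /* user pages -> pipe */
        in = vmsplice(pfd[1], &iov, 1, 0);
        if (in < 0)
            return -1;
        buf += in;
        len -= in;

        /* pipe -> destination file */
        while (in) {
            out = splice(pfd[0], NULL, fd_out, out_off, in,
                         SPLICE_F_MORE | SPLICE_F_MOVE);
            if (out < 0)
                return -1;
            in -= out;
        }
    }
    close(pfd[0]);
    close(pfd[1]);
    return 0;
}

(Note: without SPLICE_F_GIFT the pipe-to-file step will presumably still copy into the page cache, which is related to Q3 below, but at least the extra read() copy on the app side is gone.)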

PS - I've inlined some sloppy code that I cooked up.

Case 1) read from input_file and write to dest_file via splice (both opened with O_DIRECT so no buffer cache should be involved, but that doesn't work; we can talk about the buffer cache later).

(csh#)time ./splice_to_splice

0.004u 1.451s 0:02.16 67.1% 0+0k 2097152+2097152io 0pf+0w

#define KILO_BYTE (1024)
#define PIPE_SIZE (64 * KILO_BYTE)

int filedes[2];

pipe(filedes);

fd_from = open(filename_from, (O_RDWR | O_LARGEFILE | O_DIRECT), 0777);
fd_to   = open(filename_to, (O_WRONLY | O_CREAT | O_LARGEFILE | O_DIRECT), 0777);

/* 1G file == 2048 * 512K blocks */
to_write = 2048 * 512 * KILO_BYTE;

while (to_write) {
    /* source file -> pipe */
    ret = splice(fd_from, &from_offset, filedes[1], NULL, PIPE_SIZE,
                 SPLICE_F_MORE | SPLICE_F_MOVE);
    if (ret < 0) {
        printf("Error: LINE:%d ret:%d\n", __LINE__, ret);
        goto error;
    } else {
        /* pipe -> destination file */
        ret = splice(filedes[0], NULL, fd_to,
                     &to_offset, PIPE_SIZE /* should be ret, but ... */,
                     SPLICE_F_MORE | SPLICE_F_MOVE);
        if (ret < 0) {
            printf("Error: LINE:%d ret:%d\n", __LINE__, ret);
            goto error;
        }
        to_write -= ret;
    }
}

Case 2) directly reading and writing:

Case 2.1) copy 64K blocks

(csh#)time ./file_to_file 64
0.015u 1.066s 0:04.04 26.4% 0+0k 2097152+2097152io 0pf+0w

#define KILO_BYTE (1024)
#define MEGA_BYTE (1024 * (KILO_BYTE))
#define BUFF_SIZE (64 * MEGA_BYTE)

posix_memalign((void **)&buff, 4096, BUFF_SIZE);

fd_from = open(filename_from, (O_RDWR | O_LARGEFILE | O_DIRECT), 0777);
fd_to   = open(filename_to, (O_WRONLY | O_CREAT | O_LARGEFILE | O_DIRECT), 0777);

/* 1G file == 2048 * 512K blocks */
to_write = 2048 * 512 * KILO_BYTE;
copy_size = cmd_line_input * KILO_BYTE; /* block size, controlled from the command line */

while (to_write) {
    ret = read(fd_from, buff, copy_size);
    if (ret != copy_size) {
        printf("Error: LINE:%d ret:%d\n", __LINE__, ret);
        goto error;
    } else {
        ret = write(fd_to, buff, copy_size);
        if (ret != copy_size) {
            printf("Error: LINE:%d ret:%d\n", __LINE__, ret);
            goto error;
        }
        to_write -= ret;
    }
}

Case 2.2) copy 512K blocks

(csh#)time ./file_to_file 512
0.004u 0.306s 0:01.86 16.1% 0+0k 2097152+2097152io 0pf+0w


Case 2.3) copy 1M blocks
time ./file_to_file 1024
0.000u 0.240s 0:01.88 12.7% 0+0k 2097152+2097152io 0pf+0w


Questions:
Q1) When using splice, why is the CPU consumption greater than with read/write (case 2.1)? What does this indicate?

Q2) How do I confirm that memory bandwidth consumption does not spike when using splice in this case? By this I mean the (node) CPU<->memory path. The DMA-in/DMA-out will happen regardless; you can't escape that, and the IOH bus will be utilized. But I want to keep the CPU(node)-memory path free (well, minimize unnecessary copies).

Q3) When using splice, even though the destination file is opened in O_DIRECT mode, the data still gets cached. I verified this using vmstat:

r b swpd free buff cache
1 0 0 9358820 116576 2100904

../splice_to_splice

r b swpd free buff cache
2 0 0 7228908 116576 4198164

I see the same caching issue even if I vmsplice buffers (a simple malloc'd iovec) to a pipe and then splice the pipe to a file. Speed is still an issue with vmsplice too.

Q4) Also, using splice you can only transfer 64K worth of data (PIPE_BUFFERS * PAGE_SIZE) at a time, correct? But using stock read/write I can go up to a 1MB buffer. Beyond that I don't see any gain, but the reduction in system/CPU time is still significant.

I would appreciate any pointers.


thanks
Rick





From: Rick Sherm on
Hello Jens,

--- On Fri, 4/23/10, Jens Axboe <jens.axboe(a)oracle.com> wrote:
> I still have patches pending for this, making the pipe
> buffer count
> settable from user space:
>
> http://git.kernel.dk/?p=linux-2.6-block.git;a=commit;h=24547ac4d97bebb58caf9ce58bd507a95c812a3f
>
> Let me know if you want to give it a spin on a recent
> kernel, and I'll
> update it.
>

I think we need to adjust PIPE_BUFFERS in default_file_splice_read() as well, correct?


> Jens Axboe

Thanks




From: Rick Sherm on
Hello Jens - any assistance/pointers on 1) and 2) below
would be great. I'm willing to test out any sample patch.

Steve,

--- On Wed, 4/21/10, Steven J. Magnani <steve(a)digidescorp.com> wrote:
> Hi Rick,
>
> On Fri, 2010-04-16 at 10:02 -0700, Rick Sherm wrote:
> > Q3) When using splice, even though the destination
> file is opened in O_DIRECT mode, the data gets cached. I
> verified it using vmstat.
> >
> > r b swpd free buff cache
> > 1 0 0 9358820 116576 2100904
> >
> > ./splice_to_splice
> >
> > r b swpd free buff cache
> > 2 0 0 7228908 116576 4198164
> >
> > I see the same caching issue even if I vmsplice
> buffers(simple malloc'd iov) to a pipe and then splice the
> pipe to a file. The speed is still an issue with vmsplice
> too.
> >
>
> One thing is that O_DIRECT is a hint; not all filesystems
> bypass the cache. I'm pretty sure ext2 does, and I know fat doesn't.
>
> Another variable is whether (and how) your filesystem
> implements the splice_write file operation. The generic one (pipe_to_file)
> in fs/splice.c copies data to pagecache. The default one goes
> out to vfs_write() and might stand more of a chance of honoring
> O_DIRECT.
>

True. I guess I should have looked harder. It's xfs, and its file_ops point to 'generic_file_splice_read[write]'. Last time I had to 'fdatasync' and then fadvise to mimic 'O_DIRECT'.
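
Something like this, for reference (fd_to and bytes_written are from the test program above; this is just the flush-then-drop I mean, not claiming it is equivalent to real O_DIRECT):

#include <fcntl.h>
#include <unistd.h>

/* Flush the file's dirty pages, then drop the cached range. */
static void mimic_odirect(int fd_to, off_t bytes_written)
{
    fdatasync(fd_to);
    posix_fadvise(fd_to, 0, bytes_written, POSIX_FADV_DONTNEED);
}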

> > Q4) Also, using splice, you can only transfer 64K
> worth of data(PIPE_BUFFERS*PAGE_SIZE) at a time,correct?.But
> using stock read/write, I can go upto 1MB buffer. After that
> I don't see any gain. But still the reduction in system/cpu
> time is significant.
>
> I'm not a splicing expert but I did spend some time
> recently trying to
> improve FTP reception by splicing from a TCP socket to a
> file. I found that while splicing avoids copying packets to userland,
> that gain is more than offset by a large increase in calls into the
> storage stack.It's especially bad with TCP sockets because a typical
> packet has, say,1460 bytes of data. Since splicing works on PIPE_BUFFERS
> pages at a time, and packet pages are only about 35% utilized, each
> cycle to userland I could only move 23 KiB of data at most. Some
> similar effect may be in play in your case.
>

Agreed, increasing the number of calls will offset the benefit.
But what if:
1) We were to increase PIPE_BUFFERS from 16 to 64, or some other value?
What are the implications in other parts of the kernel?
2) There was a way to find out when the DMA out of/into the initial buffers that were passed is complete, so that we are free to recycle them? A callback would be helpful. Obviously, the user-space app will have to manage its buffers, but at least we'd be guaranteed that the buffers can be recycled (in other words, no worrying about modifying in-flight data that is being DMA'd).

> Regards,
> Steven J. Magnani

regards
++Rick





From: Steven J. Magnani on
On Fri, 2010-04-23 at 09:07 -0700, Rick Sherm wrote:
> Hello Jens - any assistance/pointers on 1) and 2) below
> will be great.I'm willing to test out any sample patch.

Recent mail from him has come from jens.axboe(a)oracle.com, so I cc'd it.

>
> Steve,
>
> --- On Wed, 4/21/10, Steven J. Magnani <steve(a)digidescorp.com> wrote:
> > Hi Rick,
> >
> > On Fri, 2010-04-16 at 10:02 -0700, Rick Sherm wrote:
> > > Q3) When using splice, even though the destination
> > file is opened in O_DIRECT mode, the data gets cached. I
> > verified it using vmstat.
> > >
> > > r b swpd free buff cache
> > > 1 0 0 9358820 116576 2100904
> > >
> > > ./splice_to_splice
> > >
> > > r b swpd free buff cache
> > > 2 0 0 7228908 116576 4198164
> > >
> > > I see the same caching issue even if I vmsplice
> > buffers(simple malloc'd iov) to a pipe and then splice the
> > pipe to a file. The speed is still an issue with vmsplice
> > too.
> > >
> >
> > One thing is that O_DIRECT is a hint; not all filesystems
> > bypass the cache. I'm pretty sure ext2 does, and I know fat doesn't.
> >
> > Another variable is whether (and how) your filesystem
> > implements the splice_write file operation. The generic one (pipe_to_file)
> > in fs/splice.c copies data to pagecache. The default one goes
> > out to vfs_write() and might stand more of a chance of honoring
> > O_DIRECT.
> >
>
> True.I guess I should have looked harder. It's xfs and xfs's->file_ops points to 'generic_file_splice_read[write]'.Last time I had to 'fdatasync' and then fadvise to mimic 'O_DIRECT'.
>
> > > Q4) Also, using splice, you can only transfer 64K
> > worth of data(PIPE_BUFFERS*PAGE_SIZE) at a time,correct?.But
> > using stock read/write, I can go upto 1MB buffer. After that
> > I don't see any gain. But still the reduction in system/cpu
> > time is significant.
> >
> > I'm not a splicing expert but I did spend some time
> > recently trying to
> > improve FTP reception by splicing from a TCP socket to a
> > file. I found that while splicing avoids copying packets to userland,
> > that gain is more than offset by a large increase in calls into the
> > storage stack.It's especially bad with TCP sockets because a typical
> > packet has, say,1460 bytes of data. Since splicing works on PIPE_BUFFERS
> > pages at a time, and packet pages are only about 35% utilized, each
> > cycle to userland I could only move 23 KiB of data at most. Some
> > similar effect may be in play in your case.
> >
>
> Agreed,increasing number of calls will offset the benefit.
> But what if:
> 1)We were to increase the PIPE_BUFFERS from '16' to '64' or 'some value'?
> What are the implications in the other parts of the kernel?

This came up recently. One problem is that there are a couple of kernel
functions having up to 3 stack-based arrays of dimension PIPE_BUFFERS, so
the stack cost of increasing PIPE_BUFFERS can be quite high. I've
thought it might be nice if there were some mechanism for userland apps
to request larger PIPE_BUFFERS values, but I haven't pursued
this line of thought to see if it's practical.

> 2)There was a way to find out if the DMA-out/in from the initial buffer's that were passed are complete so that we are free to recycle them? Callback would be helpful.Obviously, the user-space-app will have to manage it's buffers but atleast we are guranteed that the buffers can be recycled(in other words no worrying about modifying in-flight data that is being DMA'd).

It's a neat idea, but it would probably be much easier (and less
invasive) to try this sort of pipelining in userland using a ring buffer
or ping-pong approach. I'm actually in the middle of something like this
with FTP, where I will have a reader thread that puts data from the
network into a ring buffer, from which a writer thread moves it to a
file.
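
A minimal sketch of what I mean, with hypothetical names (single reader / single writer, pthread-based; not the actual FTP code):

#include <pthread.h>
#include <stddef.h>

#define NSLOTS  4
#define SLOTSZ  (256 * 1024)

struct slot {
    char   data[SLOTSZ];
    size_t len;                 /* bytes valid in data[] */
};

static struct slot ring[NSLOTS];
static unsigned head;           /* next slot the reader will fill */
static unsigned tail;           /* next slot the writer will drain */
static unsigned filled;         /* slots published but not yet released */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  not_full  = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  not_empty = PTHREAD_COND_INITIALIZER;

/* Reader: wait for a free slot, fill it outside the lock, then publish. */
struct slot *reader_claim(void)
{
    pthread_mutex_lock(&lock);
    while (filled == NSLOTS)
        pthread_cond_wait(&not_full, &lock);
    pthread_mutex_unlock(&lock);
    return &ring[head];
}

void reader_publish(void)
{
    pthread_mutex_lock(&lock);
    head = (head + 1) % NSLOTS;
    filled++;
    pthread_cond_signal(&not_empty);
    pthread_mutex_unlock(&lock);
}

/* Writer: wait for a filled slot, write it out, then release it. */
struct slot *writer_claim(void)
{
    pthread_mutex_lock(&lock);
    while (filled == 0)
        pthread_cond_wait(&not_empty, &lock);
    pthread_mutex_unlock(&lock);
    return &ring[tail];
}

void writer_release(void)
{
    pthread_mutex_lock(&lock);
    tail = (tail + 1) % NSLOTS;
    filled--;
    pthread_cond_signal(&not_full);
    pthread_mutex_unlock(&lock);
}

The reader thread recv()s into reader_claim()->data and calls reader_publish(); the writer thread write()s from writer_claim() and calls writer_release(). The network and disk I/O then overlap without the kernel having to tell you when a buffer is safe to reuse.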

------------------------------------------------------------------------
Steven J. Magnani "I claim this network for MARS!
www.digidescorp.com Earthling, return my space modulator!"

#include <standard.disclaimer>


From: Jens Axboe on
On Fri, Apr 23 2010, Steven J. Magnani wrote:
> On Fri, 2010-04-23 at 09:07 -0700, Rick Sherm wrote:
> > Hello Jens - any assistance/pointers on 1) and 2) below
> > will be great.I'm willing to test out any sample patch.
>
> Recent mail from him has come from jens.axboe(a)oracle.com, I cc'd it.

Goes to the same inbox in the end, so no difference :-)

> > > On Fri, 2010-04-16 at 10:02 -0700, Rick Sherm wrote:
> > > > Q3) When using splice, even though the destination
> > > file is opened in O_DIRECT mode, the data gets cached. I
> > > verified it using vmstat.
> > > >
> > > > r b swpd free buff cache
> > > > 1 0 0 9358820 116576 2100904
> > > >
> > > > ./splice_to_splice
> > > >
> > > > r b swpd free buff cache
> > > > 2 0 0 7228908 116576 4198164
> > > >
> > > > I see the same caching issue even if I vmsplice
> > > buffers(simple malloc'd iov) to a pipe and then splice the
> > > pipe to a file. The speed is still an issue with vmsplice
> > > too.
> > > >
> > >
> > > One thing is that O_DIRECT is a hint; not all filesystems
> > > bypass the cache. I'm pretty sure ext2 does, and I know fat doesn't.
> > >
> > > Another variable is whether (and how) your filesystem
> > > implements the splice_write file operation. The generic one (pipe_to_file)
> > > in fs/splice.c copies data to pagecache. The default one goes
> > > out to vfs_write() and might stand more of a chance of honoring
> > > O_DIRECT.
> > >
> >
> > True.I guess I should have looked harder. It's xfs and xfs's->file_ops points to 'generic_file_splice_read[write]'.Last time I had to 'fdatasync' and then fadvise to mimic 'O_DIRECT'.
> >
> > > > Q4) Also, using splice, you can only transfer 64K
> > > worth of data(PIPE_BUFFERS*PAGE_SIZE) at a time,correct?.But
> > > using stock read/write, I can go upto 1MB buffer. After that
> > > I don't see any gain. But still the reduction in system/cpu
> > > time is significant.
> > >
> > > I'm not a splicing expert but I did spend some time
> > > recently trying to
> > > improve FTP reception by splicing from a TCP socket to a
> > > file. I found that while splicing avoids copying packets to userland,
> > > that gain is more than offset by a large increase in calls into the
> > > storage stack.It's especially bad with TCP sockets because a typical
> > > packet has, say,1460 bytes of data. Since splicing works on PIPE_BUFFERS
> > > pages at a time, and packet pages are only about 35% utilized, each
> > > cycle to userland I could only move 23 KiB of data at most. Some
> > > similar effect may be in play in your case.
> > >
> >
> > Agreed,increasing number of calls will offset the benefit.
> > But what if:
> > 1)We were to increase the PIPE_BUFFERS from '16' to '64' or 'some value'?
> > What are the implications in the other parts of the kernel?
>
> This came up recently, one problem is that there a couple of kernel
> functions having up to 3 stack-based arrays of dimension PIPE_BUFFER. So
> the stack cost of increasing PIPE_BUFFERS can be quite high. I've
> thought it might be nice if there was some mechanism for userland apps
> to be able to request larger PIPE_BUFFERS values, but I haven't pursued
> this line of thought to see if it's practical.

I still have patches pending for this, making the pipe buffer count
settable from user space:

http://git.kernel.dk/?p=linux-2.6-block.git;a=commit;h=24547ac4d97bebb58caf9ce58bd507a95c812a3f

Let me know if you want to give it a spin on a recent kernel, and I'll
update it.
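
From the application side, usage would look something like this (assuming the interface stays as the F_SETPIPE_SZ/F_GETPIPE_SZ fcntls from that patch, with the size given in bytes and rounded up by the kernel):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int pfd[2];
    long size;

    if (pipe(pfd) < 0)
        return 1;

    /* Ask for a 1MB pipe instead of the default 64K. */
    if (fcntl(pfd[1], F_SETPIPE_SZ, 1024 * 1024) < 0)
        perror("F_SETPIPE_SZ");

    size = fcntl(pfd[1], F_GETPIPE_SZ);
    printf("pipe capacity is now %ld bytes\n", size);

    close(pfd[0]);
    close(pfd[1]);
    return 0;
}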

> > 2)There was a way to find out if the DMA-out/in from the initial buffer's that were passed are complete so that we are free to recycle them? Callback would be helpful.Obviously, the user-space-app will have to manage it's buffers but atleast we are guranteed that the buffers can be recycled(in other words no worrying about modifying in-flight data that is being DMA'd).
>
> It's a neat idea, but it would probably be much easier (and less
> invasive) to try this sort of pipelining in userland using a ring buffer
> or ping-pong approach. I'm actually in the middle of something like this
> with FTP, where I will have a reader thread that puts data from the
> network into a ring buffer, from which a writer thread moves it to a
> file.

See vmsplice.c from the splice test tools:

http://brick.kernel.dk/snaps/splice-git-latest.tar.gz

--
Jens Axboe
