block: fix leaks associated with discard request payload [Kernel]


From: Boaz Harrosh on 30 Jun 2010 06:30

On 06/30/2010 11:42 AM, Christoph Hellwig wrote:
> On Wed, Jun 30, 2010 at 11:32:43AM +0300, Boaz Harrosh wrote:
>> May I ask a silly question? Why the dynamic allocation?
>>
>> Why not have a const-static single global page at the block-layer somewhere
>> that will be used for all discard-type operations and be done with it once and
>> for all. A single page can be used for any size bio, any number of concurrent
>> discards, any ZERO needed operation. It can also be used by other operations
>> like padding and others. In fact isn't there one for the libsata padding?
>
> for UNMAP we need to write into the payload. And for ATA TRIM we need
> to write into the WRITE SAME payload.

OK, thanks, I see. Is it one of those operations (like we have in OSD) where
the CDB information spills into the payload, like the scatter-gather and extent
lists and such? Do we actually use a WRITE_SAME that is not zero? For what use?

> That's another layering violation
> for those looking for them, btw..
>

Agreed
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Boaz Harrosh on 30 Jun 2010 07:00

On 06/30/2010 01:41 PM, Christoph Hellwig wrote:
> On Wed, Jun 30, 2010 at 01:25:01PM +0300, Boaz Harrosh wrote:
>> OK, thanks, I see. Is it one of those operations (like we have in OSD) where
>> the CDB information spills into the payload, like the scatter-gather and extent
>> lists and such?
>
> For UNMAP the payload is a list of block number / length pairs, while
> the CDB itself doesn't contain any information like that. It's a rather
> awkward command.
>

How big can that be? Could we maybe use the sense_buffer, which is already
properly allocated?
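
To make the size question concrete, here is a minimal user-space sketch of the
UNMAP parameter list as SBC-3 lays it out: an 8-byte header followed by one
16-byte block descriptor per LBA range. The helper names (put_be16, put_be32,
put_be64, build_unmap_payload) are hypothetical and only illustrate the
layout; this is not the kernel's code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define UNMAP_HDR_LEN  8   /* parameter list header */
#define UNMAP_DESC_LEN 16  /* one block descriptor: LBA, count, reserved */

/* SCSI fields are big-endian; these helpers are illustrative only. */
static void put_be16(uint16_t v, uint8_t *p)
{
	p[0] = (uint8_t)(v >> 8);
	p[1] = (uint8_t)v;
}

static void put_be32(uint32_t v, uint8_t *p)
{
	for (int i = 0; i < 4; i++)
		p[i] = (uint8_t)(v >> (24 - 8 * i));
}

static void put_be64(uint64_t v, uint8_t *p)
{
	for (int i = 0; i < 8; i++)
		p[i] = (uint8_t)(v >> (56 - 8 * i));
}

/*
 * Hypothetical helper: fill buf with an UNMAP parameter list that
 * describes a single range. Returns the number of bytes used.
 */
static size_t build_unmap_payload(uint8_t *buf, size_t buf_len,
				  uint64_t lba, uint32_t nr_blocks)
{
	const size_t total = UNMAP_HDR_LEN + UNMAP_DESC_LEN;

	assert(buf_len >= total);
	memset(buf, 0, total);

	/* UNMAP DATA LENGTH counts the bytes that follow this field. */
	put_be16((uint16_t)(total - 2), &buf[0]);
	/* UNMAP BLOCK DESCRIPTOR DATA LENGTH: one descriptor. */
	put_be16(UNMAP_DESC_LEN, &buf[2]);
	/* Bytes 4..7 are reserved; the descriptor starts at offset 8. */
	put_be64(lba, &buf[8]);
	put_be32(nr_blocks, &buf[16]);
	/* Bytes 20..23 of the descriptor are reserved. */
	return total;
}
```

With this layout a single range needs 24 bytes, and each additional range
adds another 16.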

>> Do we actually use a WRITE_SAME which is not zero? for what use?
>
> The kernel doesn't issue any WRITE SAME without the unmap bit set.

So if the unmap bit is set then the page can just be zero, right?

I still think a static zero page is a worthwhile optimization, and
block drivers with special needs can take care of them with a private mem_pool
or something. For the discard-type user and the generic block layer the
page is just an implementation-specific residue, no?
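
For reference, a minimal sketch of a WRITE SAME(16) CDB with the UNMAP bit
set, following the SBC-3 layout (opcode 0x93, UNMAP in bit 3 of byte 1). The
function name build_write_same16_unmap is hypothetical. It covers the CDB
only; as noted earlier in the thread, the ATA TRIM path writes into the
WRITE SAME payload, so a shared read-only zero page would not cover that case.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define WRITE_SAME_16_OPCODE 0x93
#define WRITE_SAME_UNMAP_BIT 0x08  /* byte 1, bit 3 */
#define CDB16_LEN            16

/*
 * Hypothetical helper: build a WRITE SAME(16) CDB that asks the device
 * to unmap nr_blocks logical blocks starting at lba. The data-out
 * buffer (one logical block) is not handled here.
 */
static void build_write_same16_unmap(uint8_t cdb[CDB16_LEN],
				     uint64_t lba, uint32_t nr_blocks)
{
	/* A zero block count has special meaning in SBC; avoid it here. */
	assert(nr_blocks > 0);

	memset(cdb, 0, CDB16_LEN);
	cdb[0] = WRITE_SAME_16_OPCODE;
	cdb[1] = WRITE_SAME_UNMAP_BIT;

	/* Bytes 2..9: starting LBA, big-endian. */
	for (int i = 0; i < 8; i++)
		cdb[2 + i] = (uint8_t)(lba >> (56 - 8 * i));

	/* Bytes 10..13: number of logical blocks, big-endian. */
	for (int i = 0; i < 4; i++)
		cdb[10 + i] = (uint8_t)(nr_blocks >> (24 - 8 * i));

	/* Byte 14 (group number) and byte 15 (control) stay zero. */
}
```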

But don't mind me, I'm just babbling. Not that I'll do anything about it.
Boaz

From: FUJITA Tomonori on 30 Jun 2010 08:00

On Mon, 28 Jun 2010 17:25:36 +0200
Christoph Hellwig <hch(a)lst.de> wrote:

> On Mon, Jun 28, 2010 at 05:14:28PM +0900, FUJITA Tomonori wrote:
> > > While I see the problems with leaking resources in that case I still
> > > can't quite explain the hang I see.
> >
> > Any way to reproduce the hang without ssd drives?
>
> Actually the SSDs don't fully hang, they just cause lots of I/O errors
> and hit the error handler hard. The hard hang is when running under
> qemu. Apply the patch below, then create an if=scsi drive that resides
> on an XFS filesystem, and you'll have scsi TP support in the guest:

Ok, I figured out what's wrong.

As I suspected, it's due to the partial completion.

The qemu scsi driver reports that the WRITE_SAME command was successful, but
somehow the command has a resid. So we retry it again and again (and
leak some memory).

I don't know yet why the qemu scsi driver is broken. Maybe there is a bug
in it, or converting discard to FS sends broken commands to the driver.

I'll try to figure it out tomorrow.

I've put up a patch that completes discard commands in an all-or-nothing
manner:

git://git.kernel.org/pub/scm/linux/kernel/git/tomo/linux-2.6-misc.git discard

At least, the guest kernel doesn't hang for me.
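
The difference between the two completion styles can be sketched with a toy
user-space model. Nothing below is kernel code: struct toy_request and the
three functions are hypothetical names, and the model assumes the device
reports success with a resid equal to the whole transfer length, which is one
plausible reading of the qemu behaviour described above.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a request; not a kernel structure. */
struct toy_request {
	unsigned int remaining; /* bytes still outstanding */
	unsigned int retries;   /* how often the request was requeued */
};

/*
 * Byte-granular completion: only (remaining - resid) bytes are
 * credited, and the request is requeued while bytes remain.
 * Returns true once the request is finished.
 */
static bool complete_partial(struct toy_request *rq, unsigned int resid)
{
	assert(resid <= rq->remaining);
	rq->remaining = resid;
	if (rq->remaining) {
		rq->retries++;
		return false;
	}
	return true;
}

/*
 * All-or-nothing completion: the status alone decides. A successful
 * command finishes the whole request and resid is ignored; a failed
 * one ends the request with an error. Either way nothing is requeued.
 */
static bool complete_all_or_nothing(struct toy_request *rq, bool success)
{
	if (success)
		rq->remaining = 0;
	return true;
}

/*
 * Drive the byte-granular path against a device that always reports
 * resid == full length. The retry cap exists only so the model
 * terminates; without it the loop would never make progress.
 */
static unsigned int run_partial(unsigned int len, unsigned int max_retries)
{
	struct toy_request rq = { .remaining = len, .retries = 0 };

	while (!complete_partial(&rq, rq.remaining) &&
	       rq.retries < max_retries)
		;
	return rq.retries;
}
```

In this model the byte-granular path makes no progress and keeps requeueing,
while the all-or-nothing path finishes after a single completion.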

From: Mike Snitzer on 30 Jun 2010 08:20

On Wed, Jun 30 2010 at 6:57am -0400,
Boaz Harrosh <bharrosh(a)panasas.com> wrote:

> On 06/30/2010 01:41 PM, Christoph Hellwig wrote:
> > On Wed, Jun 30, 2010 at 01:25:01PM +0300, Boaz Harrosh wrote:
> >> OK, thanks, I see. Is it one of those operations (like we have in OSD) where
> >> the CDB information spills into the payload, like the scatter-gather and extent
> >> lists and such?
> >
> > For UNMAP the payload is a list of block number / length pairs, while
> > the CDB itself doesn't contain any information like that. It's a rather
> > awkward command.
> >
>
> How big can that be? Could we maybe use the sense_buffer, which is already
> properly allocated?
>
> >> Do we actually use a WRITE_SAME which is not zero? for what use?
> >
> > The kernel doesn't issue any WRITE SAME without the unmap bit set.
>
> So if the unmap bit is set then the page can just be zero, right?
>
> I still think a static zero page is a worthwhile optimization, and
> block drivers with special needs can take care of them with a private mem_pool
> or something. For the discard-type user and the generic block layer the
> page is just an implementation-specific residue, no?

Why should the block layer have any role in managing this page? Block
layer doesn't care about it, SCSI does.

Mike

From: FUJITA Tomonori on 1 Jul 2010 00:30

On Wed, 30 Jun 2010 20:55:09 +0900
FUJITA Tomonori <fujita.tomonori(a)lab.ntt.co.jp> wrote:

> On Mon, 28 Jun 2010 17:25:36 +0200
> Christoph Hellwig <hch(a)lst.de> wrote:
>
> > On Mon, Jun 28, 2010 at 05:14:28PM +0900, FUJITA Tomonori wrote:
> > > > While I see the problems with leaking resources in that case I still
> > > > can't quite explain the hang I see.
> > >
> > > Any way to reproduce the hang without ssd drives?
> >
> > Actually the SSDs don't fully hang, they just cause lots of I/O errors
> > and hit the error handler hard. The hard hang is when running under
> > qemu. Apply the patch below, then create an if=scsi drive that resides
> > on an XFS filesystem, and you'll have scsi TP support in the guest:
>
> Ok, I figured out what's wrong.
>
> As I suspected, it's due to the partial completion.
>
> The qemu scsi driver reports that the WRITE_SAME command was successful, but
> somehow the command has a resid. So we retry it again and again (and
> leak some memory).
>
> I don't know yet why the qemu scsi driver is broken. Maybe there is a bug
> in it, or converting discard to FS sends broken commands to the driver.

Looks like your qemu WRITE_SAME patch isn't complete :)

You implement WRITE_SAME as if it doesn't do any data transfer, so the
qemu scsi driver gets a resid.

The reason WRITE_SAME works now is that scsi-ml doesn't care about
resid with PC commands, but it does care with FS commands.

I confirmed that the qemu scsi driver gets the identical command with both
PC and FS commands, and that qemu calls xfsctl.

> I've put up a patch that completes discard commands in an all-or-nothing
> manner:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tomo/linux-2.6-misc.git discard

It seems that I have finished the discard FS conversion. I'll update it on
top of James' uprep patchset soon.
