block: fix leaks associated with discard request payload [Kernel]

Prev: [PATCH] net/Makefile: conditionally descend to wireless and ieee802154
Next: [PATCH -mm 1/2] scsi: remove dma_is_consistent usage in 53c700

From: FUJITA Tomonori on 27 Jun 2010 12:30

On Sun, 27 Jun 2010 17:35:45 +0200
Christoph Hellwig <hch(a)lst.de> wrote:

> On Sun, Jun 27, 2010 at 10:16:40AM -0400, Mike Snitzer wrote:
> > My leak fixes have been tested extensively against all permuations of
> > devices with discards (ATA trim, SCSI UNMAP, SCSI WRTIE SAME w/ unmap=1).
> >
> > I think we need to get Christoph's discard payload transformation
> > complete by fixing the leaks _without_ trying to rework how discard
> > commands are tagged, etc. E.g. fix what Jens already has staged in
> > linux-2.6-block's 'for-next' and 'for-2.6.36'.
> >
> > With that sorted out we can then look at longer term changes to cleanup
> > discard request processing.
>
> I tend to agree. I'll look into getting rid of treating discards as
> BLOCK_PC commands ontop of you patchset.

Let's start from jens' 2.6.36 tree (or it would be better if Jens can
drop Christoph's patch).

We don't need to start on the top of new discard hacks (specially, the
hack in scsi_finish_command looks terrible). We should not merge them
into mainline. We are still in rc3.

I'll see how things work with other configurations.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: FUJITA Tomonori on 28 Jun 2010 04:20

On Mon, 28 Jun 2010 09:57:38 +0200
Christoph Hellwig <hch(a)lst.de> wrote:

> On Sun, Jun 27, 2010 at 09:32:07PM +0900, FUJITA Tomonori wrote:
> > On Sun, 27 Jun 2010 13:07:12 +0200
> > Christoph Hellwig <hch(a)lst.de> wrote:
> >
> > > > How about this?
> > >
> > > As I tried to explain before this utterly confuses the I/O completion
> > > path. With the patch applied even a simple mkfs.xfs that issues discard
> > > just hangs.
> >
> > Wired. I just tried mkfs.xfs against scsi_debug with my block patches
> > (I saw one discard command). Seemed that it worked fine.
>
> I've tracked it down to the call to scsi_requeue_command in scsi_end_request.
> When the command is marked BLOCK_PC we'll just get it back as such in
> ->prep_fn next time, but now it's reverting to the previous state.

If scsi_end_request() calls scsi_requeue_command(), the command has a
left over (i.e. hasn't finished all the data), right? You hit such
condition with discard commands?

BLOCK_PC requests don't hit this case since blk_end_request() always
return false for PC.

> While I see the problems with leaking ressources in that case I still
> can't quite explain the hang I see.

Any way to reproduce the hang without ssd drives?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Boaz Harrosh on 29 Jun 2010 04:10

On 06/28/2010 08:16 PM, Martin K. Petersen wrote:
> I'm still dreaming of the day where libata moves out from under SCSI so
> we don't have to translate square pegs into round holes...
>

Hi Martin.

Is that on the Radar still? what would you say are the major issues holding
it back?

* Is it just the missing new ULD, Will we be using a new ULD?
* Is it the ata midlayer that would duplicate some of scsi-midlayer.
(Not so much these days, lots of good code went into blk_rq_)
* Is it the distro's setup procedures changing yet again back to the
hd vs sd times.

Thanks
Boaz
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Mike Snitzer on 29 Jun 2010 20:00

On Tue, Jun 29 2010 at 7:03pm -0400,
James Bottomley <James.Bottomley(a)suse.de> wrote:

> On Tue, 2010-06-29 at 18:28 -0400, Mikulas Patocka wrote:
> >
> > On Sun, 27 Jun 2010, James Bottomley wrote:
> >
> > > linux-scsi cc added, since it's a SCSI patch.
> > >
> > > On Sat, 2010-06-26 at 15:56 -0400, Mike Snitzer wrote:
> > > > Fix leaks introduced via "block: don't allocate a payload for discard
> > > > request" commit a1d949f5f44.
> > > >
> > > > sd_done() is not called for REQ_TYPE_BLOCK_PC commands so cleanup
> > > > discard request's payload directly in scsi_finish_command().
> > > >
> > > > Also cleanup page allocated for discard payload in
> > > > scsi_setup_discard_cmnd's scsi_setup_blk_pc_cmnd error path.
> > > >
> > > > Signed-off-by: Mike Snitzer <snitzer(a)redhat.com>
> > > > ---
> > > > block/blk-core.c | 23 +++++++++++++++++++++++
> > > > drivers/scsi/scsi.c | 8 ++++++++
> > > > drivers/scsi/sd.c | 18 ++++++++----------
> > > > include/linux/blkdev.h | 1 +
> > > > 4 files changed, 40 insertions(+), 10 deletions(-)
> > > >
> > > > diff --git a/block/blk-core.c b/block/blk-core.c
> > > > index 98b4cee..07925aa 100644
> > > > --- a/block/blk-core.c
> > > > +++ b/block/blk-core.c
> > > > @@ -1167,6 +1167,29 @@ void blk_add_request_payload(struct request *rq, struct page *page,
> > > > }
> > > > EXPORT_SYMBOL_GPL(blk_add_request_payload);
> > > >
> > > > +/**
> > > > + * blk_clear_request_payload - clear a request's payload
> > > > + * @rq: request to update
> > > > + *
> > > > + * The driver needs to take care of freeing the payload itself.
> > > > + */
> > > > +void blk_clear_request_payload(struct request *rq)
> > > > +{
> > > > + struct bio *bio = rq->bio;
> > > > +
> > > > + rq->__data_len = rq->resid_len = 0;
> > > > + rq->nr_phys_segments = 0;
> > > > + rq->buffer = NULL;
> > > > +
> > > > + bio->bi_size = 0;
> > > > + bio->bi_vcnt = 0;
> > > > + bio->bi_phys_segments = 0;
> > > > +
> > > > + bio->bi_io_vec->bv_page = NULL;
> > > > + bio->bi_io_vec->bv_len = 0;
> > > > +}
> > > > +EXPORT_SYMBOL_GPL(blk_clear_request_payload);
> > > > +
> > > > void init_request_from_bio(struct request *req, struct bio *bio)
> > > > {
> > > > req->cpu = bio->bi_comp_cpu;
> > > > diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c
> > > > index ad0ed21..69c7ea4 100644
> > > > --- a/drivers/scsi/scsi.c
> > > > +++ b/drivers/scsi/scsi.c
> > > > @@ -851,6 +851,14 @@ void scsi_finish_command(struct scsi_cmnd *cmd)
> > > > */
> > > > if (good_bytes == old_good_bytes)
> > > > good_bytes -= scsi_get_resid(cmd);
> > > > + } else if (cmd->request->cmd_flags & REQ_DISCARD) {
> > > > + /*
> > > > + * If this is a discard request that originated from the kernel
> > > > + * we need to free our payload here. Note that we need to check
> > > > + * the request flag as the normal payload rules apply for
> > > > + * pass-through UNMAP / WRITE SAME requests.
> > > > + */
> > > > + __free_page(bio_page(cmd->request->bio));
> > >
> > > This is another layering violation: the page is allocated in the Upper
> > > layer and freed in the mid-layer.
> > >
> > > I really hate these growing contortions for discard. They're a clear
> > > signal that we haven't implemented it right.
> > >
> > > So let's first work out how it should be done. I really like Tomo's
> > > idea of doing discard through the normal REQ_TYPE_FS route, which means
> > > we can control the setup in prep and the tear down in done, all confined
> > > to the ULD.
> > >
> > > Where I think I'm at is partially what Christoph says: The command
> > > transformation belongs in the ULD so that's where the allocation and
> > > deallocation should be, and partly what Tomo says in that we should
> > > eliminate the special case paths.
> > >
> > > The payload vs actual request size should be a red herring if we've got
> > > everything correct: only the ULD cares about the request parameters.
> > > Once we've got everything set up, the mid layer and LLD should only care
> > > about the parameters in the command, so we can confine the size changing
> > > part to the ULD doing the discard.
> > >
> > > Could someone take a stab at this and see if it works without layering
> > > violations or any other problematic signals?
> > >
> > > Thanks,
> > >
> > > James
> >
> > Well, I think that you overestimate the importance of scsi code too much.
>
> Not, I think, a deadly sin for a SCSI maintainer.

Indeed ;)

> > There is a layering violation in the code. So what --- you either fix the
> > layering violation or let it be there and grind your teeth on it. But in
> > either case, that layering violation won't affect anyone except scsi
> > developers.
>
> A layering violation is a signal of bad design wherever it occurs, so
> that wasn't a SCSI centric argument.
>
> > On the other hand, if you say "because we want to avoid layering violation
> > in SCSI, every issuer of discard request must supply an empty page", you
> > create havoc all over the Linux codebase. md, dm, drbd, xvd, virtio ---
> > whatever you think of, will be allocating a dummy page when constructing
> > a discard request.
>
> Since I didn't actually say any of that, I suggest you re-read text you
> quoted above. The phrase "The command transformation belongs in the ULD
> so that's where the allocation and deallocation should be" might be a
> relevant one to concentrate on.

Right, freeing the page, that was allocated in SCSI's ULD, from the SCSI
midlayer is a SCSI layering violation. I think Mikulas was reacting to
the desire to maintain the existing, arguably more problematic, layering
violation that spans the block and SCSI layers.

> > If the layering violation spans only scsi code, it can be eventually
> > fixed, but this, much worse "layering violation" that will be spanning all
> > block device midlayers, won't ever be fixed.
> >
> > Imagine for example --- a discard request arrivers at a dm-snapshot
> > device. The driver splits it into chunks, remaps each chunk to the
> > physical chunk, submits the requests, the elevator merges adjacent
> > requests and submits fewer bigger requests to the device. Now, if you had
> > to allocate a zeroed page each time you are splitting the request, that
> > would exhaust memory and burn cpu needlessly. You delete a 100MB file? ---
> > fine, allocate a 100MB of zeroed pages.
>
> This is a straw man: You've tried to portray a position I've never
> taken as mine then attack it ... with what is effectively another bogus
> argument.
>
> It's not an either/or choice. I've asked the relevant parties to
> combine the approaches and see if a REQ_TYPE_FS path that does the
> allocations in the appropriate place, likely the ULD, produces a good
> design.

If in the end we can fix up SCSI properly then everyone is happy. So
lets just keep working toward that. The various attempts to convert
discard over to REQ_TYPE_FS have fallen short. Hopefully we'll have a
break through shortly.

Thanks for your guidance James,
Mike
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Boaz Harrosh on 30 Jun 2010 04:40

On 06/27/2010 06:29 PM, James Bottomley wrote:
<snip>
>> + /*
>> + * If this is a discard request that originated from the kernel
>> + * we need to free our payload here. Note that we need to check
>> + * the request flag as the normal payload rules apply for
>> + * pass-through UNMAP / WRITE SAME requests.
>> + */
>> + __free_page(bio_page(cmd->request->bio));
>
> This is another layering violation: the page is allocated in the Upper
> layer and freed in the mid-layer.
>

May I ask a silly question? Why the dynamic allocation?

Why not have a const-static single global page at the block-layer somewhere
that will be used for all discard-type operations and be done with it once and
for all. A single page can be used for any size bio , any number of concurrent
discards, any ZERO needed operation. It can also be used by other operations
like padding and others. In fact isn't there one for the libsata padding?

(It could be dynamical allocated on first use for embedded system)

just my $0.017
Boaz
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

First | Prev | Next | Last
Pages: 1 2 3 4
Prev: [PATCH] net/Makefile: conditionally descend to wireless and ieee802154
Next: [PATCH -mm 1/2] scsi: remove dma_is_consistent usage in 53c700