From: Hugh Dickins on
On Wed, Aug 4, 2010 at 2:22 PM, Nigel Cunningham <nigel(a)tuxonice.net> wrote:
> On 04/08/10 22:44, Mark Lord wrote:
>>
>> Looks to me like more and more things are using the block discard
>> functionality, and as predicted it is slowing things down enormously.
>>
>> The problem is that we still only discard tiny bits (a single range
>> still??)
>> per TRIM command, rather than batching larger ranges and larger numbers
>> of ranges into single TRIM commands.
>>
>> That's a very poor implementation, especially when things start enabling
>> it by default. Eg. the swap code, mke2fs, etc..
>>
>> Ugh.

swap has been discarding since 2.6.29, one 1MB range at a time.
There's been no significant change at the swap end since then, but I
guess more devices have been announcing themselves as nonrotational
and supporting discard, and the implementation lower down has gone
through a number of changes.

>
> I was hoping for a nice quick and simple answer. Since I haven't got one,
> I'll try to find time to do a git bisect. I think I'll also look at the swap
> code more carefully and see if it's doing the sensible thing. I can't (at
> the moment) see the logic behind calling discard when allocating swap. At
> freeing time makes much more sense to me.

I agree it would make more sense to discard swap when freeing rather
than when allocating, I wish we could. But at the freeing point we're
often holding a page_table spinlock at an outer level, and it's just
one page we're given to free. Freeing is an operation you want to be
comfortable doing when you're short of resources, whereas discard is a
kind of I/O operation which needs resources.

It happens that in the allocation path, there was already a place at
which we scanned for a cluster of 1MB free (I'm thinking of 4kB pages
when I say 1MB), so that was the neatest point at which to site the
discard - though even there we have to be careful about racing
allocations.
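
To illustrate the idea, here is a minimal userspace sketch (the names and
structure are invented for illustration, not the actual mm/swapfile.c code):
the allocator scans for a fully free run of 256 slots (1MB of 4kB pages),
issues one discard for the whole cluster, and only then hands slots out, so
the discard rides on a scan the allocation path does anyway.

#include <stdio.h>
#include <string.h>

#define SLOTS         4096		/* toy swap area: 16MB of 4kB slots */
#define CLUSTER_SLOTS  256		/* 1MB worth of 4kB pages           */

static unsigned char slot_in_use[SLOTS];

/* stand-in for a discard against the swap device */
static void issue_discard(unsigned long first_slot, unsigned long nr_slots)
{
	printf("discard slots %lu..%lu\n", first_slot, first_slot + nr_slots - 1);
}

/*
 * Scan for a completely free 1MB cluster, discard it once, then start
 * allocating from it - the discard is sited at the cluster scan which
 * the allocation path performs anyway, not at every single-page free.
 */
static long alloc_from_fresh_cluster(void)
{
	unsigned long base, i;

	for (base = 0; base + CLUSTER_SLOTS <= SLOTS; base += CLUSTER_SLOTS) {
		for (i = 0; i < CLUSTER_SLOTS; i++)
			if (slot_in_use[base + i])
				break;
		if (i == CLUSTER_SLOTS) {
			issue_discard(base, CLUSTER_SLOTS);	/* one TRIM per 1MB */
			slot_in_use[base] = 1;			/* hand out the first slot */
			return (long)base;
		}
	}
	return -1;					/* no fully free cluster */
}

int main(void)
{
	memset(slot_in_use, 0, sizeof(slot_in_use));
	printf("allocated slot %ld\n", alloc_from_fresh_cluster());
	return 0;
}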

I did once try to go back and get it to work when freeing instead of
allocating, gathering the swap slots up then freeing when convenient.
It was messy, didn't work very well, and didn't show an improvement in
performance (on what we were testing at the time).

I've not been able to test swap, with SSDs, for several months: that's
a dreadful regression you've found, thanks a lot for reporting it:
I'll be very interested to hear where you locate the cause. If it
needs changes to the way swap does discard, so be it.

Hugh
From: Nigel Cunningham on
Hi Hugh.

Thanks for the email.

On 05/08/10 13:58, Hugh Dickins wrote:
> On Wed, Aug 4, 2010 at 2:22 PM, Nigel Cunningham<nigel(a)tuxonice.net> wrote:
>> On 04/08/10 22:44, Mark Lord wrote:
>>>
>>> Looks to me like more and more things are using the block discard
>>> functionality, and as predicted it is slowing things down enormously.
>>>
>>> The problem is that we still only discard tiny bits (a single range
>>> still??)
>>> per TRIM command, rather than batching larger ranges and larger numbers
>>> of ranges into single TRIM commands.
>>>
>>> That's a very poor implementation, especially when things start enabling
>>> it by default. Eg. the swap code, mke2fs, etc..
>>>
>>> Ugh.
>
> swap has been discarding since 2.6.29, one 1MB range at a time.
> There's been no significant change at the swap end since then, but I
> guess more devices have been announcing themselves as nonrotational
> and supporting discard, and the implementation lower down has gone
> through a number of changes.

Okay; that's good to know.

>>
>> I was hoping for a nice quick and simple answer. Since I haven't got one,
>> I'll try to find time to do a git bisect. I think I'll also look at the swap
>> code more carefully and see if it's doing the sensible thing. I can't (at
>> the moment) see the logic behind calling discard when allocating swap. At
>> freeing time makes much more sense to me.
>
> I agree it would make more sense to discard swap when freeing rather
> than when allocating, I wish we could. But at the freeing point we're
> often holding a page_table spinlock at an outer level, and it's just
> one page we're given to free. Freeing is an operation you want to be
> comfortable doing when you're short of resources, whereas discard is a
> kind of I/O operation which needs resources.
>
> It happens that in the allocation path, there was already a place at
> which we scanned for a cluster of 1MB free (I'm thinking of 4kB pages
> when I say 1MB), so that was the neatest point at which to site the
> discard - though even there we have to be careful about racing
> allocations.

Makes sense when you put it like that :)

I know it's a bit messier, but would it be possible for us to modify the
behaviour depending on the reason for the allocation? (No page_table
spinlock holding when we're hibernating).

The issue isn't as noticeable with [u]swsusp at the moment because
they're allocating swap as the image is being written. If my current set
of patches for Rafael gets accepted, that will change (swap will be
preallocated).

TuxOnIce always allocates all available storage since there's (usually)
virtually zero cost of doing so and it then doesn't matter how much the
drivers allocate when we do the atomic copy, or how good a compression
ratio is achieved. That's what I'm aiming for in my patches for [u]swsusp.

> I did once try to go back and get it to work when freeing instead of
> allocating, gathering the swap slots up then freeing when convenient.
> It was messy, didn't work very well, and didn't show an improvement in
> performance (on what we were testing at the time).

For one or two at a time, I can see that would be the case. If it is
possible to do the discard of pages used for hibernation after we're
finished reading the image, that would be good. Even better would be to
only do the discard for pages that were actually used and just do a
simple free for ones that were only allocated.

Of course I'm talking in ideals without having an intimate knowledge of
the swap allocation code or exactly how ugly the above would make it :)

> I've not been able to test swap, with SSDs, for several months: that's
> a dreadful regression you've found, thanks a lot for reporting it:
> I'll be very interested to hear where you locate the cause. If it
> needs changes to the way swap does discard, so be it.

I'm traveling to the US on Saturday and have apparently been given one
of those nice seats with power, so I'll try and get the bisection done then.

TTFN

Nigel
From: Hugh Dickins on
On Wed, Aug 4, 2010 at 11:28 PM, Nigel Cunningham <nigel(a)tuxonice.net> wrote:
> On 05/08/10 13:58, Hugh Dickins wrote:
>> I agree it would make more sense to discard swap when freeing rather
>> than when allocating, I wish we could.  But at the freeing point we're
>> often holding a page_table spinlock at an outer level, and it's just
>> one page we're given to free.  Freeing is an operation you want to be
>> comfortable doing when you're short of resources, whereas discard is a
>> kind of I/O operation which needs resources.
>>
>> It happens that in the allocation path, there was already a place at
>> which we scanned for a cluster of 1MB free (I'm thinking of 4kB pages
>> when I say 1MB), so that was the neatest point at which to site the
>> discard - though even there we have to be careful about racing
>> allocations.
>
> Makes sense when you put it like that :)
>
> I know it's a bit messier, but would it be possible for us to modify the
> behaviour depending on the reason for the allocation? (No page_table
> spinlock holding when we're hibernating).

But if we moved it to the swap free, it would occur in the swap free
(if any) _prior_ to hibernating, when testing for hibernation would
just say "no".

>> I did once try to go back and get it to work when freeing instead of
>> allocating, gathering the swap slots up then freeing when convenient.
>> It was messy, didn't work very well, and didn't show an improvement in
>> performance (on what we were testing at the time).

When I tried gathering together the frees, there were just too many
short extents to make the discards worth doing that way.

>
> For one or two at a time, I can see that would be the case. If it is
> possible to do the discard of pages used for hibernation after we're
> finished reading the image, that would be good. Even better would be to only
> do the discard for pages that were actually used and just do a simple free
> for ones that were only allocated.

There are optimizations which could be done e.g. we discard the whole
of swap at swapon time, then re-discard each 1MB as we begin to
allocate from it. Clearly that has a certain stupidity to it! But
the initial discard of the whole of swap should be efficient and worth
doing. We could keep track of which swap pages have already been
discarded since last used, but that would take up another... it's not
immediately clear to me whether it's another value or another bit of
the swap count.
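
A minimal sketch of one way such tracking could look - a per-cluster
"discarded since last used" flag, entirely invented here rather than taken
from the swap code: swapon does one big discard and marks everything clean,
and allocation only re-discards a cluster that has been dirtied since.

#include <stdbool.h>
#include <stdio.h>

#define CLUSTERS 16				/* toy swap area: 16 x 1MB clusters */

struct cluster_state {
	bool in_use;				/* holds live swap pages            */
	bool discarded;				/* discarded since it was last used */
};

static struct cluster_state cluster[CLUSTERS];

static void issue_discard(int c)
{
	printf("discard cluster %d\n", c);
	cluster[c].discarded = true;
}

/* swapon: one efficient discard of the whole area, everything marked clean */
static void swapon_discard_all(void)
{
	printf("discard whole swap area\n");
	for (int c = 0; c < CLUSTERS; c++) {
		cluster[c].in_use = false;
		cluster[c].discarded = true;
	}
}

/* allocation: only re-discard a cluster that has been dirtied since swapon */
static void alloc_cluster(int c)
{
	if (!cluster[c].discarded)
		issue_discard(c);
	cluster[c].in_use = true;
	cluster[c].discarded = false;		/* it is about to be written */
}

int main(void)
{
	swapon_discard_all();
	alloc_cluster(3);			/* still clean: no second discard   */
	cluster[3].in_use = false;		/* later freed without a discard    */
	alloc_cluster(3);			/* dirty now: re-discard before use */
	return 0;
}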

We could provide an interface for hibernation, to do a minimal number
of maximal range discards to suit hibernation (but we need to be very
careful about ongoing allocation, as in another thread).
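
For the "minimal number of maximal range discards" idea, a small sketch of
the range coalescing (again invented for illustration, ignoring the racing
allocations mentioned above): walk the slot map once and emit one discard
per maximal run of free slots, rather than one per slot.

#include <stdio.h>

#define SLOTS 32

/* 1 = slot free and eligible for discard, 0 = slot still holds image data */
static const int slot_free[SLOTS] = {
	1,1,1,0,0,1,1,1,1,0,1,0,0,0,1,1,
	1,1,1,1,1,0,0,1,1,1,1,1,1,1,0,1,
};

static void issue_discard(int start, int len)
{
	printf("discard slots %d..%d (%d slots)\n", start, start + len - 1, len);
}

/* one pass over the map, one discard per maximal run of free slots */
int main(void)
{
	int start = -1;

	for (int i = 0; i <= SLOTS; i++) {
		if (i < SLOTS && slot_free[i]) {
			if (start < 0)
				start = i;		/* a run begins   */
		} else if (start >= 0) {
			issue_discard(start, i - start);
			start = -1;			/* that run ended */
		}
	}
	return 0;
}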

But are these things worth doing? I think not. Discard is supposed
to be helping not hindering: if the device only supports discard in a
way that's so inefficient, we'd do better to blacklist it as not
supporting discard at all. My suspicion is that your SSD is of that
kind: that it used not to be recognized as supporting discard, but now
in 2.6.35 it is so recognized. However, that's just a suspicion: let
me not slander your SSD when it may be my code or someone else's to
blame: needs testing.

>
> Of course I'm talking in ideals without having an intimate knowledge of the
> swap allocation code or exactly how ugly the above would make it :)
>
>> I've not been able to test swap, with SSDs, for several months: that's
>> a dreadful regression you've found, thanks a lot for reporting it:
>> I'll be very interested to hear where you locate the cause.  If it
>> needs changes to the way swap does discard, so be it.
>
> I'm traveling to the US on Saturday and have apparently been given one of
> those nice seats with power, so I'll try and get the bisection done then.

That would be helpful, but displays greater dedication than I'd offer myself!

Hugh
From: Nigel Cunningham on
Hi.

On 06/08/10 11:15, Hugh Dickins wrote:
> On Wed, Aug 4, 2010 at 11:28 PM, Nigel Cunningham<nigel(a)tuxonice.net> wrote:
>> On 05/08/10 13:58, Hugh Dickins wrote:
>>> I agree it would make more sense to discard swap when freeing rather
>>> than when allocating, I wish we could. But at the freeing point we're
>>> often holding a page_table spinlock at an outer level, and it's just
>>> one page we're given to free. Freeing is an operation you want to be
>>> comfortable doing when you're short of resources, whereas discard is a
>>> kind of I/O operation which needs resources.

The more I think about this, the more it seems to me that doing the
discard at allocation time is just wrong. How about a strategy in which
you do the discard immediately if resources permit, or flag it as in
need of doing (at a future swap free of that page or swap off time) if
things are too pressured at the moment? I haven't put thought into what
data structures could be used for that - just want to ask for now if
you'd be happy with the idea of looking into a strategy like that.
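
In case it helps the discussion, a minimal sketch of the kind of strategy I
mean (all names invented, nothing here is the actual swap code): discard
immediately when resources permit, otherwise record a per-cluster "discard
owed" flag and pay it off at swapoff or some later convenient moment.

#include <stdbool.h>
#include <stdio.h>

#define CLUSTERS 8

static bool discard_owed[CLUSTERS];		/* "discard still owed" per cluster */

static void issue_discard(int c)
{
	printf("discard cluster %d\n", c);
	discard_owed[c] = false;
}

/* stand-in for a real check: under pressure we must not start extra I/O */
static bool resources_permit(void)
{
	return true;
}

/* free path: discard now if we safely can, otherwise just record the debt */
static void free_cluster(int c)
{
	if (resources_permit())
		issue_discard(c);
	else
		discard_owed[c] = true;
}

/* swapoff, or any later convenient moment: pay off the deferred discards */
static void flush_deferred_discards(void)
{
	for (int c = 0; c < CLUSTERS; c++)
		if (discard_owed[c])
			issue_discard(c);
}

int main(void)
{
	free_cluster(2);
	flush_deferred_discards();
	return 0;
}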

>>> It happens that in the allocation path, there was already a place at
>>> which we scanned for a cluster of 1MB free (I'm thinking of 4kB pages
>>> when I say 1MB), so that was the neatest point at which to site the
>>> discard - though even there we have to be careful about racing
>>> allocations.
>>
>> Makes sense when you put it like that :)
>>
>> I know it's a bit messier, but would it be possible for us to modify the
>> behaviour depending on the reason for the allocation? (No page_table
>> spinlock holding when we're hibernating).
>
> But if we moved it to the swap free, it would occur in the swap free
> (if any) _prior_ to hibernating, when testing for hibernation would
> just say "no".

I see what you mean. I'm not so worried by that because if we're having
to free swap in order to hibernate, things are going to be slow anyway
and discards aren't going to stand out nearly so much.

I'm much more concerned by the fact that we're currently doing discards
for swap that wasn't even being used. If I reboot, swapon and then
hibernate without having touched swap so far, I still get the 'hit' on
the whole 4GB. I don't think we should be storing on disk a "this hasn't
been used since last discarded" flag, so I'm looking for other solutions.

>>> I did once try to go back and get it to work when freeing instead of
>>> allocating, gathering the swap slots up then freeing when convenient.
>>> It was messy, didn't work very well, and didn't show an improvement in
>>> performance (on what we were testing at the time).
>
> When I tried gathering together the frees, there were just too many
> short extents to make the discards worth doing that way.

Okay.

>> For one or two at a time, I can see that would be the case. If it is
>> possible to do the discard of pages used for hibernation after we're
>> finished reading the image, that would be good. Even better would be to only
>> do the discard for pages that were actually used and just do a simple free
>> for ones that were only allocated.
>
> There are optimizations which could be done e.g. we discard the whole
> of swap at swapon time, then re-discard each 1MB as we begin to
> allocate from it. Clearly that has a certain stupidity to it! But
> the initial discard of the whole of swap should be efficient and worth
> doing. We could keep track of which swap pages have already been
> discarded since last used, but that would take up another... it's not
> immediately clear to me whether it's another value or another bit of
> the swap count.
>
> We could provide an interface for hibernation, to do a minimal number
> of maximal range discards to suit hibernation (but we need to be very
> careful about ongoing allocation, as in another thread).
>
> But are these things worth doing? I think not. Discard is supposed
> to be helping not hindering: if the device only supports discard in a
> way that's so inefficient, we'd do better to blacklist it as not
> supporting discard at all. My suspicion is that your SSD is of that
> kind: that it used not to be recognized as supporting discard, but now
> in 2.6.35 it is so recognized. However, that's just a suspicion: let
> me not slander your SSD when it may be my code or someone else's to
> blame: needs testing.

I've done the bisect now - spent the time today instead of on the plane,
and it took me to fbbf055692aeb25c54c49d9ca84532de836fbba0. This is
Christoph Hellwig's patch "block: fix DISCARD_BARRIER requests".

Could the problem be that we're using DISCARD_FL_BARRIER too much? (I'm
still looking at the code here, so am writing without having thought
everything through _really_ carefully.)

>> Of course I'm talking in ideals without having an intimate knowledge of the
>> swap allocation code or exactly how ugly the above would make it :)
>>
>>> I've not been able to test swap, with SSDs, for several months: that's
>>> a dreadful regression you've found, thanks a lot for reporting it:
>>> I'll be very interested to hear where you locate the cause. If it
>>> needs changes to the way swap does discard, so be it.
>>
>> I'm traveling to the US on Saturday and have apparently been given one of
>> those nice seats with power, so I'll try and get the bisection done then.
>
> That would be helpful, but displays greater dedication than I'd offer myself!

Done without an airplane now. I don't have to be credited with too much
dedication! :)

Nigel
From: Hugh Dickins on
On Thu, Aug 5, 2010 at 9:40 PM, Nigel Cunningham <nigel(a)tuxonice.net> wrote:
> On 06/08/10 11:15, Hugh Dickins wrote:
>> On Wed, Aug 4, 2010 at 11:28 PM, Nigel Cunningham<nigel(a)tuxonice.net>
>>  wrote:
>>>>> On 05/08/10 13:58, Hugh Dickins wrote:
>>>>
>>>> I agree it would make more sense to discard swap when freeing rather
>>>> than when allocating, I wish we could.  But at the freeing point we're
>>>> often holding a page_table spinlock at an outer level, and it's just
>>>> one page we're given to free.  Freeing is an operation you want to be
>>>> comfortable doing when you're short of resources, whereas discard is a
>>>> kind of I/O operation which needs resources.
>
> The more I think about this, the more it seems to me that doing the discard
> at allocation time is just wrong. How about a strategy in which you do the
> discard immediately if resources permit, or flag it as in need of doing (at
> a future swap free of that page or swap off time) if things are too
> pressured at the moment? I haven't put thought into what data structures
> could be used for that - just want to ask for now if you'd be happy with the
> idea of looking into a strategy like that.

We're going round in circles here. I have already agreed with you
that doing it at swap free time seems more natural, but explained that
there are implementation difficulties there, so doing it at allocation
time proved both much less messy and more efficient. I can imagine
advances that would sway me to putting in more effort there
(duplicating the scan for a free 1MB every time a page of swap is
freed? doesn't sound efficient to me, but if it saves us from an
inefficiency of too many or too late discards, perhaps it could work
out). However, a recently introduced regression does not make a
strong case for adding in hacks - not yet anyway.

> I've done the bisect now - spent the time today instead of on the plane, and

That's great, many thanks!

> it took me to fbbf055692aeb25c54c49d9ca84532de836fbba0. This is Christoph
> Hellwig's patch "block: fix DISCARD_BARRIER requests".

That's
block: fix DISCARD_BARRIER requests

Filesystems assume that DISCARD_BARRIER are full barriers, so that they
don't have to track in-progress discard operation when submitting new I/O.
But currently we only treat them as elevator barriers, which don't
actually do the necessary queue drains.

where the discard barrier changes from REQ_SOFTBARRIER to REQ_HARDBARRIER.

If REQ_SOFTBARRIER means that the device is still free to reorder a
write, which was issued after discard completion was reported, before
the discard (so later discarding the data written), then certainly I
agree with Christoph (now Cc'ed) that the REQ_HARDBARRIER is
unavoidable there; but if not, then it's not needed for the swap case.
I hope to gain a little more enlightenment on such barriers shortly.

What does seem over the top to me, is for mm/swapfile.c's
blkdev_issue_discard()s to be asking for both BLKDEV_IFL_WAIT and
BLKDEV_IFL_BARRIER: those swap discards were originally written just
to use barriers, without needing to wait for completion in there. I'd
be interested to hear if cutting out the BLKDEV_IFL_WAITs makes the
swap discards behave acceptably again for you - but understand that
you won't have a chance to try that until later next week.
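
For reference, the swap-side call in that era looks roughly like the sketch
below - a paraphrase, not verbatim mm/swapfile.c, and the wrapper function
and its "wait" parameter are invented purely to show which flag would be
dropped in the experiment:

/*
 * Rough paraphrase of the 2.6.35-era swap discard call, for illustration
 * only; swap_discard_range() and its "wait" argument are invented here.
 */
#include <linux/blkdev.h>

static int swap_discard_range(struct block_device *bdev,
			      sector_t start_block, sector_t nr_blocks,
			      bool wait)
{
	unsigned long flags = BLKDEV_IFL_BARRIER;

	/*
	 * The swap code currently passes BLKDEV_IFL_WAIT | BLKDEV_IFL_BARRIER;
	 * the suggestion is to issue the barrier discard without waiting on
	 * its completion, as the swap discards were originally written to do.
	 */
	if (wait)
		flags |= BLKDEV_IFL_WAIT;

	return blkdev_issue_discard(bdev, start_block, nr_blocks,
				    GFP_KERNEL, flags);
}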

Thanks,
Hugh