From: Nigel Cunningham on
Hi all.

I've just given hibernation a go under 2.6.35, and at first I thought
there was some sort of hang in freezing processes. The computer sat
there for aaaaaages, apparently doing nothing. I switched from TuxOnIce to
swsusp to see whether the problem was specific to my code, but no - it was
there too. I used the nifty new kdb support to get a backtrace, which was:

get_swap_page_of_type
discard_swap_cluster
blkdev_issue_discard
wait_for_completion
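
For reference, the call that ends in that wait_for_completion() looks roughly
like this inside discard_swap_cluster() (a sketch only - the extent walking is
omitted, and the GFP flag and variable names are from memory rather than the
actual 2.6.35 source):

	/* Sketch: one synchronous, barriered discard per swap extent covering
	 * the 256-page cluster; BLKDEV_IFL_WAIT is what produces the
	 * wait_for_completion() seen in the backtrace. */
	blkdev_issue_discard(si->bdev, start_block, nr_blocks, GFP_NOIO,
			     BLKDEV_IFL_WAIT | BLKDEV_IFL_BARRIER);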

Adding a printk in discard_swap_cluster() gives the following:

[ 46.758330] Discarding 256 pages from bdev 800003 beginning at page 640377.
[ 47.003363] Discarding 256 pages from bdev 800003 beginning at page 640633.
[ 47.246514] Discarding 256 pages from bdev 800003 beginning at page 640889.

....

[ 221.877465] Discarding 256 pages from bdev 800003 beginning at page 826745.
[ 222.121284] Discarding 256 pages from bdev 800003 beginning at page 827001.
[ 222.365908] Discarding 256 pages from bdev 800003 beginning at page 827257.
[ 222.610311] Discarding 256 pages from bdev 800003 beginning at page 827513.

So allocating 4GB of swap on my SSD now takes 176 seconds instead of
virtually no time at all. (The swap allocation code is completely unchanged
from 2.6.34.)
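
For reference, the printk was along these lines (a reconstruction at the top
of discard_swap_cluster() in mm/swapfile.c; the exact variable names are
assumed rather than copied from the patch):

	/* Reconstruction of the debugging aid: log each cluster discard with
	 * its size, backing device and starting swap page. */
	printk(KERN_DEBUG "Discarding %lu pages from bdev %x beginning at page %lu.\n",
	       nr_pages, si->bdev->bd_dev, start_page);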

I have a couple of questions:

1) As far as I can see, there haven't been any changes in mm/swapfile.c
that would cause this slowdown, so something in the block layer has
(from my point of view) regressed. Is this a known issue?

2) Why are we calling discard_swap_cluster anyway? The swap was unused
and we're allocating it. I could understand calling it when freeing
swap, but when allocating?

Regards,

Nigel
From: Martin K. Petersen on
>>>>> "Mark" == Mark Lord <kernel(a)teksavvy.com> writes:

Mark> Looks to me like more and more things are using the block discard
Mark> functionality, and as predicted it is slowing things down
Mark> enormously.

Mark> The problem is that we still only discard tiny bits (a single
Mark> range still??) per TRIM command, rather than batching larger
Mark> ranges and larger numbers of ranges into single TRIM commands.

Mark> That's a very poor implementation, especially when things start
Mark> enabling it by default, e.g. the swap code, mke2fs, etc.

I'm working on aggregation. But it's harder than we initially
thought...
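
To illustrate the kind of aggregation being discussed (just a sketch in plain
C, not the actual block-layer work - the structures and helper below are made
up for the example): gather the freed ranges, sort them, and merge any that
touch or overlap, so that one TRIM can cover a whole merged range instead of
a single tiny fragment.

#include <stdlib.h>

/* A pending discard range, in sectors (hypothetical helper type). */
struct range {
	unsigned long long start;
	unsigned long long len;
};

static int cmp_start(const void *a, const void *b)
{
	const struct range *x = a, *y = b;

	return (x->start > y->start) - (x->start < y->start);
}

/* Sort the pending ranges and merge those that touch or overlap, returning
 * the new count; one discard would then be issued per merged range rather
 * than per original fragment. */
static size_t coalesce(struct range *r, size_t n)
{
	size_t i, out = 0;

	if (n == 0)
		return 0;
	qsort(r, n, sizeof(*r), cmp_start);
	for (i = 1; i < n; i++) {
		if (r[i].start <= r[out].start + r[out].len) {
			/* extend the current merged range */
			unsigned long long end = r[i].start + r[i].len;

			if (end > r[out].start + r[out].len)
				r[out].len = end - r[out].start;
		} else {
			r[++out] = r[i];
		}
	}
	return out + 1;
}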

--
Martin K. Petersen Oracle Linux Engineering
From: Hugh Dickins on
On Fri, Aug 13, 2010 at 4:54 AM, Christoph Hellwig <hch@infradead.org> wrote:
> On Fri, Aug 06, 2010 at 03:07:25PM -0700, Hugh Dickins wrote:
>> If REQ_SOFTBARRIER means that the device is still free to reorder a
>> write, which was issued after discard completion was reported, before
>> the discard (so later discarding the data written), then certainly I
>> agree with Christoph (now Cc'ed) that the REQ_HARDBARRIER is
>> unavoidable there; but if not, then it's not needed for the swap case.
>>  I hope to gain a little more enlightenment on such barriers shortly.
>
> REQ_SOFTBARRIER is indeed purely a reordering barrier inside the block
> elevator.
>
>> What does seem over the top to me, is for mm/swapfile.c's
>> blkdev_issue_discard()s to be asking for both BLKDEV_IFL_WAIT and
>> BLKDEV_IFL_BARRIER: those swap discards were originally written just
>> to use barriers, without needing to wait for completion in there.  I'd
>> be interested to hear if cutting out the BLKDEV_IFL_WAITs makes the
>> swap discards behave acceptably again for you - but understand that
>> you won't have a chance to try that until later next week.
>
> That does indeed look incorrect to me.  Any kind of explicit waits
> usually mean the caller provides ordering.  Getting rid of
> BLKDEV_IFL_BARRIER in the swap code ASAP would indeed be beneficial
> given that we are trying to get rid of hard barriers completely soon.
> Auditing the existing blkdev_issue_discard callers in filesystems
> is high on the todo list for me.

Yes.

Above I was suggesting that Nigel experiment with cutting out swap
discard's BLKDEV_IFL_WAITs - and the results of cutting those out but
leaving its BLKDEV_IFL_BARRIERs would still be interesting. But after
digesting the LSF discussion and the email thread that led up to it, I
came to the same conclusion as you: going forward we want to keep
its BLKDEV_IFL_WAITs (swapfile.c already provides all the other
synchronization for that to fit into - things like never freeing swap
while it's still under writeback) and simply remove its
BLKDEV_IFL_BARRIERs.
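
Concretely, taking one of discard_swap_cluster()'s calls as the example (a
sketch only - the surrounding code and GFP flag are approximate), the two
variants are:

	/* (a) Sketch of the experiment suggested above: keep the barrier,
	 * drop the wait. */
	blkdev_issue_discard(si->bdev, start_block, nr_blocks, GFP_NOIO,
			     BLKDEV_IFL_BARRIER);

	/* (b) Sketch of the direction agreed here for going forward: keep the
	 * wait, drop the barrier, relying on swapfile.c's own synchronization. */
	blkdev_issue_discard(si->bdev, start_block, nr_blocks, GFP_NOIO,
			     BLKDEV_IFL_WAIT);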

However, I am still not quite sure that we can already make that
change for 2.6.35 (-stable). Can you reassure me on the question I
raise above: if we issue a discard to a device with cache, wait for
"completion", then issue a write into the area spanned by that
discard, can we be certain that the write to backing store will not be
reordered before the discard of backing store (unless the device is
just broken)? Without a REQ_HARDBARRIER in the 2.6.35 scheme? It
seems a very reasonable assumption to me, but I'm learning not to
depend upon reasonable assumptions here. (By the way, it doesn't
matter at all whether writes not spanned by the discard pass it or
not.)

Hugh