vmscan: delegate pageout io to flusher thread if current is kswapd [Kernel]


From: Suleiman Souhlal on 15 Apr 2010 19:50

On Thu, Apr 15, 2010 at 4:33 PM, Dave Chinner <david(a)fromorbit.com> wrote:
> On Thu, Apr 15, 2010 at 10:27:09AM -0700, Suleiman Souhlal wrote:
>>
>> On Apr 15, 2010, at 2:32 AM, Dave Chinner wrote:
>>
>> >On Thu, Apr 15, 2010 at 01:05:57AM -0700, Suleiman Souhlal wrote:
>> >>
>> >>On Apr 14, 2010, at 9:11 PM, KOSAKI Motohiro wrote:
>> >>
>> >>>Currently, vmscan's pageout() is one source of IO throughput degradation.
>> >>>Some IO workloads do a great many order-0 allocations and reclaims,
>> >>>and pageout's 4K IOs cause an annoying number of seeks.
>> >>>
>> >>>At least kswapd can avoid such pageout() calls, because kswapd doesn't
>> >>>need to consider the OOM-killer situation, so there is no risk.
>> >>>
>> >>>Signed-off-by: KOSAKI Motohiro <kosaki.motohiro(a)jp.fujitsu.com>
>> >>
>> >>What's your opinion on trying to cluster the writes done by pageout,
>> >>instead of not doing any paging out in kswapd?
>> >
>> >XFS already does this in ->writepage to try to minimise the impact
>> >of the way pageout issues IO. It helps, but it is still not as good
>> >as having all the writeback come from the flusher threads because
>> >it's still pretty much random IO.
>>
>> Doesn't the randomness become irrelevant if you can cluster enough
>> pages?
>
> No. If you are doing full disk seeks between random chunks, then you
> still lose a large amount of throughput. e.g. if the seek time is
> 10ms and your IO time is 10ms for each 4k page, then increasing the
> size to 64k makes it 10ms seek and 12ms for the IO. We might increase
> throughput but we are still limited to 100 IOs per second. We've
> gone from 400kB/s to 6MB/s, but that's still an order of magnitude
> short of the 100MB/s that full size IOs with little in the way of seeks
> between them will achieve on the same spindle...

What I meant was that, theoretically speaking, you could increase the
maximum number of pages that get clustered so that you could get
100MB/s, although it most likely wouldn't be a good idea with the
current patch.
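
The trade-off discussed above can be put into numbers with a minimal
sketch. It assumes a very simple disk model that is not stated in the
thread: every clustered IO pays one full seek (10ms, from the example
above) and then transfers at a fixed streaming rate (100MB/s, also from
the example above). The function names are hypothetical and only serve
this illustration.

```python
def throughput(io_bytes, seek_s=0.010, stream_bps=100e6):
    """Bytes per second when every IO of io_bytes pays one full seek.

    seek_s and stream_bps are assumed values (10ms seek, 100MB/s streaming).
    """
    return io_bytes / (seek_s + io_bytes / stream_bps)


def cluster_bytes_for(fraction, seek_s=0.010, stream_bps=100e6):
    """Cluster size needed to reach `fraction` of the streaming rate.

    Solves throughput(size) == fraction * stream_bps for size, which gives
    size = fraction / (1 - fraction) * seek_s * stream_bps.
    """
    if not 0 < fraction < 1:
        raise ValueError("fraction must be strictly between 0 and 1")
    return fraction / (1 - fraction) * seek_s * stream_bps


if __name__ == "__main__":
    for size in (4 * 1024, 64 * 1024, 1024 * 1024):
        print(f"{size:>8} bytes per seek: {throughput(size) / 1e6:.2f} MB/s")
    for fraction in (0.5, 0.9):
        needed = cluster_bytes_for(fraction)
        print(f"{fraction:.0%} of streaming rate: {needed / 1e6:.1f} MB per seek")
```

Under these assumptions, 4k and 64k IOs come out at roughly 0.4MB/s
and 6MB/s, in line with the figures quoted above. Reaching 90% of the
streaming rate would take clusters of about 9MB per seek, and the full
100MB/s is only approached asymptotically.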

-- Suleiman
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Ying Han on 19 Apr 2010 23:00

On Thu, Apr 15, 2010 at 3:30 AM, Johannes Weiner <hannes(a)cmpxchg.org> wrote:
> On Thu, Apr 15, 2010 at 05:26:27PM +0900, KOSAKI Motohiro wrote:
>> Cc to Johannes
>>
>> > >
>> > > On Apr 14, 2010, at 9:11 PM, KOSAKI Motohiro wrote:
>> > >
>> > > > Currently, vmscan's pageout() is one source of IO throughput degradation.
>> > > > Some IO workloads do a great many order-0 allocations and reclaims,
>> > > > and pageout's 4K IOs cause an annoying number of seeks.
>> > > >
>> > > > At least kswapd can avoid such pageout() calls, because kswapd doesn't
>> > > > need to consider the OOM-killer situation, so there is no risk.
>> > > >
>> > > > Signed-off-by: KOSAKI Motohiro <kosaki.motohiro(a)jp.fujitsu.com>
>> > >
>> > > What's your opinion on trying to cluster the writes done by pageout,
>> > > instead of not doing any paging out in kswapd?
>> > > Something along these lines:
>> >
>> > Interesting.
>> > So, I'd like to review your patch carefully. Can you please give me one
>> > day? :)
>>
>> Hannes, if I remember correctly, you tried similar swap-cluster IO
>> a long time ago. Now I can't remember why we didn't merge that patch.
>> Do you remember anything?
>
> Oh, quite vividly in fact :)  For a lot of swap loads the LRU order
> diverged heavily from swap slot order and readaround was a waste of
> time.
>
> Of course, the patch looked good, too, but it did not match reality
> that well.
>
> I guess 'how about this patch?' won't get us as far as 'how about
> those numbers/graphs of several real-life workloads?  oh and here
> is the patch...'.

Hannes,

We recently ran into this problem while running some experiments on
the ext4 filesystem. We hit a scenario where we were writing a
large file, or just opening a large file, with a limited memory allocation
(using containers), and the process got OOM-killed. The memory assigned to
the container is reasonably large, and the OOM cannot be reproduced
on ext2 with the same configuration.

Later we figured this might be due to delayed block allocation
in ext4. Vmscan sends a single page to ext4->writepage(), then ext4
punts if the block is DA'ed and re-dirties the page. On the other
hand, the flusher thread uses ext4->writepages(), which does include the
block allocation.

We looked at the OOM log under ext4: all pages within the container
were on the inactive list and either Dirty or WriteBack. Also, the zones
were all marked as "all_unreclaimable", which indicates that the reclaim path
has scanned the LRU quite a lot of times without making progress. If
delayed block allocation is the reason pageout() cannot flush dirty
pages, which then triggers OOMs, should we signal the fs to
force writeout of dirty pages under memory pressure?
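
The failure mode described above can be illustrated with a toy model.
This is only a sketch under stated assumptions, not ext4 or vmscan code:
the Page class and the writepage(), writepages() and reclaim_scan()
functions below are hypothetical stand-ins that mimic the behaviour
described in this mail (the single-page path punts on delayed-allocation
pages, the batch path allocates blocks first).

```python
from dataclasses import dataclass


@dataclass
class Page:
    dirty: bool = True
    delalloc: bool = True  # no on-disk block allocated yet (delayed allocation)


def writepage(page):
    """Toy single-page path, as used by reclaim in this model.

    Punts on delayed-allocation pages: the page is re-dirtied and not written.
    """
    if page.delalloc:
        page.dirty = True
        return False
    page.dirty = False
    return True


def writepages(pages):
    """Toy batch path, as used by the flusher: allocate blocks, then write."""
    cleaned = 0
    for page in pages:
        if page.dirty:
            page.delalloc = False
            page.dirty = False
            cleaned += 1
    return cleaned


def reclaim_scan(pages, max_passes=6):
    """Scan the list repeatedly; return how many pages reclaim managed to clean."""
    cleaned = 0
    for _ in range(max_passes):
        for page in pages:
            if page.dirty and writepage(page):
                cleaned += 1
    return cleaned


if __name__ == "__main__":
    pages = [Page() for _ in range(8)]
    print("cleaned by reclaim:", reclaim_scan(pages))
    print("cleaned by flusher:", writepages(pages))
```

In this model, reclaim can scan the list any number of times without
cleaning a single page, while one pass of the batch path cleans all of
them, which is the asymmetry the question above is about.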

--Ying

>
>> > >      Cluster writes to disk due to memory pressure.
>> > >
>> > >      Write out logically adjacent pages to the one we're paging out
>> > >      so that we may get better IOs in these situations:
>> > >      These pages are likely to be contiguous on disk to the one we're
>> > >      writing out, so they should get merged into a single disk IO.
>> > >
>> > >      Signed-off-by: Suleiman Souhlal <suleiman(a)google.com>
>
> For random IO, LRU order will have nothing to do with mapping/disk order.
