Downsides to madvise/fadvise(willneed) for application startup [Kernel]

From: Taras Glek on 6 Apr 2010 18:10

On 04/05/2010 04:52 PM, Roland Dreier wrote:
> Almost certainly teaching my grandmother to suck eggs, but are you aware
> of the work Michael Meeks has done on improving openoffice.org startup time?
>
Yes. There were some stones left unturned in the cold startup area.
It turns out that every large application suffers from low I/O
throughput, likely due to a lack of cooperation between the dynamic linker
and the kernel.
There is a glibc bug filed on that.

http://sourceware.org/bugzilla/show_bug.cgi?id=11431

Unfortunately, few userspace people seem to know exactly how madvise()
hints behave, so I was hoping someone on LKML would clue me in.

Taras
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Johannes Weiner on 6 Apr 2010 18:30

On Tue, Apr 06, 2010 at 02:57:30PM -0700, Taras Glek wrote:
> On 04/06/2010 02:51 AM, Johannes Weiner wrote:
> >On Mon, Apr 05, 2010 at 03:43:02PM -0700, Taras Glek wrote:
> >
> >>Hello,
> >>I am working on improving Mozilla startup times. It turns out that page
> >>faults(caused by lack of cooperation between user/kernelspace) are the
> >>main cause of slow startup. I need some insights from someone who
> >>understands linux vm behavior.
> >>
> >>Current Situation:
> >>The dynamic linker mmap()s executable and data sections of our
> >>executable but it doesn't call madvise().
> >>By default page faults trigger 131072byte reads. To make matters worse,
> >>the compile-time linker + gcc lay out code in a manner that does not
> >>correspond to how the resulting executable will be executed(ie the
> >>layout is basically random). This means that during startup 15-40mb
> >>binaries are read in basically random fashion. Even if one orders the
> >>binary optimally, throughput is still suboptimal due to the puny
> >>readahead.
> >>
> >>IO Hints:
> >>Fortunately when one specifies madvise(WILLNEED) pagefaults trigger 2mb
> >>reads and a binary that tends to take 110 page faults(ie program stops
> >>execution and waits for disk) can be reduced down to 6. This has the
> >>potential to double application startup speed for large apps without any clear
> >>downsides. Suse ships their glibc with a dynamic linker patch to
> >>fadvise() dynamic libraries(not sure why they switched from doing
> >>madvise before).
> >>
> >>I filed a glibc bug about this at
> >>http://sourceware.org/bugzilla/show_bug.cgi?id=11431 . Uli commented
> >>with his concern about wasting memory resources. What is the impact of
> >>madvise(WILLNEED) or the fadvise equivalent on systems under memory
> >>pressure? Does the kernel simply start ignoring these hints?
> >>
> >It will throttle based on memory pressure. In idle situations it will
> >eat your file cache, however, to satisfy the request.
> >
> Define idle situations. Do you mean that madv(willneed) will aggressively
> readahead, but only while cpu(or disk?) is idle?
> I am trying to optimize application startup which means that the cpu is
> busy while not blocked on io.

Sorry. I meant without memory pressure. It will trigger readahead for the
whole page range immediately, unless the sum of free pages and file cache
pages is less than that.

So yes, it will be aggressive against the cache but should not touch things
frequently in use or start swapping for example.

> >>Also, once an application is started is it reasonable to keep it
> >>madvise(WILLNEED)ed or should the madvise flags be reset?
> >>
> >It's a one-time operation that starts immediate readahead, no permanent
> >changes are done.
> >
> I may be measuring this wrong, but in my experience the only change
> madvise(willneed) does is increase the length parameter to
> __do_page_cache_readahead(). My script is at
> http://hg.mozilla.org/users/tglek_mozilla.com/startup/file/6453ad2a7906/kernelio.stp
> .

Whether the page is read on a major fault or by means of WILLNEED,
they both end up calling this function. It's just that faulting
does all the heuristics and WILLNEED will just force reading the
pages in the specified range.

But your question whether it would be reasonable to keep the region
WILLNEED madvised makes no sense. It's just a request to prepopulate
the page cache from disk data immediately instead of waiting for
faults to trigger the reads.
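
The one-shot behavior described above can be sketched in C. This is a minimal illustration, not the dynamic linker's code: `prefetch_mapping` is a hypothetical helper name, and mapping the whole file read-only is an assumption made for brevity.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: map `path` read-only and ask the kernel to start
 * reading the whole mapping into the page cache.
 * Returns 0 on success, -1 on any failure. */
int prefetch_mapping(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size == 0) {
        close(fd);
        return -1;
    }
    size_t len = (size_t)st.st_size;

    void *addr = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd); /* the mapping remains valid after the descriptor is closed */
    if (addr == MAP_FAILED)
        return -1;

    /* A request to start readahead for this range now. Per the discussion
     * above, it leaves no lasting state on the mapping. */
    int rc = madvise(addr, len, MADV_WILLNEED);

    /* A real loader would keep the mapping; this sketch only warms the
     * page cache and then unmaps. */
    munmap(addr, len);
    return rc == 0 ? 0 : -1;
}
```

A caller would invoke something like `prefetch_mapping("/path/to/library.so")` once before first use; there is nothing to undo or reset afterwards.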

From: Taras Glek on 6 Apr 2010 18:50

On 04/06/2010 03:26 PM, Johannes Weiner wrote:
> On Tue, Apr 06, 2010 at 02:57:30PM -0700, Taras Glek wrote:
>
>> On 04/06/2010 02:51 AM, Johannes Weiner wrote:
>>
>>> On Mon, Apr 05, 2010 at 03:43:02PM -0700, Taras Glek wrote:
>>>
>>>
>>>> Hello,
>>>> I am working on improving Mozilla startup times. It turns out that page
>>>> faults(caused by lack of cooperation between user/kernelspace) are the
>>>> main cause of slow startup. I need some insights from someone who
>>>> understands linux vm behavior.
>>>>
>>>> Current Situation:
>>>> The dynamic linker mmap()s executable and data sections of our
>>>> executable but it doesn't call madvise().
>>>> By default page faults trigger 131072byte reads. To make matters worse,
>>>> the compile-time linker + gcc lay out code in a manner that does not
>>>> correspond to how the resulting executable will be executed(ie the
>>>> layout is basically random). This means that during startup 15-40mb
>>>> binaries are read in basically random fashion. Even if one orders the
>>>> binary optimally, throughput is still suboptimal due to the puny
>>>> readahead.
>>>>
>>>> IO Hints:
>>>> Fortunately when one specifies madvise(WILLNEED) pagefaults trigger 2mb
>>>> reads and a binary that tends to take 110 page faults(ie program stops
>>>> execution and waits for disk) can be reduced down to 6. This has the
>>>> potential to double application startup speed for large apps without any clear
>>>> downsides. Suse ships their glibc with a dynamic linker patch to
>>>> fadvise() dynamic libraries(not sure why they switched from doing
>>>> madvise before).
>>>>
>>>> I filed a glibc bug about this at
>>>> http://sourceware.org/bugzilla/show_bug.cgi?id=11431 . Uli commented
>>>> with his concern about wasting memory resources. What is the impact of
>>>> madvise(WILLNEED) or the fadvise equivalent on systems under memory
>>>> pressure? Does the kernel simply start ignoring these hints?
>>>>
>>>>
>>> It will throttle based on memory pressure. In idle situations it will
>>> eat your file cache, however, to satisfy the request.
>>>
>>>
>> Define idle situations. Do you mean that madv(willneed) will aggressively
>> readahead, but only while cpu(or disk?) is idle?
>> I am trying to optimize application startup which means that the cpu is
>> busy while not blocked on io.
>>
> Sorry. I meant without memory pressure. It will trigger readahead for the
> whole page range immediately, unless the sum of free pages and file cache
> pages is less than that.
>
> So yes, it will be aggressive against the cache but should not touch things
> frequently in use or start swapping for example.
>
Perfect.
>
>>>> Also, once an application is started is it reasonable to keep it
>>>> madvise(WILLNEED)ed or should the madvise flags be reset?
>>>>
>>>>
>>> It's a one-time operation that starts immediate readahead, no permanent
>>> changes are done.
>>>
>>>
>> I may be measuring this wrong, but in my experience the only change
>> madvise(willneed) does is increase the length parameter to
>> __do_page_cache_readahead(). My script is at
>> http://hg.mozilla.org/users/tglek_mozilla.com/startup/file/6453ad2a7906/kernelio.stp
>> .
>>
> Whether the page is read on a major fault or by means of WILLNEED,
> they both end up calling this function. It's just that faulting
> does all the heuristics and WILLNEED will just force reading the
> pages in the specified range.
>
> But your question whether it would be reasonable to keep the region
> WILLNEED madvised makes no sense. It's just a request to prepopulate
> the page cache from disk data immediately instead of waiting for
> faults to trigger the reads.
>
Ok. Thanks for clarifying that. I was misinterpreting my io log.
Is there a way to force page faults from a particular memory mapping to
do more readahead, i.e. when WILLNEED is not used?

Have heuristics that read backwards been considered? Currently, if one
faults in the page at offset 4096, that page and a few pages following it
will be preread. It would be interesting to try prereading pages both
before and after the page being faulted in.
For a graph of "backwards" io see the "Post-linker Fail" section in
http://blog.mozilla.com/tglek/2010/03/24/linux-why-loading-binaries-from-disk-sucks/
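
The "before and after" idea can be approximated from userspace by computing a window centred on the offset of interest. The sketch below is only an illustration of that arithmetic; it is not the kernel's heuristic, and `struct window` and `window_around` are hypothetical names.

```c
#include <stddef.h>

/* A byte range inside a file: [start, start + len). */
struct window {
    size_t start;
    size_t len;
};

/* Hypothetical helper: choose a page-aligned window of up to `span` bytes
 * centred on `offset`, clamped to [0, file_size).
 * Assumes page_size > 0. */
struct window window_around(size_t offset, size_t span,
                            size_t file_size, size_t page_size)
{
    size_t half = span / 2;

    /* Step back half a window, without going below the start of the file. */
    size_t start = offset > half ? offset - half : 0;
    start -= start % page_size; /* align down to a page boundary */
    if (start > file_size)
        start = file_size;

    /* Extend forward by the full span, without running past the end. */
    size_t end = start + span;
    if (end > file_size)
        end = file_size;

    struct window w = { start, end - start };
    return w;
}
```

The resulting range could then be handed to `madvise(base + w.start, w.len, MADV_WILLNEED)` for a mapping that starts at `base`, so that pages on both sides of the expected fault are requested.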

Taras

From: Wu Fengguang on 6 Apr 2010 22:30

Hi Taras,

On Tue, Apr 06, 2010 at 05:51:35PM +0800, Johannes Weiner wrote:
> On Mon, Apr 05, 2010 at 03:43:02PM -0700, Taras Glek wrote:
> > Hello,
> > I am working on improving Mozilla startup times. It turns out that page
> > faults(caused by lack of cooperation between user/kernelspace) are the
> > main cause of slow startup. I need some insights from someone who
> > understands linux vm behavior.

How about improving Fedora (and other distros) to preload Mozilla (and
other apps the user ran during the previous boot) with fadvise() at boot
time? This sounds like the most reasonable option.
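
A boot-time preloader along these lines might issue one call per file. The following is a rough sketch under stated assumptions: `preload_file` is a hypothetical name, and how the list of files is chosen (for example, from the previous boot) is left out entirely.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: ask the kernel to read all of `path` into the
 * page cache without mapping it. Returns 0 on success, -1 on failure. */
int preload_file(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    /* offset 0 with len 0 means "from the start to the end of the file".
     * posix_fadvise() returns an error number directly rather than
     * setting errno. */
    int err = posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED);

    close(fd); /* pages already read stay in the page cache */
    return err == 0 ? 0 : -1;
}
```

Because the hint works on a file descriptor, it needs no mapping at all, which is what makes it convenient for a separate preload step that runs before the application starts.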

As for the kernel readahead, I have a patchset to increase default
mmap read-around size from 128kb to 512kb (except for small memory
systems). This should help your case as well.

> > Current Situation:
> > The dynamic linker mmap()s executable and data sections of our
> > executable but it doesn't call madvise().
> > By default page faults trigger 131072byte reads. To make matters worse,
> > the compile-time linker + gcc lay out code in a manner that does not
> > correspond to how the resulting executable will be executed(ie the
> > layout is basically random). This means that during startup 15-40mb
> > binaries are read in basically random fashion. Even if one orders the
> > binary optimally, throughput is still suboptimal due to the puny readahead.
> >
> > IO Hints:
> > Fortunately when one specifies madvise(WILLNEED) pagefaults trigger 2mb
> > reads and a binary that tends to take 110 page faults(ie program stops
> > execution and waits for disk) can be reduced down to 6. This has the
> > potential to double application startup speed for large apps without any clear
> > downsides.
> >
> > Suse ships their glibc with a dynamic linker patch to fadvise()
> > dynamic libraries(not sure why they switched from doing madvise
> > before).

This is interesting. I wonder how SuSE implements the policy.
Do you have the patch or some strace output that demonstrates the
fadvise() call?

> > I filed a glibc bug about this at
> > http://sourceware.org/bugzilla/show_bug.cgi?id=11431 . Uli commented
> > with his concern about wasting memory resources. What is the impact of
> > madvise(WILLNEED) or the fadvise equivalent on systems under memory
> > pressure? Does the kernel simply start ignoring these hints?
>
> It will throttle based on memory pressure. In idle situations it will
> eat your file cache, however, to satisfy the request.
>
> Now, the file cache should be much bigger than the amount of unneeded
> pages you prefault with the hint over the whole library, so I guess the
> benefit of prefaulting the right pages outweighs the downside of evicting
> some cache for unused library pages.
>
> Still, it's a workaround for deficits in the demand-paging/readahead
> heuristics and thus a bit ugly, I feel. Maybe Wu can help.

Program page faults are inherently random, so the straightforward
solution would be to increase the mmap read-around size (for desktops
with reasonably large memory), rather than to improve program layout
or readahead heuristics :)

> > Also, once an application is started is it reasonable to keep it
> > madvise(WILLNEED)ed or should the madvise flags be reset?
>
> It's a one-time operation that starts immediate readahead, no permanent
> changes are done.

Right. The kernel regards WILLNEED as a readahead request from userspace.

> > Perhaps the kernel could monitor the page-in patterns to increase the
> > readahead sizes? This may already happen, I've noticed that a handful of
> > pagefaults trigger > 131072bytes of IO, perhaps this just needs tweaking.
>
> CCd the man :-)

Thank you :)

Cheers,
Fengguang

From: Taras Glek on 6 Apr 2010 23:00

On 04/06/2010 07:24 PM, Wu Fengguang wrote:
> Hi Taras,
>
> On Tue, Apr 06, 2010 at 05:51:35PM +0800, Johannes Weiner wrote:
>
>> On Mon, Apr 05, 2010 at 03:43:02PM -0700, Taras Glek wrote:
>>
>>> Hello,
>>> I am working on improving Mozilla startup times. It turns out that page
>>> faults(caused by lack of cooperation between user/kernelspace) are the
>>> main cause of slow startup. I need some insights from someone who
>>> understands linux vm behavior.
>>>
> How about improving Fedora (and other distros) to preload Mozilla (and
> other apps the user ran during the previous boot) with fadvise() at boot
> time? This sounds like the most reasonable option.
>
That's a slightly different use case. I'd rather have all large apps
start up as efficiently as possible without any hacks. Though until we
get there, we'll be using all of the hacks we can.
> As for the kernel readahead, I have a patchset to increase default
> mmap read-around size from 128kb to 512kb (except for small memory
> systems). This should help your case as well.
>
Yes. Is the current readahead really doing read-around (i.e. does it read
pages before the one being faulted)? From what I've seen, having the
dynamic linker read binary sections backwards causes faults.
http://sourceware.org/bugzilla/show_bug.cgi?id=11447
>
>>> Current Situation:
>>> The dynamic linker mmap()s executable and data sections of our
>>> executable but it doesn't call madvise().
>>> By default page faults trigger 131072byte reads. To make matters worse,
>>> the compile-time linker + gcc lay out code in a manner that does not
>>> correspond to how the resulting executable will be executed(ie the
>>> layout is basically random). This means that during startup 15-40mb
>>> binaries are read in basically random fashion. Even if one orders the
>>> binary optimally, throughput is still suboptimal due to the puny readahead.
>>>
>>> IO Hints:
>>> Fortunately when one specifies madvise(WILLNEED) pagefaults trigger 2mb
>>> reads and a binary that tends to take 110 page faults(ie program stops
>>> execution and waits for disk) can be reduced down to 6. This has the
>>> potential to double application startup speed for large apps without any clear
>>> downsides.
>>>
>>> Suse ships their glibc with a dynamic linker patch to fadvise()
>>> dynamic libraries(not sure why they switched from doing madvise
>>> before).
>>>
> This is interesting. I wonder how SuSE implements the policy.
> Do you have the patch or some strace output that demonstrates the
> fadvise() call?
>
glibc-2.3.90-ld.so-madvise.diff in
http://www.rpmseek.com/rpm/glibc-2.4-31.12.3.src.html?hl=com&cba=0:G:0:3732595:0:15:0:

As I recall they just fadvise the file descriptor before accessing it.
>
>>> I filed a glibc bug about this at
>>> http://sourceware.org/bugzilla/show_bug.cgi?id=11431 . Uli commented
>>> with his concern about wasting memory resources. What is the impact of
>>> madvise(WILLNEED) or the fadvise equivalent on systems under memory
>>> pressure? Does the kernel simply start ignoring these hints?
>>>
>> It will throttle based on memory pressure. In idle situations it will
>> eat your file cache, however, to satisfy the request.
>>
>> Now, the file cache should be much bigger than the amount of unneeded
>> pages you prefault with the hint over the whole library, so I guess the
>> benefit of prefaulting the right pages outweighs the downside of evicting
>> some cache for unused library pages.
>>
>> Still, it's a workaround for deficits in the demand-paging/readahead
>> heuristics and thus a bit ugly, I feel. Maybe Wu can help.
>>
> Program page faults are inherently random, so the straightforward
> solution would be to increase the mmap read-around size (for desktops
> with reasonably large memory), rather than to improve program layout
> or readahead heuristics :)
>
Program page faults may exhibit random behavior once the program is up and
running.

During startup, the page-in pattern of over-engineered OO applications is
very predictable. Programs are laid out based on compilation units,
which have no relation to how they are executed. Another problem is that
any large old application will have lots of code that is either rarely
executed or completely dead. Random sprinkling of live code among mostly
unneeded code is a problem.
I'm able to reduce startup pagefaults by 2.5x and mem usage by a few MB
with proper binary layout. Even if one lays out a program wrongly, the
worst-case pagein pattern will be pretty similar to what it is by default.

But yes, I completely agree that it would be awesome to increase the
readahead size proportionally to available memory. It's a little silly
to be reading tens of megabytes in 128kb increments :) You rock for
trying to modernize this.
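
To put rough numbers on that, here is a small sketch of the arithmetic. It gives only a lower bound, assuming every byte of the file is read exactly once in full-sized windows; real startup touches only part of a binary and in no particular order. `min_reads` is a hypothetical name.

```c
#include <stddef.h>

/* Hypothetical helper: smallest number of reads of `window_bytes` each
 * that can cover `file_bytes` (ceiling division).
 * Returns 0 if window_bytes is 0. */
size_t min_reads(size_t file_bytes, size_t window_bytes)
{
    if (window_bytes == 0)
        return 0;
    return (file_bytes + window_bytes - 1) / window_bytes;
}
```

For a 40 MB binary this works out to 320 reads at 128 KB, 80 reads at 512 KB, and 20 reads at 2 MB.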

>
>>> Also, once an application is started is it reasonable to keep it
>>> madvise(WILLNEED)ed or should the madvise flags be reset?
>>>
>> It's a one-time operation that starts immediate readahead, no permanent
>> changes are done.
>>
> Right. The kernel regards WILLNEED as a readahead request from userspace.
>
>
>>> Perhaps the kernel could monitor the page-in patterns to increase the
>>> readahead sizes? This may already happen, I've noticed that a handful of
>>> pagefaults trigger> 131072bytes of IO, perhaps this just needs tweaking.
>>>
>> CCd the man :-)
>>
> Thank you :)
>
> Cheers,
> Fengguang
>

Cheers,
Taras
