From: Hugh Dickins on
On Thu, 29 Jul 2010, KAMEZAWA Hiroyuki wrote:
> On Thu, 29 Jul 2010 14:23:33 +0900 (JST)
> KOSAKI Motohiro <kosaki.motohiro(a)jp.fujitsu.com> wrote:
>
> > Can you please add explicit commenting in the code?
> >
> How about this ?
> ==
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu(a)jp.fujitsu.com>
>
> At hibernation, all pages-should-be-saved are written into a image (here, swap).
> Then, swap_map[], memmap etcs are also saved into disks.
>
> But, swap allocation happens one by one. So, the final image of swap_map[] is
> different from saved one and the commit c9e444103b5e7a5a3519f9913f59767f92e33baf
> changes page's state while assiging swap. Because memory can be modified in
> hibernation is only not-to-be-save memory. it's a breakage.
>
> This patch fixes it by disabling swap entry reuse at hibernation.
>
>
>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu(a)jp.fujitsu.com>
> ---
> mm/swapfile.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> Index: linux-2.6.34.org/mm/swapfile.c
> ===================================================================
> --- linux-2.6.34.org.orig/mm/swapfile.c
> +++ linux-2.6.34.org/mm/swapfile.c
> @@ -315,8 +315,15 @@ checks:
> if (offset > si->highest_bit)
> scan_base = offset = si->lowest_bit;
>
> - /* reuse swap entry of cache-only swap if not busy. */
> - if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) {
> + /*
> + * reuse swap entry of cache-only swap if not busy &&
> + * when we're called via pageout(). At hibernation, swap-reuse
> + * is harmful because it changes memory status...which may
> + * be saved already.
> + */
> + if (vm_swap_full()
> + && usage == SWAP_HAS_CACHE
> + && si->swap_map[offset] == SWAP_HAS_CACHE) {
> int swap_was_freed;
> spin_unlock(&swap_lock);
> swap_was_freed = __try_to_reclaim_swap(si, offset);
>
> --

KAMEZAWA-San, that is a brilliant realization, I salute you.

So, in between snapshotting the image and actually hibernating, we have
two parallel universes, one frozen in the image, the other writing that
out to swap: with the danger that the latter (which is about to die)
will introduce fatal inconsistencies in the former by placing pages in
swap locations innocently reallocated from it. (Excuse me while I go
write the movie script.)

I'd never got to think about that before.

Your fix is neat though hacky (it's somewhat of a coincidence that the
usage arg happens to distinguish the hibernation case), and should be
enough to fix "your" regression, but is it really enough?

I'm worrying about the try_to_free_swap() calls in shrink_page_list():
can't those get called from direct reclaim (even if GFP_NOIO), and can't
direct reclaim get invoked from somewhere in the I/O path below
swap_writepage(), used for writing out the hibernation image?

Direct reclaim because kswapd does set_freezable(), so should itself
be out of the picture. But we cannot freeze writing the hibernation
image, and its occasional need for memory, so maybe a different approach
is required.

I've CC'ed Andrea because we were having an offline conversation about
whether ksmd (and his khugepaged) need to set_freezable(); and I wonder
if this swap bug underlies his interest, though he was mainly worrying
about I/O in progress.

Despite reading Documentation/power/freezing-of-tasks.txt, I have no
clear idea of what really needs freezing, and whether freezing can
fully handle the issues. Rafael, please can you advise?

Thanks,
Hugh
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Andrea Arcangeli on
Hi everyone,

On Thu, Jul 29, 2010 at 11:44:31AM -0700, Hugh Dickins wrote:
> I've CC'ed Andrea because we were having an offline conversation about
> whether ksmd (and his khugepaged) need to set_freezable(); and I wonder
> if this swap bug underlies his interest, though he was mainly worrying
> about I/O in progress.

My opinion is that with current freezer model it is needed for suspend
to disk. The kthread_should_stop seems useless for kswapd/ksmd, but
I'm afraid it might get useful in the future so just to stay on the
safe side I added it to khugepaged as well, but it's contributing to
the pollution.

I've no idea why the freezing isn't preemptive (through the scheduler,
all these kernel threads are obviously lowlatency beasts) by default
(if kthread doesn't run set_freezable) and instead it requires
cooperative freezing. Now I can imagine a kernel thread not being
happy about preemptive freezing, and I also can imagine a kernel
thread not being happy about being frozen. But I think the default
shall be "preempting freezing by default" (removing all those
set_freezable/try_to_freeze and furthermore reducing the latency it
takes to freeze the task without having to add !freezing(current)
checks in the loops, by taking advantage of the existing cond_resched
or preempt=Y) and then introduce a set_freezable_cooperative() if it
wants to call try_to_freeze in a cooperative way in a well defined
point of the code, or set_not_freezable() if it doesn't want to
be frozen.

But for now I'm afraid the below is needed (only ksm.c part applies to
upstream).

------
Subject: freeze khugepaged and ksmd

From: Andrea Arcangeli <aarcange(a)redhat.com>

It's unclear why schedule friendly kernel threads can't be taken away by the
CPU through the scheduler itself. It's safer to stop them as they can trigger
memory allocation, if kswapd also freezes itself to avoid generating I/O they
have too.

Signed-off-by: Andrea Arcangeli <aarcange(a)redhat.com>
---

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -15,6 +15,7 @@
#include <linux/mm_inline.h>
#include <linux/kthread.h>
#include <linux/khugepaged.h>
+#include <linux/freezer.h>
#include <asm/tlb.h>
#include <asm/pgalloc.h>
#include "internal.h"
@@ -2053,6 +2054,9 @@ static void khugepaged_do_scan(struct pa
break;
#endif

+ if (unlikely(kthread_should_stop() || freezing(current)))
+ break;
+
spin_lock(&khugepaged_mm_lock);
if (!khugepaged_scan.mm_slot)
pass_through_head++;
@@ -2115,6 +2119,9 @@ static void khugepaged_loop(void)
if (hpage)
put_page(hpage);
#endif
+ try_to_freeze();
+ if (unlikely(kthread_should_stop()))
+ break;
if (khugepaged_has_work()) {
DEFINE_WAIT(wait);
if (!khugepaged_scan_sleep_millisecs)
@@ -2134,6 +2141,7 @@ static int khugepaged(void *none)
{
struct mm_slot *mm_slot;

+ set_freezable();
set_user_nice(current, 19);

/* serialize with start_khugepaged() */
@@ -2148,6 +2156,8 @@ static int khugepaged(void *none)
mutex_lock(&khugepaged_mutex);
if (!khugepaged_enabled())
break;
+ if (unlikely(kthread_should_stop()))
+ break;
}

spin_lock(&khugepaged_mm_lock);
diff --git a/mm/ksm.c b/mm/ksm.c
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -33,6 +33,7 @@
#include <linux/mmu_notifier.h>
#include <linux/swap.h>
#include <linux/ksm.h>
+#include <linux/freezer.h>

#include <asm/tlbflush.h>
#include "internal.h"
@@ -1386,7 +1387,7 @@ static void ksm_do_scan(unsigned int sca
struct rmap_item *rmap_item;
struct page *uninitialized_var(page);

- while (scan_npages--) {
+ while (scan_npages-- && likely(!freezing(current))) {
cond_resched();
rmap_item = scan_get_next_rmap_item(&page);
if (!rmap_item)
@@ -1412,6 +1413,8 @@ static int ksm_scan_thread(void *nothing
ksm_do_scan(ksm_thread_pages_to_scan);
mutex_unlock(&ksm_thread_mutex);

+ try_to_freeze();
+
if (ksmd_should_run()) {
schedule_timeout_interruptible(
msecs_to_jiffies(ksm_thread_sleep_millisecs));
@@ -1953,6 +1956,7 @@ static int __init ksm_init(void)
struct task_struct *ksm_thread;
int err;

+ set_freezable();
err = ksm_slab_init();
if (err)
goto out;
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Rafael J. Wysocki on
On Thursday, July 29, 2010, Hugh Dickins wrote:
> On Thu, 29 Jul 2010, KAMEZAWA Hiroyuki wrote:
> > On Thu, 29 Jul 2010 14:23:33 +0900 (JST)
> > KOSAKI Motohiro <kosaki.motohiro(a)jp.fujitsu.com> wrote:
> >
> > > Can you please add explicit commenting in the code?
> > >
> > How about this ?
> > ==
> > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu(a)jp.fujitsu.com>
> >
> > At hibernation, all pages-should-be-saved are written into a image (here, swap).
> > Then, swap_map[], memmap etcs are also saved into disks.
> >
> > But, swap allocation happens one by one. So, the final image of swap_map[] is
> > different from saved one and the commit c9e444103b5e7a5a3519f9913f59767f92e33baf
> > changes page's state while assiging swap. Because memory can be modified in
> > hibernation is only not-to-be-save memory. it's a breakage.
> >
> > This patch fixes it by disabling swap entry reuse at hibernation.
> >
> >
> >
> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu(a)jp.fujitsu.com>
> > ---
> > mm/swapfile.c | 4 +++-
> > 1 file changed, 3 insertions(+), 1 deletion(-)
> >
> > Index: linux-2.6.34.org/mm/swapfile.c
> > ===================================================================
> > --- linux-2.6.34.org.orig/mm/swapfile.c
> > +++ linux-2.6.34.org/mm/swapfile.c
> > @@ -315,8 +315,15 @@ checks:
> > if (offset > si->highest_bit)
> > scan_base = offset = si->lowest_bit;
> >
> > - /* reuse swap entry of cache-only swap if not busy. */
> > - if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) {
> > + /*
> > + * reuse swap entry of cache-only swap if not busy &&
> > + * when we're called via pageout(). At hibernation, swap-reuse
> > + * is harmful because it changes memory status...which may
> > + * be saved already.
> > + */
> > + if (vm_swap_full()
> > + && usage == SWAP_HAS_CACHE
> > + && si->swap_map[offset] == SWAP_HAS_CACHE) {
> > int swap_was_freed;
> > spin_unlock(&swap_lock);
> > swap_was_freed = __try_to_reclaim_swap(si, offset);
> >
> > --
>
> KAMEZAWA-San, that is a brilliant realization, I salute you.
>
> So, in between snapshotting the image and actually hibernating, we have
> two parallel universes, one frozen in the image, the other writing that
> out to swap: with the danger that the latter (which is about to die)
> will introduce fatal inconsistencies in the former by placing pages in
> swap locations innocently reallocated from it. (Excuse me while I go
> write the movie script.)
>
> I'd never got to think about that before.
>
> Your fix is neat though hacky (it's somewhat of a coincidence that the
> usage arg happens to distinguish the hibernation case), and should be
> enough to fix "your" regression, but is it really enough?
>
> I'm worrying about the try_to_free_swap() calls in shrink_page_list():
> can't those get called from direct reclaim (even if GFP_NOIO), and can't
> direct reclaim get invoked from somewhere in the I/O path below
> swap_writepage(), used for writing out the hibernation image?
>
> Direct reclaim because kswapd does set_freezable(), so should itself
> be out of the picture. But we cannot freeze writing the hibernation
> image, and its occasional need for memory, so maybe a different approach
> is required.

We preallocate memory for the I/O related to the writing of hibernation
images, so swapping things out shouldn't be necessary at that time just for
this purpose. The I/O itself is done directly at the block layer level and it
will never be frozen, because it is done by the hibernate process itself.

> I've CC'ed Andrea because we were having an offline conversation about
> whether ksmd (and his khugepaged) need to set_freezable(); and I wonder
> if this swap bug underlies his interest, though he was mainly worrying
> about I/O in progress.
>
> Despite reading Documentation/power/freezing-of-tasks.txt, I have no
> clear idea of what really needs freezing, and whether freezing can
> fully handle the issues. Rafael, please can you advise?

Well, the rule of thumb (if it can be called this way) is that the contents of
the image has to be consistent with whatever is stored in permanent storage.
So, for example, filesystems that were mounted before creating the image
cannot be modified until the image is restored. Consequently, if there are
any kernel threads that might cause that to happen, they need to be frozen.

Now, if I understand it correctly, the failure mode is that certain page had
been swapped out before the image was created and then it was swapped in
while we were writing the image out and the slot occupied by it was re-used.
In that case the image would contain wrong information on the state of the
swap and that would result in wrong data being loaded into memory on an attempt
to swap that page in after resume.

So, generally, we have to avoid doing things that would result in swapping
memory pages out after we have created a hibernation image. If that can be
achieved by freezing certain kernel threads, that probably is the simplest
approach.

Thanks,
Rafael
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Rafael J. Wysocki on
On Thursday, July 29, 2010, Andrea Arcangeli wrote:
> Hi everyone,
>
> On Thu, Jul 29, 2010 at 11:44:31AM -0700, Hugh Dickins wrote:
> > I've CC'ed Andrea because we were having an offline conversation about
> > whether ksmd (and his khugepaged) need to set_freezable(); and I wonder
> > if this swap bug underlies his interest, though he was mainly worrying
> > about I/O in progress.
>
> My opinion is that with current freezer model it is needed for suspend
> to disk. The kthread_should_stop seems useless for kswapd/ksmd, but
> I'm afraid it might get useful in the future so just to stay on the
> safe side I added it to khugepaged as well, but it's contributing to
> the pollution.
>
> I've no idea why the freezing isn't preemptive (through the scheduler,
> all these kernel threads are obviously lowlatency beasts) by default
> (if kthread doesn't run set_freezable) and instead it requires
> cooperative freezing.

This is a result of several discussions on the LKML. :-) Basically, the vast
majority of kernel threads need not be frozen and attempting to freeze them
all introduces race conditions that are extremely diffucult to resolve (eg.
with multithread workqueues).

> Now I can imagine a kernel thread not being
> happy about preemptive freezing, and I also can imagine a kernel
> thread not being happy about being frozen. But I think the default
> shall be "preempting freezing by default" (removing all those
> set_freezable/try_to_freeze and furthermore reducing the latency it
> takes to freeze the task without having to add !freezing(current)
> checks in the loops, by taking advantage of the existing cond_resched
> or preempt=Y) and then introduce a set_freezable_cooperative() if it
> wants to call try_to_freeze in a cooperative way in a well defined
> point of the code, or set_not_freezable() if it doesn't want to
> be frozen.

I'm afraid that would be difficult to achieve in general. Besides, there's
no reason why kernel threads that need not be frozen should care about the
freezing thing at all. It's much simpler to require the ones that need to be
frozen to cooperate.

> But for now I'm afraid the below is needed (only ksm.c part applies to
> upstream).

Looks good to me.

Can you please prepare a patch against mainline for Ondrej to try?

Thanks,
Rafael


> ------
> Subject: freeze khugepaged and ksmd
>
> From: Andrea Arcangeli <aarcange(a)redhat.com>
>
> It's unclear why schedule friendly kernel threads can't be taken away by the
> CPU through the scheduler itself. It's safer to stop them as they can trigger
> memory allocation, if kswapd also freezes itself to avoid generating I/O they
> have too.
>
> Signed-off-by: Andrea Arcangeli <aarcange(a)redhat.com>
> ---
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -15,6 +15,7 @@
> #include <linux/mm_inline.h>
> #include <linux/kthread.h>
> #include <linux/khugepaged.h>
> +#include <linux/freezer.h>
> #include <asm/tlb.h>
> #include <asm/pgalloc.h>
> #include "internal.h"
> @@ -2053,6 +2054,9 @@ static void khugepaged_do_scan(struct pa
> break;
> #endif
>
> + if (unlikely(kthread_should_stop() || freezing(current)))
> + break;
> +
> spin_lock(&khugepaged_mm_lock);
> if (!khugepaged_scan.mm_slot)
> pass_through_head++;
> @@ -2115,6 +2119,9 @@ static void khugepaged_loop(void)
> if (hpage)
> put_page(hpage);
> #endif
> + try_to_freeze();
> + if (unlikely(kthread_should_stop()))
> + break;
> if (khugepaged_has_work()) {
> DEFINE_WAIT(wait);
> if (!khugepaged_scan_sleep_millisecs)
> @@ -2134,6 +2141,7 @@ static int khugepaged(void *none)
> {
> struct mm_slot *mm_slot;
>
> + set_freezable();
> set_user_nice(current, 19);
>
> /* serialize with start_khugepaged() */
> @@ -2148,6 +2156,8 @@ static int khugepaged(void *none)
> mutex_lock(&khugepaged_mutex);
> if (!khugepaged_enabled())
> break;
> + if (unlikely(kthread_should_stop()))
> + break;
> }
>
> spin_lock(&khugepaged_mm_lock);
> diff --git a/mm/ksm.c b/mm/ksm.c
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -33,6 +33,7 @@
> #include <linux/mmu_notifier.h>
> #include <linux/swap.h>
> #include <linux/ksm.h>
> +#include <linux/freezer.h>
>
> #include <asm/tlbflush.h>
> #include "internal.h"
> @@ -1386,7 +1387,7 @@ static void ksm_do_scan(unsigned int sca
> struct rmap_item *rmap_item;
> struct page *uninitialized_var(page);
>
> - while (scan_npages--) {
> + while (scan_npages-- && likely(!freezing(current))) {
> cond_resched();
> rmap_item = scan_get_next_rmap_item(&page);
> if (!rmap_item)
> @@ -1412,6 +1413,8 @@ static int ksm_scan_thread(void *nothing
> ksm_do_scan(ksm_thread_pages_to_scan);
> mutex_unlock(&ksm_thread_mutex);
>
> + try_to_freeze();
> +
> if (ksmd_should_run()) {
> schedule_timeout_interruptible(
> msecs_to_jiffies(ksm_thread_sleep_millisecs));
> @@ -1953,6 +1956,7 @@ static int __init ksm_init(void)
> struct task_struct *ksm_thread;
> int err;
>
> + set_freezable();
> err = ksm_slab_init();
> if (err)
> goto out;
>
>

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: KAMEZAWA Hiroyuki on
On Thu, 29 Jul 2010 11:44:31 -0700 (PDT)
Hugh Dickins <hughd(a)google.com> wrote:

> So, in between snapshotting the image and actually hibernating, we have
> two parallel universes, one frozen in the image, the other writing that
> out to swap: with the danger that the latter (which is about to die)
> will introduce fatal inconsistencies in the former by placing pages in
> swap locations innocently reallocated from it. (Excuse me while I go
> write the movie script.)
>
> I'd never got to think about that before.
>
> Your fix is neat though hacky (it's somewhat of a coincidence that the
> usage arg happens to distinguish the hibernation case), and should be
> enough to fix "your" regression, but is it really enough?
>
yes, it's hacky and this just hides a problem caused by the patch.

> I'm worrying about the try_to_free_swap() calls in shrink_page_list():
> can't those get called from direct reclaim (even if GFP_NOIO), and can't
> direct reclaim get invoked from somewhere in the I/O path below
> swap_writepage(), used for writing out the hibernation image?
>
IIUC, before hibernation, shirnk_memory() is called and hibernation codes
expect there are enough memory. But you're right, there are no guarantee.

I think the best way is kexec(). But maybe rollback from hibernation failure
will be difficult. Considering how crash-dump works well and under maintainance
by many enterprise guys, hibernation-by-kexec is a choice. I think. It can make
reuse of kdump code, ...or, hibernation-resume code can eat kdump image
directly. Maybe the problem will be the speed of dump.

Otherwise, we need to take care of status changes in already-saved-memory.
Maybe anyone don't like to add COW method...

Thanks,
-Kame

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/