OOM killer, page fault [Kernel]

From: Minchan Kim on 2 Nov 2009 18:40

Hi, Hugh.

On Mon, 2 Nov 2009 16:26:56 +0000 (GMT)
Hugh Dickins <hugh.dickins(a)tiscali.co.uk> wrote:

> On Mon, 2 Nov 2009, Minchan Kim wrote:
> > On Mon, 2 Nov 2009 14:02:16 +0900
> > KAMEZAWA Hiroyuki <kamezawa.hiroyu(a)jp.fujitsu.com> wrote:
> > >
> > > Maybe some code returns VM_FAULT_OOM by mistake and pagefault_oom_killer()
> > > is called. digging mm/memory.c is necessary...
> > >
> > > I wonder why...now is this code
> > > ===
> > > static int do_nonlinear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> > >                 unsigned long address, pte_t *page_table, pmd_t *pmd,
> > >                 unsigned int flags, pte_t orig_pte)
> > > {
> > >         pgoff_t pgoff;
> > >
> > >         flags |= FAULT_FLAG_NONLINEAR;
> > >
> > >         if (!pte_unmap_same(mm, pmd, page_table, orig_pte))
> > >                 return 0;
> > >
> > >         if (unlikely(!(vma->vm_flags & VM_NONLINEAR))) {
> > >                 /*
> > >                  * Page table corrupted: show pte and kill process.
> > >                  */
> > >                 print_bad_pte(vma, address, orig_pte, NULL);
> > >                 return VM_FAULT_OOM;
> > >         }
> > >
> > >         pgoff = pte_to_pgoff(orig_pte);
> > >         return __do_fault(mm, vma, address, pmd, pgoff, flags, orig_pte);
> > > }
> > > ==
> > > Then, OOM...is this really OOM ?
> >
> > It seems the goal is to kill the process with an OOM trick, as the comment says.
> >
> > I found that it comes from Hugh's commit 65500d234e74fc4e8f18e1a429bc24e51e75de4a.
> > I think it's not a real OOM.
> >
> > BTW, if it is the culprit in this case, print_bad_pte should have left something in the log. :)
>
> Yes, the chances are that this is not related to Norbert's problem.
> But thank you for reminding me of that not-very-nice hack of mine.
>
> It was kind-of valid at the time that I wrote it (2.6.15), when
> VM_FAULT_OOM did kill the faulting process. But since then the fault
> path has rightly been changed (in x86 at least, I didn't check the rest)
> to let the OOM killer decide who to kill: so now there's a danger that
> a pagetable corruption there will instead kill some unrelated process.
>
> Being lazy, I'm inclined simply to change that to VM_FAULT_SIGBUS now:
> which doesn't actually guarantee that the process will be killed, but
> should be better than just repeatedly re-faulting on the entry. (I
> don't much want to SIGKILL current since mm might not be current's.)
>
> That aberrant use of VM_FAULT_OOM has recently been copied into
> do_swap_page() (the first instance; the second instance is right -
> hmm, well, the second instance is normally right, but I guess it
> also covers pagetable corruption cases which we can't distinguish
> there; oh well) and should be corrected there too.
>
> Does VM_FAULT_SIGBUS sound good enough to you?

I am okay with that.
First of all, we have to prevent the killing of innocent processes.
Second, although it returns SIGBUS, we can distinguish it from a normal SIGBUS
by the bad pte log.
Third, we want to avoid adding a new VM_FAULT_XXX if at all possible. :)

>
> Hugh

--
Kind regards,
Minchan Kim
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
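
The difference between the two return codes discussed above can be illustrated
with a small userspace sketch. This is not kernel code: the MOCK_* values,
dispatch_fault() and bad_pte_result() are hypothetical stand-ins, assuming only
what the thread says, namely that VM_FAULT_OOM now hands the choice of victim
to the OOM killer while VM_FAULT_SIGBUS signals the task that took the fault.

```c
/* Mock fault codes; the numeric values are placeholders for this sketch. */
enum { MOCK_FAULT_OOM = 0x1, MOCK_FAULT_SIGBUS = 0x2 };

enum outcome {
    OUTCOME_NONE,
    OUTCOME_OOM_KILLER_PICKS_VICTIM,   /* victim may be an unrelated process */
    OUTCOME_SIGBUS_TO_FAULTER          /* only the faulting task is signalled */
};

/* Hypothetical stand-in for the arch fault handler's error dispatch. */
enum outcome dispatch_fault(int ret)
{
    if (ret & MOCK_FAULT_OOM)
        return OUTCOME_OOM_KILLER_PICKS_VICTIM;
    if (ret & MOCK_FAULT_SIGBUS)
        return OUTCOME_SIGBUS_TO_FAULTER;
    return OUTCOME_NONE;
}

/*
 * Stand-in for the "page table corrupted" branch of do_nonlinear_fault():
 * use_sigbus == 0 models the code as quoted, use_sigbus == 1 models the
 * change proposed in the thread.
 */
int bad_pte_result(int use_sigbus)
{
    return use_sigbus ? MOCK_FAULT_SIGBUS : MOCK_FAULT_OOM;
}
```

With use_sigbus set, dispatch_fault(bad_pte_result(1)) can only ever affect the
task that took the fault, which is the property asked for above.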

From: Norbert Preining on 5 Nov 2009 12:40

Hi Kim,

> > sorry for the late reply. I have two pieces of news, one good and one bad: The good
> > being that I can reproduce the bug by running VirtualBox with some W7
>
> W7 means "Windows 7"?

Yes, sorry for the shorthand.

> > I know it sounds completely crazy, the patch only does harmless things
> > afais. But I tried it. Several times. rc6+patch never did boot, while
> > rc5 without patch did boot. Then I patched it into -rc5, recompiled, and
> > boom, no boot. booting into .31.5, recompiling rc6 and rc5 without
> > that patch and suddenly rc6 boots (and I am sure rc5, too).
>
> Hmm. That's beyond my knowledge.
> Maybe it's because of the WARN_ON?
> Could you try it again with the WARN_ON omitted?

Will do that.

> > Ah yes, I can reproduce the original strange bug with oom killer!
>
> Sounds good to me.
> Could you tell me your test scenario, your system info(CPU, RAM) and
> config?
> I want to reproduce it on my machine so as not to bother you. :)

Puhh, well, I meant "I could reproduce it", not "I have a clear
idea which steps to take to reproduce it" ;-) Well, here is what I can
tell you:
actual hardware:
Intel(R) Core(TM)2 Duo CPU P9500
Memory 2G
Config of my kernel attached.

Virtual Machine (VirtualBox, not the OSE variant, I need USB 2.0 support
for GPS stuff):
VirtualBox 3.0.10
memory for the machine: 1G (50%)
ACPI and IO/APIC turned on
1 processor with PAE/NX
VT-x and Nested Paging activated
Display 128M
(need more details?)

I will remove the WARN_ON, reboot, and see if that works. If it does, I will
try to recreate the problem.

Best wishes

Norbert

-------------------------------------------------------------------------------
Dr. Norbert Preining Associate Professor
JAIST Japan Advanced Institute of Science and Technology preining(a)jaist.ac.jp
Vienna University of Technology preining(a)logic.at
Debian Developer (Debian TeX Task Force) preining(a)debian.org
gpg DSA: 0x09C5B094 fp: 14DF 2E6C 0307 BE6D AD76 A9C0 D2BF 4AA3 09C5 B094
-------------------------------------------------------------------------------
BROMSGROVE
Any urban environment containing a small amount of dogturd and about
forty-five tons of bent steel pylon or a lump of concrete with holes
claiming to be sculpture. 'Oh, come my dear, and come with me. And
wander 'neath the bromsgrove tree' - Betjeman.
--- Douglas Adams, The Meaning of Liff

From: Norbert Preining on 5 Nov 2009 12:40

On Thu, 05 Nov 2009, preining wrote:
> > Hmm. That's beyond my knowledge.
> > Maybe it's because of the WARN_ON?
> > Could you try it again with the WARN_ON omitted?
>
> Will do that.

No change, still hangs. But at least I see now that it is not hanging
at an arbitrary position, but it does not start the init process. It
stops right before the "Calling init" message or similar.

BTW, this time the config is really attached.

Best wishes

Norbert

-------------------------------------------------------------------------------
Dr. Norbert Preining Associate Professor
JAIST Japan Advanced Institute of Science and Technology preining(a)jaist.ac.jp
Vienna University of Technology preining(a)logic.at
Debian Developer (Debian TeX Task Force) preining(a)debian.org
gpg DSA: 0x09C5B094 fp: 14DF 2E6C 0307 BE6D AD76 A9C0 D2BF 4AA3 09C5 B094
-------------------------------------------------------------------------------
WEEM (n.)
The tools with which a dentist can inflict the greatest
pain. Formerly, which tool this was dependent upon the imagination and
skill of the individual dentist, though now, with technological
advances, weems can be bought specially.
--- Douglas Adams, The Meaning of Liff

From: Minchan Kim on 5 Nov 2009 12:40

Hi.

On Thu, Nov 5, 2009 at 10:21 PM, Norbert Preining <preining(a)logic.at> wrote:
> Hi Kim, hi all,
>
> (still please Cc)
>
> sorry for the late reply. I have two pieces of news, one good and one bad: The good
> being that I can reproduce the bug by running VirtualBox with some W7

W7 means "Windows 7"?

> within. Anyway, I don't have a trace or better debug due to the bad news:
> Both 2.6.32-rc5 and 2.6.32-rc6 do *not* boot with the patch below.
> Don't ask me why, please, and I don't have a serial/net console so that
> I can tell you more, but the booting hangs badly at:
> [    6.657492] usb 4-1: Product: Globetrotter HSDPA Modem
> [    6.657494] usb 4-1: Manufacturer: Option N.V.
> [    6.657496] usb 4-1: SerialNumber: Serial Number
> [    6.657558] usb 4-1: configuration #1 chosen from 1 choice
> [    6.837364] input: PS/2 Mouse as /devices/platform/i8042/serio2/input/input6
> [    6.853693] input: AlpsPS/2 ALPS GlidePoint as /devices/platform/i8042/serio2/input/input7
>
> Normally it continues like that, but with the patch below it hangs here
> and does not continue. I need to Sysrq-s/u/b out of it.
>
> [    6.904119] usb 8-2: new full speed USB device using uhci_hcd and address 2
> [    7.075524] usb 8-2: New USB device found, idVendor=044e, idProduct=3017
>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 7e91b5f..47e4b15 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -2713,7 +2713,11 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>>         vmf.page = NULL;
>>
>>         ret = vma->vm_ops->fault(vma, &vmf);
>> -       if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE)))
>> +       if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE))) {
>> +               printk(KERN_DEBUG "vma->vm_ops->fault : 0x%lx\n", vma->vm_ops->fault);
>> +               WARN_ON(1);
>> +
>> +       }
>>                 return ret;
>>
>>         if (unlikely(PageHWPoison(vmf.page))) {
>
> I know it sounds completely crazy, the patch only does harmless things
> afais. But I tried it. Several times. rc6+patch never did boot, while
> rc5 without patch did boot. Then I patched it into -rc5, recompiled, and
> boom, no boot. booting into .31.5, recompiling rc6 and rc5 without
> that patch and suddenly rc6 boots (and I am sure rc5, too).

Hmm. That's beyond my knowledge.
Maybe it's because of the WARN_ON?
Could you try it again with the WARN_ON omitted?

>
> Sorry that I cannot give more infos, please let me know what else I can
> do.

Thanks for your time :)

> Ah yes, I can reproduce the original strange bug with oom killer!

Sounds good to me.
Could you tell me your test scenario, your system info(CPU, RAM) and
config?
I want to reproduce it on my machine so as not to bother you. :)

>
> Best wishes
>
> Norbert
>

--
Kind regards,
Minchan Kim

From: Norbert Preining on 5 Nov 2009 12:40

Hi Kim, hi all,

(still please Cc)

sorry for the late reply. I have two pieces of news, one good and one bad: The good
being that I can reproduce the bug by running VirtualBox with some W7
within. Anyway, I don't have a trace or better debug due to the bad news:
Both 2.6.32-rc5 and 2.6.32-rc6 do *not* boot with the patch below.
Don't ask me why, please, and I don't have a serial/net console so that
I can tell you more, but the booting hangs badly at:
[ 6.657492] usb 4-1: Product: Globetrotter HSDPA Modem
[ 6.657494] usb 4-1: Manufacturer: Option N.V.
[ 6.657496] usb 4-1: SerialNumber: Serial Number
[ 6.657558] usb 4-1: configuration #1 chosen from 1 choice
[ 6.837364] input: PS/2 Mouse as /devices/platform/i8042/serio2/input/input6
[ 6.853693] input: AlpsPS/2 ALPS GlidePoint as /devices/platform/i8042/serio2/input/input7

Normally it continues like that, but with the patch below it hangs here
and does not continue. I need to Sysrq-s/u/b out of it.

[ 6.904119] usb 8-2: new full speed USB device using uhci_hcd and address 2
[ 7.075524] usb 8-2: New USB device found, idVendor=044e, idProduct=3017

> diff --git a/mm/memory.c b/mm/memory.c
> index 7e91b5f..47e4b15 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2713,7 +2713,11 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>         vmf.page = NULL;
>
>         ret = vma->vm_ops->fault(vma, &vmf);
> -       if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE)))
> +       if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE))) {
> +               printk(KERN_DEBUG "vma->vm_ops->fault : 0x%lx\n", vma->vm_ops->fault);
> +               WARN_ON(1);
> +
> +       }
>                 return ret;
>
>         if (unlikely(PageHWPoison(vmf.page))) {

I know it sounds completely crazy, the patch only does harmless things
afais. But I tried it. Several times. rc6+patch never did boot, while
rc5 without patch did boot. Then I patched it into -rc5, recompiled, and
boom, no boot. booting into .31.5, recompiling rc6 and rc5 without
that patch and suddenly rc6 boots (and I am sure rc5, too).

Sorry that I cannot give more info, please let me know what else I can
do.

Ah yes, I can reproduce the original strange bug with oom killer!

Best wishes

Norbert

-------------------------------------------------------------------------------
Dr. Norbert Preining Associate Professor
JAIST Japan Advanced Institute of Science and Technology preining(a)jaist.ac.jp
Vienna University of Technology preining(a)logic.at
Debian Developer (Debian TeX Task Force) preining(a)debian.org
gpg DSA: 0x09C5B094 fp: 14DF 2E6C 0307 BE6D AD76 A9C0 D2BF 4AA3 09C5 B094
-------------------------------------------------------------------------------
MELTON CONSTABLE (n.)
A patent anti-wrinkle cream which policemen wear to keep themselves
looking young.
--- Douglas Adams, The Meaning of Liff
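
Reading the diff as quoted above, the added closing brace sits before the
unchanged "return ret;" line, so that return appears to be no longer guarded by
the error check. The userspace sketch below mirrors that control flow with
hypothetical names (MOCK_FAULT_ERROR, fault_path_as_posted(),
fault_path_braced()); it is an illustration under that reading, not the kernel
function, and whether this is what caused the hang is not confirmed in the
thread.

```c
#include <stdio.h>

/* Placeholder for (VM_FAULT_ERROR | VM_FAULT_NOPAGE); value is made up. */
#define MOCK_FAULT_ERROR 0x1

/* Control flow as it appears in the posted debug patch. */
int fault_path_as_posted(int ret, int *page_mapped)
{
    if (ret & MOCK_FAULT_ERROR) {
        fprintf(stderr, "fault returned 0x%x\n", (unsigned)ret);
    }
    return ret;             /* outside the braces: taken on every call */

    *page_mapped = 1;       /* never reached: stands in for mapping the page */
    return ret;
}

/* Same debug output, with the return kept inside the braces. */
int fault_path_braced(int ret, int *page_mapped)
{
    if (ret & MOCK_FAULT_ERROR) {
        fprintf(stderr, "fault returned 0x%x\n", (unsigned)ret);
        return ret;         /* only error results leave early */
    }

    *page_mapped = 1;       /* successful faults continue to the mapping step */
    return ret;
}
```

In the as-posted shape a successful fault never reaches the mapping step, so
the same address would fault again, which would be consistent with a hang
before init starts. Separately, 0x%lx expects an unsigned long, so printing the
function pointer would need a cast, or the %p conversion instead.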
