From: Tony Luck on
> I've fixed it (by reversing the order of those lines) for tomorrow's
> linux-next.

Somewhere between next-20100602 and next-20100604
something changed that results in ia64 taking a
NULL-dereference oops in sys_init_module() ...

Freeing unused kernel memory: 1984kB freed
modprobe[1851]: NaT consumption 2216203124768 [1]
Modules linked in:

Pid: 1851, CPU 2, comm: modprobe
psr : 0000121008526030 ifs : 8000000000000794 ip : [<a0000001000f0f31>] Not tainted (2.6.35-rc1-generic-smp-next-20100604)
ip is at sys_init_module+0x131/0x420

At the point of dereference it looks like we were trying
to load a 4-byte data object from offset 552 into the
"struct module *" that was returned by load_module().

-Tony
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Linus Torvalds on


On Fri, 4 Jun 2010, Tony Luck wrote:
>
> At the point of dereference it looks like we were trying
> to load a 4-byte data object from offset 552 into the
> "struct module *" that was returned by load_module().

Sounds like 'mod->num_ctors' loaded by do_mod_ctors(). It's a 4-byte field
in roughly that area.

What does a NaT consumption fault mean, and does it give the invalid
address it was loaded off? In the successful path of "load_module()", we
will have dereferenced the "mod" pointer we return just before, so I
wonder if there's some error case that incorrectly returns a positive
errno instead of a negative one, and causes us to miss the "IS_ERR()"
check or something.

There's a couple of checking routines in module.c that do not return a
negative error, but instead return 0/1. The one I looked at was converted
into a negative error, but there are several cases of

	if (err)
		return ERR_PTR(err);

and if something does that on a 0/1 value, it will return a bogus pointer.

Linus
From: Luck, Tony on
> What does a NaT consumption fault mean, and does it give the invalid
> address it was loaded off?

This almost always means that we dereferenced a NULL pointer ... though
any access into the bottom PAGE_SIZE of kernel virtual address space
will result in this trap. This happens on ia64 because we have a "NaT"
page mapped at 0x0 so that speculative loads that chase NULL pointers
at the end of lists behave more rationally.

Sadly I don't have the actual address. The register that was used
for the dereference isn't included in the OOPS output.

-Tony
From: Linus Torvalds on


On Fri, 4 Jun 2010, Luck, Tony wrote:
>
> This almost always means that we dereferenced a NULL pointer ... though
> any access into the bottom PAGE_SIZE of kernel virtual address space
> will result in this trap. This happens on ia64 because we have a "NaT"
> page mapped at 0x0 so that speculative loads that chase NULL pointers
> at the end of lists behave more rationally.
>
> Sadly I don't have the actual address. The register that was used
> for the dereference isn't included in the OOPS output.

Ok, so it just confirms that load_module() has returned a pointer that is
either NULL or at least below PAGE_SIZE-552.

It could be a negative error pointer (and the offset of 552 turns it into
the NULL page), but that's what the whole IS_ERR() thing checks for, so
that's not the case.

So the

	if (err)
		return ERR_PTR(err);

case does seem pretty likely (most of them with a "goto <error-case>", but
some directly). Many of them have the stricter form of "if (err < 0)", but
there's a number that do not.

And in fact, I think I see the bad one:

	/* Figure out module layout, and allocate all the memory. */
	mod = layout_and_allocate(&info);
	if (IS_ERR(mod))
		goto free_copy;

which looks fine, but "free_copy:" expects the error number in "err",
which is where the other error cases put it.

I think this was introduced by Rusty's commit 5d3f5be82944 ("module:
layout_and_allocate"), and here's a suggested fix. The easiest fix is to
actually change the "free_copy" target to return "mod" as the above goto
expects, and then just do a conversion before the fall-through from the
other error cases (that have it in 'err').

Does this fix it? I stopped looking for other possible causes when I found
this one.

Linus

---
kernel/module.c | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/kernel/module.c b/kernel/module.c
index 69a3f12..9a0b275 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -2653,9 +2653,10 @@ static struct module *load_module(void __user *umod,
 	module_unload_free(mod);
 free_module:
 	module_deallocate(mod, &info);
+	mod = ERR_PTR(err);
 free_copy:
 	free_copy(&info);
-	return ERR_PTR(err);
+	return mod;
 }
 
 /* Call module constructors. */
From: Luck, Tony on
> Does this fix it? I stopped looking for other possible causes when I found
> this one.

It gets rid of the oops. So that's good. Something is still
hokey in linux-next land though because no modules get loaded.
So no ehci/uhci available :-(

No obvious looking error messages on the console.

-Tony