From: Peter Zijlstra on
On Fri, 2010-04-16 at 23:45 +0800, r6144 wrote:
> Hello all,
>
> I'm having an annoying kernel bug regarding huge pages in Fedora 12:
>
> https://bugzilla.redhat.com/show_bug.cgi?id=552257
>
> Basically I want to use huge pages in a multithreaded number crunching
> program, which happens to use process-shared semaphores (because fftw
> does it). The futex for the semaphore ends up lying on a huge page, and
> I then get an endless loop in get_futex_key(), apparently because the
> anonymous huge page containing the futex does not have a page->mapping.
> A test case is provided in the above link.
>
> I reported the bug to Fedora bugzilla months ago, but haven't received
> any feedback yet.

No, it works much better if you simply mail LKML and CC people who work
on the code in question ;-)

> The Fedora kernel is based on 2.6.32.11, and a
> cursory glance at the 2.6.34-rc3 source does not yield any relevant
> change.
>
> So, could anyone tell me if the current mainline kernel might act better
> in this respect, before I get around to compiling it?

Right, so I had a quick chat with Mel, and it appears MAP_PRIVATE
hugetlb pages don't have their page->mapping set.

I guess something like the below might work, but I'd really rather not
add hugetlb knowledge to futex.c. Does anybody else have a better idea?
Maybe create something similar to an anon_vma for hugetlb pages?

---
kernel/futex.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index e7a35f1..b0f1b2d 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -252,7 +252,7 @@ again:

page = compound_head(page);
lock_page(page);
- if (!page->mapping) {
+ if (!page->mapping && !PageHuge(page)) {
unlock_page(page);
put_page(page);
goto again;
@@ -265,7 +265,7 @@ again:
* it's a read-only handle, it's expected that futexes attach to
* the object not the particular process.
*/
- if (PageAnon(page)) {
+ if (PageAnon(page) || (PageHuge(page) && !page->mapping)) {
key->both.offset |= FUT_OFF_MMSHARED; /* ref taken on mm */
key->private.mm = mm;
key->private.address = address;


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Peter Zijlstra on
On Mon, 2010-04-19 at 12:43 +0100, Mel Gorman wrote:

> > Right, so I had a quick chat with Mel, and it appears MAP_PRIVATE
> > hugetlb pages don't have their page->mapping set.
> >
> > I guess something like the below might work, but I'd really rather not
> > add hugetlb knowledge to futex.c. Does anybody else have a better idea?
> > Maybe create something similar to an anon_vma for hugetlb pages?
> >
>
> anon_vma for hugetlb pages sounds overkill, what would it gain? In this
> context, futex only appears to distinguish between whether the
> references are private or shared.
>
> Looking at the hugetlbfs code, I can't see a place where it actually cares
> about the mapping as such. It's used to find shared pages in the page cache
> (but not in the LRU) that are backed by the hugetlbfs file. For hugetlbfs
> though, the mapping is mostly kept in page->private for reservation accounting
> purposes.
>
> I can't think of other parts of the VM that touch the mapping if the
> page is managed by hugetlbfs so the following patch should also work but
> without futex having hugetlbfs-awareness. What do you think? Maybe for
> safety, it would be better to make the mapping some obvious poison bytes
> or'd with PAGE_MAPPING_ANON so an oops will be more obvious?

Yes, this seems perfectly adequate to me, that poison idea might be
worthwhile too :-)

Acked-by: Peter Zijlstra <a.p.zijlstra(a)chello.nl>

> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 6034dc9..57a5faa 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -546,6 +546,7 @@ static void free_huge_page(struct page *page)
>
> mapping = (struct address_space *) page_private(page);
> set_page_private(page, 0);
> + page->mapping = NULL;
> BUG_ON(page_count(page));
> INIT_LIST_HEAD(&page->lru);
>
> @@ -2447,8 +2448,10 @@ retry:
> spin_lock(&inode->i_lock);
> inode->i_blocks += blocks_per_huge_page(h);
> spin_unlock(&inode->i_lock);
> - } else
> + } else {
> lock_page(page);
> + page->mapping = (struct address_space *)PAGE_MAPPING_ANON;
> + }
> }
>
> /*



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Andrea Arcangeli on
On Mon, Apr 19, 2010 at 04:32:45PM +0100, Mel Gorman wrote:
> +#define HUGETLB_PRIVATE_MAPPING (0x2e4 | PAGE_MAPPING_ANON)

Cute indeed.

BTW, just in case I tested this on transparent hugepage and it works
fine (it uses no cpu and can be killed with C^c). I had to hack it
like below to allocate the semaphore on hugepages without
khugepaged. I verified 1 hugepage is allocated (thanks to memory
compaction there's an huge excess of hugepages compared to what my
regular apps can eat ;). Furthermore I found gcc bypasses malloc so a
small patch to gcc should move it all into hugepages. Then maybe we
can build the kernel faster and definitely translate.o will build 8%
faster with the default khugepaged scan_sleep_millisecs settings
(waiting to be confirmed but I exclude bad surprises, whatever runs
fast with khugepaged has will run even faster without it if something,
or equal in the worst case).

Thanks,
Andrea

--- process-shared-sem-hugepage.c.orig 2010-04-19 17:43:47.278964888 +0200
+++ process-shared-sem-hugepage.c 2010-04-19 17:44:01.100032774 +0200
@@ -30,6 +30,7 @@ int main(void)

g_thread_init(NULL);
workers = g_new(GThread *, NWORKER); work_sem = g_new(sem_t, 1);
+ posix_memalign(&work_sem, 2*1024*1024, 2*1024*1024);
result = sem_init(work_sem, TRUE, 0); g_assert(result == 0);

for (i = 0; i < NWORKER; i++) {

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Peter Zijlstra on
On Mon, 2010-04-19 at 16:32 +0100, Mel Gorman wrote:
> Fix infinite loop in get_futex_key when backed by huge pages
>
> If a futex key happens to be located within a huge page mapped MAP_PRIVATE,
> get_futex_key() can go into an infinite loop waiting for a page->mapping
> that will never exist. This was reported and documented in an external
> bugzilla at
>
> https://bugzilla.redhat.com/show_bug.cgi?id=552257
>
> This patch makes page->mapping a poisoned value that includes PAGE_MAPPING_ANON
> mapped MAP_PRIVATE. This is enough for futex to continue but because
> of PAGE_MAPPING_ANON, the poisoned value is not dereferenced or used by
> futex. No other part of the VM should be dereferencing the page->mapping of
> a hugetlbfs page as its page cache is not on the LRU.
>
> This patch fixes the problem with the test case described in the bugzilla.
>
> Signed-off-by: Mel Gorman <mel(a)csn.ul.ie>
> ---
> include/linux/poison.h | 10 ++++++++++
> mm/hugetlb.c | 6 +++++-
> 2 files changed, 15 insertions(+), 1 deletions(-)
>
> diff --git a/include/linux/poison.h b/include/linux/poison.h
> index 2110a81..0f7b5ac 100644
> --- a/include/linux/poison.h
> +++ b/include/linux/poison.h
> @@ -48,6 +48,16 @@
> #define POISON_FREE 0x6b /* for use-after-free poisoning */
> #define POISON_END 0xa5 /* end-byte of poisoning */
>
> +/********** mm/hugetlb.c **********/
> +/*
> + * Private mappings of hugetlb pages use this poisoned value for
> + * page->mapping. The core VM should not be doing anything with this mapping
> + * but futex requires the existance of some page->mapping value even if it
> + * is unused. If the core VM does deference the mapping, it'll look like a
> + * suspiciously high null-pointer offset starting from 0x2e5
> + */
> +#define HUGETLB_PRIVATE_MAPPING (0x2e4 | PAGE_MAPPING_ANON)

Wouldn't a longer poison be more recognisable? Also, shouldn't this use
POISON_POINTER_DELTA?

Something like:

#define HUGETBL_POISON ((void *) 0x00300300 + POISON_POINTER_DELTA)

0x2e5 isn't that high, I've had actual derefs in that range.

> +
> /********** arch/$ARCH/mm/init.c **********/
> #define POISON_FREE_INITMEM 0xcc
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 6034dc9..487e3c2 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -546,6 +546,7 @@ static void free_huge_page(struct page *page)
>
> mapping = (struct address_space *) page_private(page);
> set_page_private(page, 0);
> + page->mapping = NULL;
> BUG_ON(page_count(page));
> INIT_LIST_HEAD(&page->lru);
>
> @@ -2447,8 +2448,11 @@ retry:
> spin_lock(&inode->i_lock);
> inode->i_blocks += blocks_per_huge_page(h);
> spin_unlock(&inode->i_lock);
> - } else
> + } else {
> lock_page(page);
> + page->mapping = (struct address_space *)
> + HUGETLB_PRIVATE_MAPPING;
> + }
> }
>
> /*
>
>

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Peter Zijlstra on
On Mon, 2010-04-19 at 18:11 +0200, Andrea Arcangeli wrote:
> On Mon, Apr 19, 2010 at 05:45:05PM +0200, Peter Zijlstra wrote:
> > Wouldn't a longer poison be more recognisable? Also, shouldn't this use
> > POISON_POINTER_DELTA?
> >
> > Something like:
> >
> > #define HUGETBL_POISON ((void *) 0x00300300 + POISON_POINTER_DELTA)
> >
> > 0x2e5 isn't that high, I've had actual derefs in that range.
>
> The default at kernel config time sets only 4k as unmapped (I think
> it's a very bad default for 64bit archs), so above 4k userland can map
> it and you can have actual derefs with 0x00300300 but not with Mel's
> preferred <0x1000 address. So the address must be <0x1000.

Well, most poison values have that problem and still we have them. Also
on 64bit machines you can use POISON_POINTER_DELTA to map it outside the
virtual address range.


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/