From: KAMEZAWA Hiroyuki on
On Tue, 13 Jul 2010 17:06:56 +0900
Minchan Kim <minchan.kim(a)gmail.com> wrote:

> On Tue, Jul 13, 2010 at 3:40 PM, KAMEZAWA Hiroyuki
> <kamezawa.hiroyu(a)jp.fujitsu.com> wrote:
> > On Tue, 13 Jul 2010 15:04:00 +0900
> > Minchan Kim <minchan.kim(a)gmail.com> wrote:
> >
> >> >> >  2. This can't help for a case where a section has multiple small holes.
> >> >>
> >> >> I agree. But this (a not-filled-section problem, not a punched hole)
> >> >> isn't such a case. But it would be better to handle them altogether. :)
> >> >>
> >> >> >
> >> >> > Then, my proposal for HOLES_IN_MEMMAP sparsemem is below.
> >> >> > ==
> >> >> > Some architectures unmap memmap[] for memory holes even with SPARSEMEM.
> >> >> > To handle that, pfn_valid() should check whether there really is a memmap or not.
> >> >> > For that purpose, __get_user() can be used.
> >> >>
> >> >> Look at free_unused_memmap. We don't unmap the ptes of hole memmap.
> >> >> Is __get_user still effective?
> >> >>
> >> > __get_user() works through the TLB and page table, so it tells whether the vaddr is really mapped or not.
> >> > If the access would SEGV, __get_user() returns -EFAULT. It works at page granularity.
> >>
> >> I mean the following.
> >> For example, there is a struct page at 0x20000000.
> >>
> >> int pfn_valid_mapped(unsigned long pfn)
> >> {
> >>        struct page *page = pfn_to_page(pfn); /* hole page is 0x20000000 */
> >>        char *lastbyte = (char *)(page+1)-1;  /* lastbyte is 0x2000001f */
> >>        char byte;
> >>
> >>        /* We pass this test since free_unused_memmap doesn't unmap pte */
> >>        if (__get_user(byte, (char *)page) != 0)
> >>                return 0;
> >>        if (__get_user(byte, lastbyte) != 0)
> >>                return 0;
> >>        return 1;
> >> }
> >
> > Why? When the page size is 4096 bytes:
> >
> >      0x1ffff000 - 0x1fffffff
> >      0x20000000 - 0x20000fff are each a single page. And memory is mapped per page.
>
> sizeof(struct page) is 32 bytes.
> So lastbyte is the address of the struct page + 32 bytes - 1.
>
> > What we access by above __get_user() is a byte at [0x20000000, 0x20000001)
>
> Right.
>
> > and it's unmapped if 0x20000000 is unmapped.
>
> free_unused_memmap doesn't unmap the pte although it returns the page to
> the buddy free list.
>
OK, I understood. Please see my latest mail and ignore all the others.

-kame

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Mel Gorman on
On Tue, Jul 13, 2010 at 12:53:48AM +0900, Minchan Kim wrote:
> Kukjin, Could you test below patch?
> I don't have any sparsemem system. Sorry.
>
> -- CUT DOWN HERE --
>
> Kukjin reported an oops that happens while changing min_free_kbytes:
> http://www.spinics.net/lists/arm-kernel/msg92894.html
> It happens because of the memory map layout on sparsemem.
>
> The system has the following memory map:
> section 0              section 1              section 2
> 0x20000000-0x25000000  0x40000000-0x50000000  0x50000000-0x58000000
> SECTION_SIZE_BITS is 28 (256M sections).
>
> It means section 0 is an incompletely filled section.
> Nonetheless, the current pfn_valid of sparsemem checks pfn loosely.
>
> It checks only mem_section's validation.

Note that this is meant to be fine as far as sparsemem is concerned. This is
how it functions by design - if a section is valid, all pages within that
section are valid. Hence, I consider the comment "sparsemem checks pfn loosely"
unfair as it's as strong as required until someone breaks the model.

It's ARM that is punching holes here and as the hole is at the beginning
of a section, the zone boundaries could have been adjusted after the holes
were punched.

Anyway...

> So in the above case, the pfn of 0x25000000 can pass pfn_valid's validation check.
> It's not what we want.
>
> The following patch adds a valid pfn range check to pfn_valid of sparsemem.
>
> Signed-off-by: Minchan Kim <minchan.kim(a)gmail.com>
> Reported-by: Kukjin Kim <kgene.kim(a)samsung.com>
>
> P.S)
> It is just an RFC. If we agree with this, I will make the patch against mmotm.
>
> --
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index b4d109e..6c2147a 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -979,6 +979,8 @@ struct mem_section {
>  	struct page_cgroup *page_cgroup;
>  	unsigned long pad;
>  #endif
> +	unsigned long start_pfn;
> +	unsigned long end_pfn;
>  };

ouch, 8 to 16 bytes increase on a commonly-used structure. Minimally, I
would only expect this to be defined for
CONFIG_ARCH_HAS_HOLES_MEMORYMODEL. Similarly for the rest of the patch,
I'd expect the "strong" pfn_valid checks to only exist on ARM because
nothing else needs it.

>
>  #ifdef CONFIG_SPARSEMEM_EXTREME
> @@ -1039,6 +1041,12 @@ static inline int valid_section(struct mem_section *section)
>  	return (section && (section->section_mem_map & SECTION_HAS_MEM_MAP));
>  }
>
> +static inline int valid_section_pfn(struct mem_section *section, unsigned long pfn)
> +{
> +	return ((section && (section->section_mem_map & SECTION_HAS_MEM_MAP)) &&
> +		(section->start_pfn <= pfn && pfn < section->end_pfn));
> +}
> +
>  static inline int valid_section_nr(unsigned long nr)
>  {
>  	return valid_section(__nr_to_section(nr));
> @@ -1053,7 +1061,7 @@ static inline int pfn_valid(unsigned long pfn)
>  {
>  	if (pfn_to_section_nr(pfn) >= NR_MEM_SECTIONS)
>  		return 0;
> -	return valid_section(__nr_to_section(pfn_to_section_nr(pfn)));
> +	return valid_section_pfn(__nr_to_section(pfn_to_section_nr(pfn)), pfn);
>  }
>
> diff --git a/mm/sparse.c b/mm/sparse.c
> index 95ac219..bde9090 100644
> --- a/mm/sparse.c
> +++ b/mm/sparse.c
> @@ -195,6 +195,8 @@ void __init memory_present(int nid, unsigned long start, unsigned long end)
>  		if (!ms->section_mem_map)
>  			ms->section_mem_map = sparse_encode_early_nid(nid) |
>  							SECTION_MARKED_PRESENT;
> +		ms->start_pfn = start;
> +		ms->end_pfn = end;
>  	}
>  }
>

I'd also expect the patch to then delete memmap_valid_within and replace
it with a pfn_valid_within() check (which in turn should call the strong
version of pfn_valid()).

I prefer Kamezawa's suggestion of mapping on a ZERO_PAGE-like page full
of PageReserved struct pages because it would have better performance
and be more in line with maintaining the assumptions of the memory
model. If we go in this direction, I would strongly prefer it was an
ARM-only thing.

--
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
From: Johannes Weiner on
On Tue, Jul 13, 2010 at 12:53:48AM +0900, Minchan Kim wrote:
> Kukjin, Could you test below patch?
> I don't have any sparsemem system. Sorry.
>
> -- CUT DOWN HERE --
>
> Kukjin reported an oops that happens while changing min_free_kbytes:
> http://www.spinics.net/lists/arm-kernel/msg92894.html
> It happens because of the memory map layout on sparsemem.
>
> The system has the following memory map:
> section 0              section 1              section 2
> 0x20000000-0x25000000  0x40000000-0x50000000  0x50000000-0x58000000
> SECTION_SIZE_BITS is 28 (256M sections).
>
> It means section 0 is an incompletely filled section.
> Nonetheless, the current pfn_valid of sparsemem checks pfn loosely.
>
> It checks only mem_section's validation.
> So in the above case, the pfn of 0x25000000 can pass pfn_valid's validation check.
> It's not what we want.
>
> The following patch adds a valid pfn range check to pfn_valid of sparsemem.

Look at the declaration of struct mem_section for a second. It is
meant to partition address space uniformly into backed and unbacked
areas.

It comes with implicit size and offset information by means of
SECTION_SIZE_BITS and the section's index in the section array.

Now you are not okay with the _granularity_ but propose to change _the
model_ by introducing a subsection within each section and at the same
time make the concept of a section completely meaningless: its size
becomes arbitrary and its associated mem_map and flags will apply to
the subsection only.

My question is: if the sections are not fine-grained enough, why not
just make them finer?

The biggest possible section size to describe the memory population on
this machine accurately is 16M. Why not set SECTION_SIZE_BITS to 24?
From: Russell King - ARM Linux on
On Tue, Jul 13, 2010 at 10:37:00AM +0100, Mel Gorman wrote:
> I prefer Kamezawa's suggestion of mapping on a ZERO_PAGE-like page full
> of PageReserved struct pages because it would have better performance
> and be more in line with maintaining the assumptions of the memory
> model. If we go in this direction, I would strongly prefer it was an
> ARM-only thing.

As I've said, this is not possible without doing some serious page
manipulation.

Plus the pages that were there become unusable as they don't correspond
with a PFN or obey phys_to_virt(). So there's absolutely no point to
this.

Now, why do we free the holes in the mem_map - because these holes can
be extremely large. Every 512K of hole equates to one page of mem_map
array. Balance that against memory placed at 0xc0000000 physical on
some platforms, and with PHYSMEM_BITS at 32 and SECTION_SIZE_BITS at
19 - well, you do the maths. The result is certainly not pretty.
From: Mel Gorman on
On Tue, Jul 13, 2010 at 10:46:12AM +0100, Russell King - ARM Linux wrote:
> On Tue, Jul 13, 2010 at 10:37:00AM +0100, Mel Gorman wrote:
> > I prefer Kamezawa's suggestion of mapping on a ZERO_PAGE-like page full
> > of PageReserved struct pages because it would have better performance
> > and be more in line with maintaining the assumptions of the memory
> > model. If we go in this direction, I would strongly prefer it was an
> > ARM-only thing.
>
> As I've said, this is not possible without doing some serious page
> manipulation.
>

Yep, x86 used to do something like this for discontig. It wasn't pretty.

> Plus the pages that where there become unusable as they don't correspond
> with a PFN or obey phys_to_virt(). So there's absolutely no point to
> this.
>
> Now, why do we free the holes in the mem_map - because these holes can
> be extremely large. Every 512K of hole equates to one page of mem_map
> array.

Sure, the holes might be large but at least they are contiguous. Is
there ever a case where you have

512K_Valid 512K_Hole 512K_Valid 512K_Hole

or is it typically

512K_hole 512K_hole ...... 512K_Valid 512K_Valid etc

If holes are typically contiguous, memmap is not allocated in the first place
and the savings from punching holes in memmap in the latter configuration
are minimal.

I recognise if you have a 2M section with a hole in it, you are
potentially wasting 3 pages on unused memmap. If this is really a problem,
Minchan's modification to sparsemem to increase the size of mem_section on
CONFIG_ARCH_HAS_HOLES_MEMORYMODEL is a messy option. I say messy because
it only works if the hole is on either end of the section and it's adding
quirks to the memory model.

> Balance that against memory placed at 0xc0000000 physical on
> some platforms, and with PHYSMEM_BITS at 32 and SECTION_SIZE_BITS at
> 19 - well, you do the maths. The result is certainly not pretty.
>

--
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab