From: Wu Fengguang on
On Fri, Jan 08, 2010 at 11:32:07AM +0800, Zheng, Shaohui wrote:
> Resending the patch to the mailing list. The original patch URL is
> http://patchwork.kernel.org/patch/69075/; it was not accepted and got no comments,
> so I am sending it again for review.
>
> Memory-Hotplug: Fix the bug on interface /dev/mem for 64-bit kernel
>
> The newly added memory cannot be accessed through the /dev/mem interface, because
> we do not update the variable high_memory. This patch adds a new entry to the
> e820 table and updates max_pfn, max_low_pfn and high_memory.
>
> We add a function update_pfn in arch/x86/mm/init.c to update these
> variables. Memory hotplug does not make sense on a 32-bit kernel, so this
> function does not handle that case.
>
> Signed-off-by: Shaohui Zheng <shaohui.zheng(a)intel.com>
> CC: Andi Kleen <ak(a)linux.intel.com>
> CC: Wu Fengguang <fengguang.wu(a)intel.com>
> CC: Li Haicheng <Haicheng.li(a)intel.com>
>
> ---
> diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
> index f50447d..b986246 100644
> --- a/arch/x86/kernel/e820.c
> +++ b/arch/x86/kernel/e820.c
> @@ -110,8 +110,8 @@ int __init e820_all_mapped(u64 start, u64 end, unsigned type)
> /*
> * Add a memory region to the kernel e820 map.
> */
> -static void __init __e820_add_region(struct e820map *e820x, u64 start, u64 size,
> - int type)
> +static void __meminit __e820_add_region(struct e820map *e820x, u64 start,
> + u64 size, int type)
> {
> int x = e820x->nr_map;
>
> @@ -126,7 +126,7 @@ static void __init __e820_add_region(struct e820map *e820x, u64 start, u64 size,
> e820x->nr_map++;
> }
>
> -void __init e820_add_region(u64 start, u64 size, int type)
> +void __meminit e820_add_region(u64 start, u64 size, int type)
> {
> __e820_add_region(&e820, start, size, type);
> }
> diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
> index d406c52..0474459 100644
> --- a/arch/x86/mm/init.c
> +++ b/arch/x86/mm/init.c
> @@ -1,6 +1,7 @@
> #include <linux/initrd.h>
> #include <linux/ioport.h>
> #include <linux/swap.h>
> +#include <linux/bootmem.h>
>
> #include <asm/cacheflush.h>
> #include <asm/e820.h>
> @@ -386,3 +387,30 @@ void free_initrd_mem(unsigned long start, unsigned long end)
> free_init_pages("initrd memory", start, end);
> }
> #endif
> +
> +/**
> + * After memory hotplug, the variables max_pfn, max_low_pfn and high_memory are
> + * affected; they are updated in this function. Memory hotplug does not
> + * make sense on a 32-bit kernel, so this function does not handle that case.
> + */
> +void __meminit __attribute__((weak)) update_pfn(u64 start, u64 size)
> +{
> +#ifdef CONFIG_X86_64
> + unsigned long limit_low_pfn = 1UL<<(32 - PAGE_SHIFT);
> + unsigned long start_pfn = start >> PAGE_SHIFT;
> + unsigned long end_pfn = (start + size) >> PAGE_SHIFT;

Strictly speaking, this should use "end_pfn = PFN_UP(start + size);".
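
To illustrate the difference, here is a minimal userspace sketch, assuming 4 KiB pages. PFN_UP and PFN_DOWN mirror the arithmetic of the kernel macros of the same name; the two helper functions are hypothetical names used only for this illustration and are not part of the patch.

```c
#include <stdint.h>

/* Assumption for illustration: 4 KiB pages, as on x86. */
#define PAGE_SHIFT 12
#define PAGE_SIZE  (1ULL << PAGE_SHIFT)

/* Same arithmetic as the kernel's PFN_UP()/PFN_DOWN() helpers. */
#define PFN_UP(x)   (((x) + PAGE_SIZE - 1) >> PAGE_SHIFT)
#define PFN_DOWN(x) ((x) >> PAGE_SHIFT)

/* End pfn as computed in the patch: a trailing partial page is dropped. */
static uint64_t end_pfn_truncated(uint64_t start, uint64_t size)
{
	return PFN_DOWN(start + size);
}

/* End pfn as suggested in the review: a trailing partial page is counted. */
static uint64_t end_pfn_rounded_up(uint64_t start, uint64_t size)
{
	return PFN_UP(start + size);
}
```

For a page-aligned region the two helpers agree. They differ only when start + size is not page aligned: end_pfn_truncated(0, 4097) gives 1, while end_pfn_rounded_up(0, 4097) gives 2.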

> + if (end_pfn > max_pfn) {
> + max_pfn = end_pfn;
> + high_memory = (void *)__va(max_pfn * PAGE_SIZE - 1) + 1;
> + }
> +
> + /* if add to low memory, update max_low_pfn */
> + if (unlikely(start_pfn < limit_low_pfn)) {
> + if (end_pfn <= limit_low_pfn)
> + max_low_pfn = end_pfn;
> + else
> + max_low_pfn = limit_low_pfn;

x86_64 actually always sets max_low_pfn = max_pfn in setup_arch():

#ifdef CONFIG_X86_64
        if (max_pfn > max_low_pfn) {
                max_pfn_mapped = init_memory_mapping(1UL<<32,
                                                     max_pfn<<PAGE_SHIFT);
                /* can we preseve max_low_pfn ?*/
                max_low_pfn = max_pfn;
        }
#endif

max_low_pfn is used in

- e820_mark_nosave_regions(max_low_pfn);
- dump_pagetable()
- blk_queue_bounce_limit()
- increase_reservation()

and _seems_ to mean "end of directly addressable pfn".

> + }
> +#endif /* CONFIG_X86_64 */
> +}
> diff --git a/include/linux/bootmem.h b/include/linux/bootmem.h
> index b10ec49..6693414 100644
> --- a/include/linux/bootmem.h
> +++ b/include/linux/bootmem.h
> @@ -13,6 +13,7 @@
>
> extern unsigned long max_low_pfn;
> extern unsigned long min_low_pfn;
> +extern void update_pfn(u64 start, u64 size);
>
> /*
> * highest page
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 030ce8a..ee7b2d6 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -523,6 +523,14 @@ int __ref add_memory(int nid, u64 start, u64 size)
> BUG_ON(ret);
> }
>
> + /* update e820 table */

This comment can be eliminated - you already have the very readable printk :)

> + printk(KERN_INFO "Adding memory region to e820 table (start:%016Lx, size:%016Lx).\n",
> + (unsigned long long)start, (unsigned long long)size);
> + e820_add_region(start, size, E820_RAM);

> + /* update max_pfn, max_low_pfn and high_memory */
> + update_pfn(start, size);

How about renaming the function to update_end_of_memory_vars()?

Thanks,
Fengguang
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Zheng, Shaohui on
Thanks Fengguang, see my comments inline below. We only differ in our understanding of the variable max_low_pfn.

Thanks & Regards,
Shaohui


-----Original Message-----
From: Wu, Fengguang
Sent: Friday, January 08, 2010 8:49 PM
To: Zheng, Shaohui
Cc: linux-mm(a)kvack.org; akpm(a)linux-foundation.org; linux-kernel(a)vger.kernel.org; ak(a)linux.intel.com; y-goto(a)jp.fujitsu.com; Dave Hansen; x86(a)kernel.org; KAMEZAWA Hiroyuki
Subject: Re: [PATCH - resend] Memory-Hotplug: Fix the bug on interface /dev/mem for 64-bit kernel(v1)

On Fri, Jan 08, 2010 at 11:32:07AM +0800, Zheng, Shaohui wrote:
> Resending the patch to the mailing list. The original patch URL is
> http://patchwork.kernel.org/patch/69075/; it was not accepted and got no comments,
> so I am sending it again for review.
>
> Memory-Hotplug: Fix the bug on interface /dev/mem for 64-bit kernel
>
> The newly added memory cannot be accessed through the /dev/mem interface, because
> we do not update the variable high_memory. This patch adds a new entry to the
> e820 table and updates max_pfn, max_low_pfn and high_memory.
>
> We add a function update_pfn in arch/x86/mm/init.c to update these
> variables. Memory hotplug does not make sense on a 32-bit kernel, so this
> function does not handle that case.
>
> Signed-off-by: Shaohui Zheng <shaohui.zheng(a)intel.com>
> CC: Andi Kleen <ak(a)linux.intel.com>
> CC: Wu Fengguang <fengguang.wu(a)intel.com>
> CC: Li Haicheng <Haicheng.li(a)intel.com>
>
> ---
> diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
> index f50447d..b986246 100644
> --- a/arch/x86/kernel/e820.c
> +++ b/arch/x86/kernel/e820.c
> @@ -110,8 +110,8 @@ int __init e820_all_mapped(u64 start, u64 end, unsigned type)
> /*
> * Add a memory region to the kernel e820 map.
> */
> -static void __init __e820_add_region(struct e820map *e820x, u64 start, u64 size,
> - int type)
> +static void __meminit __e820_add_region(struct e820map *e820x, u64 start,
> + u64 size, int type)
> {
> int x = e820x->nr_map;
>
> @@ -126,7 +126,7 @@ static void __init __e820_add_region(struct e820map *e820x, u64 start, u64 size,
> e820x->nr_map++;
> }
>
> -void __init e820_add_region(u64 start, u64 size, int type)
> +void __meminit e820_add_region(u64 start, u64 size, int type)
> {
> __e820_add_region(&e820, start, size, type);
> }
> diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
> index d406c52..0474459 100644
> --- a/arch/x86/mm/init.c
> +++ b/arch/x86/mm/init.c
> @@ -1,6 +1,7 @@
> #include <linux/initrd.h>
> #include <linux/ioport.h>
> #include <linux/swap.h>
> +#include <linux/bootmem.h>
>
> #include <asm/cacheflush.h>
> #include <asm/e820.h>
> @@ -386,3 +387,30 @@ void free_initrd_mem(unsigned long start, unsigned long end)
> free_init_pages("initrd memory", start, end);
> }
> #endif
> +
> +/**
> + * After memory hotplug, the variables max_pfn, max_low_pfn and high_memory are
> + * affected; they are updated in this function. Memory hotplug does not
> + * make sense on a 32-bit kernel, so this function does not handle that case.
> + */
> +void __meminit __attribute__((weak)) update_pfn(u64 start, u64 size)
> +{
> +#ifdef CONFIG_X86_64
> + unsigned long limit_low_pfn = 1UL<<(32 - PAGE_SHIFT);
> + unsigned long start_pfn = start >> PAGE_SHIFT;
> + unsigned long end_pfn = (start + size) >> PAGE_SHIFT;

Strictly speaking, this should use "end_pfn = PFN_UP(start + size);".
[Zheng, Shaohui] I will replace it with PFN_UP in the new version.

> + if (end_pfn > max_pfn) {
> + max_pfn = end_pfn;
> + high_memory = (void *)__va(max_pfn * PAGE_SIZE - 1) + 1;
> + }
> +
> + /* if add to low memory, update max_low_pfn */
> + if (unlikely(start_pfn < limit_low_pfn)) {
> + if (end_pfn <= limit_low_pfn)
> + max_low_pfn = end_pfn;
> + else
> + max_low_pfn = limit_low_pfn;

x86_64 actually always sets max_low_pfn = max_pfn in setup_arch():
[Zheng, Shaohui] There may be some misunderstanding. I read the code carefully: if the total memory is under 4G, max_low_pfn is always equal to max_pfn. If the total memory is larger than 4G, max_low_pfn means the end of low RAM. The code sets max_low_pfn = e820_end_of_low_ram_pfn();.

#ifdef CONFIG_X86_64
        if (max_pfn > max_low_pfn) {
                max_pfn_mapped = init_memory_mapping(1UL<<32,
                                                     max_pfn<<PAGE_SHIFT);
                /* can we preseve max_low_pfn ?*/
                max_low_pfn = max_pfn;
        }
#endif

max_low_pfn is used in

- e820_mark_nosave_regions(max_low_pfn);
- dump_pagetable()
- blk_queue_bounce_limit()
- increase_reservation()

and _seems_ to mean "end of directly addressable pfn".

> + }
> +#endif /* CONFIG_X86_64 */
> +}
> diff --git a/include/linux/bootmem.h b/include/linux/bootmem.h
> index b10ec49..6693414 100644
> --- a/include/linux/bootmem.h
> +++ b/include/linux/bootmem.h
> @@ -13,6 +13,7 @@
>
> extern unsigned long max_low_pfn;
> extern unsigned long min_low_pfn;
> +extern void update_pfn(u64 start, u64 size);
>
> /*
> * highest page
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 030ce8a..ee7b2d6 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -523,6 +523,14 @@ int __ref add_memory(int nid, u64 start, u64 size)
> BUG_ON(ret);
> }
>
> + /* update e820 table */

This comment can be eliminated - you already have the very readable printk :)
[Zheng, Shaohui] I will remove this comment

> + printk(KERN_INFO "Adding memory region to e820 table (start:%016Lx, size:%016Lx).\n",
> + (unsigned long long)start, (unsigned long long)size);
> + e820_add_region(start, size, E820_RAM);

> + /* update max_pfn, max_low_pfn and high_memory */
> + update_pfn(start, size);

How about renaming the function to update_end_of_memory_vars()?
[Zheng, Shaohui] Agree.

Thanks,
Fengguang
From: Wu Fengguang on
> > + /* if add to low memory, update max_low_pfn */
> > + if (unlikely(start_pfn < limit_low_pfn)) {
> > + if (end_pfn <= limit_low_pfn)
> > + max_low_pfn = end_pfn;
> > + else
> > + max_low_pfn = limit_low_pfn;
>
> x86_64 actually always sets max_low_pfn = max_pfn in setup_arch():
> [Zheng, Shaohui] There may be some misunderstanding. I read the
> code carefully: if the total memory is under 4G, max_low_pfn is
> always equal to max_pfn. If the total memory is larger than 4G,
> max_low_pfn means the end of low RAM. The code sets

> max_low_pfn = e820_end_of_low_ram_pfn();.

The above line is very misleading. In setup_arch(), it is
overridden by the following block.

> #ifdef CONFIG_X86_64
>         if (max_pfn > max_low_pfn) {
>                 max_pfn_mapped = init_memory_mapping(1UL<<32,
>                                                      max_pfn<<PAGE_SHIFT);
>                 /* can we preseve max_low_pfn ?*/
>                 max_low_pfn = max_pfn;
>         }
> #endif
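
The ordering described above can be modelled in a few lines of userspace C. This is only a sketch of the sequence discussed in this thread, not kernel code: struct pfn_limits and model_setup_arch() are hypothetical names, and the two arguments stand in for the values that the e820 end-of-RAM and end-of-low-RAM lookups would return at boot.

```c
/* Hypothetical model of the two end-of-memory variables discussed here. */
struct pfn_limits {
	unsigned long long max_pfn;
	unsigned long long max_low_pfn;
};

/*
 * Model of the x86_64 boot sequence as described in the thread:
 * max_low_pfn is first set to the end of low RAM, then the later
 * block in setup_arch() overwrites it whenever max_pfn is larger.
 */
static struct pfn_limits model_setup_arch(unsigned long long end_of_ram_pfn,
					  unsigned long long end_of_low_ram_pfn)
{
	struct pfn_limits l;

	l.max_pfn = end_of_ram_pfn;
	l.max_low_pfn = end_of_low_ram_pfn;	/* the "misleading" assignment */

	if (l.max_pfn > l.max_low_pfn)		/* the overriding block */
		l.max_low_pfn = l.max_pfn;

	return l;
}
```

With 4 KiB pages, a machine with 8G of RAM and 3G below the 4G boundary would be modelled as model_setup_arch(0x200000, 0xC0000), and the result has max_low_pfn equal to max_pfn. That is why clamping max_low_pfn to the 4G boundary in update_pfn() would disagree with the boot-time value.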

Thanks,
Fengguang
From: Wu Fengguang on
On Tue, Jan 12, 2010 at 08:30:31AM +0800, KAMEZAWA Hiroyuki wrote:
> On Mon, 11 Jan 2010 20:43:03 +0800
> Wu Fengguang <fengguang.wu(a)intel.com> wrote:
>
> > > > + /* if add to low memory, update max_low_pfn */
> > > > + if (unlikely(start_pfn < limit_low_pfn)) {
> > > > + if (end_pfn <= limit_low_pfn)
> > > > + max_low_pfn = end_pfn;
> > > > + else
> > > > + max_low_pfn = limit_low_pfn;
> > >
> > > > x86_64 actually always sets max_low_pfn = max_pfn in setup_arch():
> > > > [Zheng, Shaohui] There may be some misunderstanding. I read the
> > > > code carefully: if the total memory is under 4G, max_low_pfn is
> > > > always equal to max_pfn. If the total memory is larger than 4G,
> > > > max_low_pfn means the end of low RAM. The code sets
> > >
> > > > max_low_pfn = e820_end_of_low_ram_pfn();.
> > >
> > > The above line is very misleading. In setup_arch(), it is
> > > overridden by the following block.
> >
>
> Hmmm... could you rewrite /dev/mem to use kernel/resource.c rather than
> modifying the e820 map?
> Two reasons:
> - The e820 map is considered to be stable and read-only after boot.
> - We don't need to add more x86-specific code.

Sure, here it is :)
---
x86: use the generic page_is_ram()

The generic resource based page_is_ram() works better with memory
hotplug/hotremove. So switch the x86 e820map based code to it.

CC: Andi Kleen <andi(a)firstfloor.org>
CC: KAMEZAWA Hiroyuki <kamezawa.hiroyu(a)jp.fujitsu.com>
Signed-off-by: Wu Fengguang <fengguang.wu(a)intel.com>
---
arch/x86/include/asm/page_types.h | 1
arch/x86/mm/ioremap.c | 37 ----------------------------
kernel/resource.c | 17 ++++++++++++
3 files changed, 17 insertions(+), 38 deletions(-)

--- linux-mm.orig/arch/x86/include/asm/page_types.h 2010-01-12 10:31:01.000000000 +0800
+++ linux-mm/arch/x86/include/asm/page_types.h 2010-01-12 10:31:44.000000000 +0800
@@ -34,19 +34,18 @@

#ifdef CONFIG_X86_64
#include <asm/page_64_types.h>
#else
#include <asm/page_32_types.h>
#endif /* CONFIG_X86_64 */

#ifndef __ASSEMBLY__

-extern int page_is_ram(unsigned long pagenr);
extern int devmem_is_allowed(unsigned long pagenr);

extern unsigned long max_low_pfn_mapped;
extern unsigned long max_pfn_mapped;

extern unsigned long init_memory_mapping(unsigned long start,
unsigned long end);

extern void initmem_init(unsigned long start_pfn, unsigned long end_pfn,
--- linux-mm.orig/arch/x86/mm/ioremap.c 2010-01-12 10:31:01.000000000 +0800
+++ linux-mm/arch/x86/mm/ioremap.c 2010-01-12 10:31:44.000000000 +0800
@@ -18,55 +18,18 @@
#include <asm/e820.h>
#include <asm/fixmap.h>
#include <asm/pgtable.h>
#include <asm/tlbflush.h>
#include <asm/pgalloc.h>
#include <asm/pat.h>

#include "physaddr.h"

-int page_is_ram(unsigned long pagenr)
-{
- resource_size_t addr, end;
- int i;
-
- /*
- * A special case is the first 4Kb of memory;
- * This is a BIOS owned area, not kernel ram, but generally
- * not listed as such in the E820 table.
- */
- if (pagenr == 0)
- return 0;
-
- /*
- * Second special case: Some BIOSen report the PC BIOS
- * area (640->1Mb) as ram even though it is not.
- */
- if (pagenr >= (BIOS_BEGIN >> PAGE_SHIFT) &&
- pagenr < (BIOS_END >> PAGE_SHIFT))
- return 0;
-
- for (i = 0; i < e820.nr_map; i++) {
- /*
- * Not usable memory:
- */
- if (e820.map[i].type != E820_RAM)
- continue;
- addr = (e820.map[i].addr + PAGE_SIZE-1) >> PAGE_SHIFT;
- end = (e820.map[i].addr + e820.map[i].size) >> PAGE_SHIFT;
-
-
- if ((pagenr >= addr) && (pagenr < end))
- return 1;
- }
- return 0;
-}
-
/*
* Fix up the linear direct mapping of the kernel to avoid cache attribute
* conflicts.
*/
int ioremap_change_attr(unsigned long vaddr, unsigned long size,
unsigned long prot_val)
{
unsigned long nrpages = size >> PAGE_SHIFT;
int err;
--- linux-mm.orig/kernel/resource.c 2010-01-12 10:31:01.000000000 +0800
+++ linux-mm/kernel/resource.c 2010-01-12 10:31:44.000000000 +0800
@@ -298,18 +298,35 @@ int walk_system_ram_range(unsigned long
#endif

static int __is_ram(unsigned long pfn, unsigned long nr_pages, void *arg)
{
return 24;
}

int __attribute__((weak)) page_is_ram(unsigned long pfn)
{
+#ifdef CONFIG_X86
+ /*
+ * A special case is the first 4Kb of memory;
+ * This is a BIOS owned area, not kernel ram, but generally
+ * not listed as such in the E820 table.
+ */
+ if (pfn == 0)
+ return 0;
+
+ /*
+ * Second special case: Some BIOSen report the PC BIOS
+ * area (640->1Mb) as ram even though it is not.
+ */
+ if (pfn >= (BIOS_BEGIN >> PAGE_SHIFT) &&
+ pfn < (BIOS_END >> PAGE_SHIFT))
+ return 0;
+#endif
return 24 == walk_system_ram_range(pfn, 1, NULL, __is_ram);
}

/*
* Find empty slot in the resource tree given range and alignment.
*/
static int find_resource(struct resource *root, struct resource *new,
resource_size_t size, resource_size_t min,
resource_size_t max, resource_size_t align,
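
As a rough illustration of why a resource-based lookup copes with hotplug, here is a small userspace model. It is only a sketch: the ram_ranges table, model_add_memory() and model_page_is_ram() are hypothetical stand-ins for the "System RAM" resource list, add_memory() and page_is_ram(), and 4 KiB pages are assumed. The BIOS_BEGIN and BIOS_END values are the usual 640K and 1M boundaries.

```c
#include <stddef.h>

/* Assumptions for illustration: 4 KiB pages, legacy BIOS hole at 640K-1M. */
#define PAGE_SHIFT 12
#define BIOS_BEGIN 0x000a0000UL
#define BIOS_END   0x00100000UL

/* One registered RAM range, as a half-open pfn interval [start_pfn, end_pfn). */
struct ram_range {
	unsigned long long start_pfn;
	unsigned long long end_pfn;
};

#define MAX_RANGES 8
static struct ram_range ram_ranges[MAX_RANGES];
static size_t nr_ranges;

/* Model of memory registration: boot memory and hot-added memory use the same path. */
static int model_add_memory(unsigned long long start_pfn, unsigned long long nr_pages)
{
	if (nr_ranges == MAX_RANGES)
		return -1;
	ram_ranges[nr_ranges].start_pfn = start_pfn;
	ram_ranges[nr_ranges].end_pfn = start_pfn + nr_pages;
	nr_ranges++;
	return 0;
}

/* Model of the lookup: x86 special cases first, then walk the registered ranges. */
static int model_page_is_ram(unsigned long long pfn)
{
	size_t i;

	if (pfn == 0)
		return 0;	/* first page is BIOS owned */
	if (pfn >= (BIOS_BEGIN >> PAGE_SHIFT) && pfn < (BIOS_END >> PAGE_SHIFT))
		return 0;	/* 640K-1M is not RAM */

	for (i = 0; i < nr_ranges; i++)
		if (pfn >= ram_ranges[i].start_pfn && pfn < ram_ranges[i].end_pfn)
			return 1;
	return 0;
}
```

In this model, a pfn at 4G is not RAM until model_add_memory(0x100000, 0x8000) registers the hot-added range. After that the same lookup returns 1 without touching any boot-time table.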

From: Wu Fengguang on
On Tue, Jan 12, 2010 at 09:50:12AM +0800, KAMEZAWA Hiroyuki wrote:

> Just for information.
>
> We already check kernel/resource.c's resource information here:
>
> read_mem()
> -> range_is_allowed()
> -> devmem_is_allowed()
> -> iomem_is_exclusive()
>
> Extra calls to page_is_ram() to ask the architecture's map seem redundant.
>
> But I know the PPC guys don't use ioresource.c, hehe.

Another exception is !CONFIG_STRICT_DEVMEM, which makes
range_is_allowed() always return 1. So we still need page_is_ram() :)
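
A minimal model of that point, with loudly hypothetical names: model_range_is_allowed() takes a runtime "strict" flag standing in for CONFIG_STRICT_DEVMEM, and model_devmem_is_allowed() uses an arbitrary example policy rather than the real per-page check.

```c
/* Hypothetical per-page policy; the real check is architecture specific. */
static int model_devmem_is_allowed(unsigned long pfn)
{
	return pfn <= 256;	/* arbitrary example: only the first 1M is allowed */
}

/*
 * Model of the range check on the /dev/mem read path.
 * strict != 0 models CONFIG_STRICT_DEVMEM=y: every page is checked.
 * strict == 0 models !CONFIG_STRICT_DEVMEM: the check always passes,
 * so no resource information is consulted at all.
 */
static int model_range_is_allowed(int strict, unsigned long pfn,
				  unsigned long nr_pages)
{
	unsigned long i;

	if (!strict)
		return 1;

	for (i = 0; i < nr_pages; i++)
		if (!model_devmem_is_allowed(pfn + i))
			return 0;
	return 1;
}
```

In the non-strict case the call returns 1 for any range, so a later page_is_ram() lookup is the only place left where RAM and non-RAM pages are told apart.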

Thanks,
Fengguang
