From: Ankita Garg on
Hi,

On Thu, May 13, 2010 at 07:36:30PM +0800, Shaohui Zheng wrote:
> Hi, All
> This patchset introduces a NUMA hotplug emulator for x86. It touches many
> files and might introduce new bugs, so we are sending an RFC to the community
> first and look forward to comments and suggestions. Thanks.
>

<snip>

> * Principles & Usages
>
> The NUMA hotplug emulator includes 3 different parts. We add a menu item to
> menuconfig to enable/disable them
> (Refer to http://shaohui.org/images/hpe-krnl-cfg.jpg)
>
>
> 1) Node hotplug emulation:
>
> The emulator first hides RAM via the E820 table, and then it can
> fake offlined nodes with the hidden RAM.
>
> After system bootup, the user is able to hotplug-add these offlined
> nodes, just as with real hotplug hardware.
>
> Use the boot option "numa=hide=N*size" to fake offlined nodes:
> - N is the number of hidden nodes
> - size is the memory size (in MB) per hidden node.
>
> There is a sysfs entry "probe" under /sys/devices/system/node/ for the user
> to hotplug the fake offlined nodes:
>
> - to show all fake offlined nodes:
> $ cat /sys/devices/system/node/probe
>
> - to hot-add a fake offlined node, e.g. with node ID N:
> $ echo N > /sys/devices/system/node/probe
>
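The two probe operations quoted above can be wrapped in a small helper. This is only a sketch: the function name hotadd_all_offline_nodes and its optional sysfs-root argument (which lets the logic run against a scratch directory) are hypothetical, and it assumes that reading probe prints the offlined node IDs separated by whitespace, which the single-node cat example suggests but does not confirm for several nodes.

```shell
# Hypothetical helper (not part of the patchset): hot-add every fake
# offlined node reported by the emulator's "probe" file.
# Usage: hotadd_all_offline_nodes [sysfs_root]   (default: /sys)
hotadd_all_offline_nodes() (
    root=${1:-/sys}
    probe="$root/devices/system/node/probe"

    # Assumption: reading "probe" lists the offlined node IDs separated by
    # whitespace, e.g. "3" or "3 4".
    ids=$(cat "$probe") || return 1

    for id in $ids; do
        # Writing one node ID asks the emulator to hot-add that node.
        echo "$id" > "$probe" || return 1
        echo "hot-added node$id"
    done
)
```

On a kernel booted with numa=hide=..., running hotadd_all_offline_nodes as root would hot-add every node the emulator reported, printing one line per node.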

I tried the patchset on a non-NUMA machine. So, in order to create fake
NUMA nodes and be able to emulate the hotplug behavior, I used the
following command line:

"numa=fake=4 numa=hide=2*2048"

on a machine with 8G memory. I expected to see 4 nodes, out of which 2
would be hidden. However, the system came up with 4 online nodes and 2
offline nodes (thus a total of 6 nodes). While we could decide that these
are the semantics, I feel that numa=fake should define the total
number of nodes. So in the above case, the system should have come up
with 2 online nodes and 2 offline nodes.
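The two readings of "numa=fake=4 numa=hide=2*2048" can be stated as simple arithmetic. The sketch below is illustrative only (node_counts and its mode names are hypothetical): "additive" reproduces the 4 online + 2 offline result observed above, and "total" is the proposal that numa=fake=F fixes the total, so F - N nodes stay online.

```shell
# Hypothetical illustration of the two semantics for
# "numa=fake=F numa=hide=N".
#   additive: F fake nodes stay online, N hidden nodes are extra
#   total:    F is the total node count, N of those are hidden
# Usage: node_counts additive|total F N
node_counts() (
    mode=$1 fake=$2 hide=$3
    case $mode in
        additive)
            online=$fake ;;
        total)
            if [ "$hide" -gt "$fake" ]; then
                echo "error: cannot hide $hide of $fake nodes" >&2
                return 1
            fi
            online=$((fake - hide)) ;;
        *)
            echo "error: unknown mode '$mode'" >&2
            return 1 ;;
    esac
    echo "online=$online offline=$hide total=$((online + hide))"
)
```

With F=4 and N=2, "additive" gives 6 nodes in total and "total" gives 4, which is exactly the difference being discussed.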

Also, a plain "numa=hide=N" could be supported, with the size
of each hidden node equal to the entire size of the node, with or
without the numa=fake parameter.
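A sketch of how the boot option could be parsed to accept both forms follows. parse_numa_hide is a hypothetical name, not the patchset's parser; "N*size" (size in MB) is the syntax the patchset documents, while the bare "N" form is only the suggestion above and is reported as size=whole, meaning each hidden node keeps the whole node's memory.

```shell
# Hypothetical parser for numa=hide= on a kernel command line.
#   "numa=hide=N*size" -> nodes=N size=<size>M   (documented form)
#   "numa=hide=N"      -> nodes=N size=whole     (proposed form)
# Usage: parse_numa_hide "<command line>"
parse_numa_hide() (
    set -f                      # keep "2*2048" from being glob-expanded
    spec=
    for word in $1; do
        case $word in
            numa=hide=*) spec=${word#numa=hide=} ;;
        esac
    done
    if [ -z "$spec" ]; then
        echo "error: no numa=hide= option" >&2
        return 1
    fi

    case $spec in
        *\**) count=${spec%%\**} size=${spec#*\*} ;;  # split at first '*'
        *)    count=$spec size=whole ;;
    esac

    case $count in
        ''|*[!0-9]*) echo "error: bad node count '$count'" >&2; return 1 ;;
    esac
    case $size in
        whole) ;;
        ''|*[!0-9]*) echo "error: bad size '$size'" >&2; return 1 ;;
        *) size="${size}M" ;;
    esac
    echo "nodes=$count size=$size"
)
```

For Ankita's command line, parse_numa_hide "numa=fake=4 numa=hide=2*2048" prints nodes=2 size=2048M.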

On onlining one of the offline nodes, I see another issue: the
memory under it is not automatically brought online. For example:

#ls /sys/devices/system/node
..... node0 node1 node2..

#cat /sys/devices/system/node/probe
3

#echo 3 > /sys/devices/system/node/probe
#ls /sys/devices/system/node
..... node0 node1 node2 node3

#cat /sys/devices/system/node/node3/meminfo
Node 3 MemTotal: 0 kB
Node 3 MemFree: 0 kB
Node 3 MemUsed: 0 kB
Node 3 Active: 0 kB
.......

i.e., they come up as memory-less nodes. However, these nodes were designated
to have memory. So, when onlining the nodes, maybe we could bring all their
memory online as well?
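As a workaround, memory can be onlined from user space through the standard memory-hotplug sysfs files, where each block linked under /sys/devices/system/node/nodeN/memoryM has a state file that accepts "online". The sketch below assumes the emulator registers the hot-added memory blocks under the new node's directory, which the 0 kB meminfo above leaves open; online_node_memory and its optional root argument are hypothetical.

```shell
# Hypothetical helper: online every offline memory block of one node via
# the memory-hotplug sysfs files (nodeN/memoryM/state).
# Usage: online_node_memory NODE [sysfs_root]   (default: /sys)
online_node_memory() (
    node=$1 root=${2:-/sys}
    dir="$root/devices/system/node/node$node"
    if [ ! -d "$dir" ]; then
        echo "error: no node$node under $root" >&2
        return 1
    fi

    count=0
    for blk in "$dir"/memory[0-9]*; do
        [ -e "$blk/state" ] || continue     # glob matched nothing
        if [ "$(cat "$blk/state")" = offline ]; then
            echo online > "$blk/state" || return 1
            count=$((count + 1))
        fi
    done
    echo "onlined $count block(s) on node$node"
)
```

Run as root after "echo 3 > /sys/devices/system/node/probe", online_node_memory 3 would online whatever blocks are linked under node3.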

--
Regards,
Ankita Garg (ankita(a)in.ibm.com)
Linux Technology Center
IBM India Systems & Technology Labs,
Bangalore, India
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Ankita Garg on
Hi,

On Thu, May 13, 2010 at 07:48:35PM +0800, Shaohui Zheng wrote:
> Userland interface to hotplug-add fake offlined nodes.
>
> Add a sysfs entry "probe" under /sys/devices/system/node/:
>
> - to show all fake offlined nodes:
> $ cat /sys/devices/system/node/probe
>
> - to hot-add a fake offlined node, e.g. with node ID N:
> $ echo N > /sys/devices/system/node/probe
>
> Signed-off-by: Haicheng Li <haicheng.li(a)linux.intel.com>
> Signed-off-by: Shaohui Zheng <shaohui.zheng(a)intel.com>
> ---
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 9458685..2c078c8 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1214,6 +1214,20 @@ config NUMA_EMU
> into virtual nodes when booted with "numa=fake=N", where N is the
> number of nodes. This is only useful for debugging.
>
> +config NUMA_HOTPLUG_EMU
> + bool "NUMA hotplug emulator"
> + depends on X86_64 && NUMA && HOTPLUG
> + ---help---
> +
> +config NODE_HOTPLUG_EMU
> + bool "Node hotplug emulation"
> + depends on NUMA_HOTPLUG_EMU && MEMORY_HOTPLUG
> + ---help---
> + Enable Node hotplug emulation. The machine will be set up with
> + hidden virtual nodes when booted with "numa=hide=N*size", where
> + N is the number of hidden nodes, size is the memory size per
> + hidden node. This is only useful for debugging.
> +

The above dependencies do not work as expected. I could configure
NUMA_HOTPLUG_EMU & NODE_HOTPLUG_EMU without having MEMORY_HOTPLUG
turned on. Moving the above definitions below the SPARSEMEM and memory
hot add/remove options should sort out the dependencies.
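A quick way to spot the combination described above in a generated .config is sketched below. check_hotplug_emu_deps is a hypothetical helper, and it only checks the one pair reported here (NODE_HOTPLUG_EMU=y without MEMORY_HOTPLUG=y); it does not replace fixing the Kconfig ordering.

```shell
# Hypothetical sanity check: report a .config where NODE_HOTPLUG_EMU is
# enabled but MEMORY_HOTPLUG is not.
# Usage: check_hotplug_emu_deps [path/to/.config]   (default: .config)
check_hotplug_emu_deps() (
    config=${1:-.config}
    if [ ! -r "$config" ]; then
        echo "error: cannot read $config" >&2
        return 1
    fi

    # True if "CONFIG_<name>=y" appears on its own line.
    enabled() { grep -q "^CONFIG_$1=y\$" "$config"; }

    if enabled NODE_HOTPLUG_EMU && ! enabled MEMORY_HOTPLUG; then
        echo "bad: NODE_HOTPLUG_EMU=y without MEMORY_HOTPLUG=y"
        return 1
    fi
    echo ok
)
```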

--
Regards,
Ankita Garg (ankita(a)in.ibm.com)
Linux Technology Center
IBM India Systems & Technology Labs,
Bangalore, India
From: Shaohui Zheng on
On Fri, May 21, 2010 at 03:41:04PM +0530, Ankita Garg wrote:
> Hi,
>
> On Thu, May 13, 2010 at 08:00:16PM +0800, Shaohui Zheng wrote:
> > hotplug emulator:extend memory probe interface to support NUMA
> >
> > Signed-off-by: Shaohui Zheng <shaohui.zheng(a)intel.com>
> > Signed-off-by: Haicheng Li <haicheng.li(a)intel.com>
> > Signed-off-by: Wu Fengguang <fengguang.wu(a)intel.com>
> > ---
> > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > index 54ccb0d..787024f 100644
> > --- a/arch/x86/Kconfig
> > +++ b/arch/x86/Kconfig
> > @@ -1239,6 +1239,17 @@ config ARCH_CPU_PROBE_RELEASE
> > is for cpu hot-add/hot-remove to specified node in software method.
> > This is for debuging and testing purpose
> >
> > +config ARCH_MEMORY_PROBE
>
> The above symbol exists already...
Yes. We create the CONFIG_NUMA_HOTPLUG_EMU, CONFIG_NODE_HOTPLUG_EMU and CONFIG_ARCH_CPU_PROBE_RELEASE options,
and move CONFIG_ARCH_MEMORY_PROBE next to those 3 options.


>
> > + def_bool y
> > + bool "Memory hotplug emulation"
> > + depends on NUMA_HOTPLUG_EMU
> > + ---help---
> > + Enable memory hotplug emulation. Reserve memory with the grub parameter
> > + "mem=N" (such as mem=1024M), where N is the initial memory size; the
> > + rest of physical memory will be removed from the e820 table. The memory
> > + probe interface is for memory hot-add to a specified node in software.
> > + This is for debugging and testing purposes.
> > +
> > config NODES_SHIFT
> > int "Maximum NUMA Nodes (as a power of 2)" if !MAXSMP
> > range 1 10
>
>
> --
> Regards,
> Ankita Garg (ankita(a)in.ibm.com)
> Linux Technology Center
> IBM India Systems & Technology Labs,
> Bangalore, India
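The ARCH_MEMORY_PROBE help text quoted above describes booting with mem=N and then hot-adding the rest of the memory through the memory probe interface, /sys/devices/system/memory/probe, which takes a physical start address. The sketch below lists section-aligned addresses for that. It assumes the hidden memory is one contiguous range from the mem= limit up to the total (real e820 maps usually have holes) and a 128 MB section size (the x86_64 SPARSEMEM default), and it omits any node argument, since the syntax of the NUMA extension is not shown in the quoted hunk; probe_addresses is a hypothetical name.

```shell
# Hypothetical helper: print section-aligned physical start addresses
# from the mem= limit up to total RAM, one per line, in hex.
# Assumes contiguous hidden memory; sizes are in MB.
# Usage: probe_addresses MEM_MB TOTAL_MB [SECTION_MB]   (default: 128)
probe_addresses() (
    mem_mb=$1 total_mb=$2 section_mb=${3:-128}
    if [ $((mem_mb % section_mb)) -ne 0 ]; then
        echo "error: mem=${mem_mb}M is not section aligned" >&2
        return 1
    fi

    addr_mb=$mem_mb
    while [ "$addr_mb" -lt "$total_mb" ]; do
        printf '0x%x\n' $((addr_mb * 1024 * 1024))
        addr_mb=$((addr_mb + section_mb))
    done
)
```

For example, probe_addresses 1024 8192 prints the 56 section addresses between 1 GB and 8 GB; each could then be written to the probe file with echo, as root.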

--
Thanks & Regards,
Shaohui

From: Shaohui Zheng on
On Fri, May 21, 2010 at 03:38:16PM +0530, Ankita Garg wrote:
> Hi,
>
> On Thu, May 13, 2010 at 07:48:35PM +0800, Shaohui Zheng wrote:
> > Userland interface to hotplug-add fake offlined nodes.
> >
> > Add a sysfs entry "probe" under /sys/devices/system/node/:
> >
> > - to show all fake offlined nodes:
> > $ cat /sys/devices/system/node/probe
> >
> > - to hot-add a fake offlined node, e.g. with node ID N:
> > $ echo N > /sys/devices/system/node/probe
> >
> > Signed-off-by: Haicheng Li <haicheng.li(a)linux.intel.com>
> > Signed-off-by: Shaohui Zheng <shaohui.zheng(a)intel.com>
> > ---
> > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > index 9458685..2c078c8 100644
> > --- a/arch/x86/Kconfig
> > +++ b/arch/x86/Kconfig
> > @@ -1214,6 +1214,20 @@ config NUMA_EMU
> > into virtual nodes when booted with "numa=fake=N", where N is the
> > number of nodes. This is only useful for debugging.
> >
> > +config NUMA_HOTPLUG_EMU
> > + bool "NUMA hotplug emulator"
> > + depends on X86_64 && NUMA && HOTPLUG
> > + ---help---
> > +
> > +config NODE_HOTPLUG_EMU
> > + bool "Node hotplug emulation"
> > + depends on NUMA_HOTPLUG_EMU && MEMORY_HOTPLUG
> > + ---help---
> > + Enable Node hotplug emulation. The machine will be set up with
> > + hidden virtual nodes when booted with "numa=hide=N*size", where
> > + N is the number of hidden nodes, size is the memory size per
> > + hidden node. This is only useful for debugging.
> > +
>
> The above dependencies do not work as expected. I could configure
> NUMA_HOTPLUG_EMU & NODE_HOTPLUG_EMU without having MEMORY_HOTPLUG
> turned on. Moving the above definitions below the SPARSEMEM and memory
> hot add/remove options should sort out the dependencies.
Ankita,
The emulation code has been tested many times, but we did not try every
combination of the new options. Good catch.
We will include your suggestion in the formal patch. Thanks so much.
>
> --
> Regards,
> Ankita Garg (ankita(a)in.ibm.com)
> Linux Technology Center
> IBM India Systems & Technology Labs,
> Bangalore, India

--
Thanks & Regards,
Shaohui

From: Shaohui Zheng on
On Fri, May 21, 2010 at 03:03:40PM +0530, Ankita Garg wrote:
>
> I tried the patchset on a non-NUMA machine. So, in order to create fake
> NUMA nodes and be able to emulate the hotplug behavior, I used the
> following command line:
>
> "numa=fake=4 numa=hide=2*2048"
>
> on a machine with 8G memory. I expected to see 4 nodes, out of which 2
> would be hidden. However, the system came up with 4 online nodes and 2
> offline nodes (thus a total of 6 nodes). While we could decide that these
> are the semantics, I feel that numa=fake should define the total
> number of nodes. So in the above case, the system should have come up
> with 2 online nodes and 2 offline nodes.
Ankita,
This is the expected result. NUMA_EMU and NUMA_HOTPLUG_EMU are 2 different
features, and there is no dependency between them: even if you disable
NUMA_EMU, hotplug emulation still works. This implementation reduces
dependencies and keeps things simple and easy to understand.
Your concern makes sense semantically, but we prefer not to combine 2
independent modules.
>
> Also, a plain "numa=hide=N" could be supported, with the size
> of each hidden node equal to the entire size of the node, with or
> without the numa=fake parameter.
>
> On onlining one of the offline nodes, I see another issue: the
> memory under it is not automatically brought online. For example:
>
> #ls /sys/devices/system/node
> .... node0 node1 node2..
>
> #cat /sys/devices/system/node/probe
> 3
>
> #echo 3 > /sys/devices/system/node/probe
> #ls /sys/devices/system/node
> .... node0 node1 node2 node3
>
> #cat /sys/devices/system/node/node3/meminfo
> Node 3 MemTotal: 0 kB
> Node 3 MemFree: 0 kB
> Node 3 MemUsed: 0 kB
> Node 3 Active: 0 kB
> ......
>
> i.e., they come up as memory-less nodes. However, these nodes were designated
> to have memory. So, when onlining the nodes, maybe we could bring all their
> memory online as well?
This matches the real memory hotplug implementation in the Linux kernel:
when physical memory is hot-added to a machine, the kernel creates the memory
entries and the related data structures, but it never onlines the memory
itself; that has to be done from user space.

The node hotplug emulation and memory hotplug emulation features follow the
same rules as the kernel.

As we know, allocating memory from a memory-less node causes an OOM issue,
and some engineers are already working on this bug. Because the OOM issue
can be reproduced with the hotplug emulator, it helps those engineers a lot.
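To check whether a node has come up memory-less, as node3 did in the meminfo output quoted above, the per-node meminfo files can be scanned. A minimal sketch, assuming the "Node N MemTotal: X kB" line format shown above; memoryless_nodes and its optional root argument are hypothetical.

```shell
# Hypothetical helper: print the IDs of nodes whose meminfo reports
# "MemTotal: 0 kB".
# Usage: memoryless_nodes [sysfs_root]   (default: /sys)
memoryless_nodes() (
    root=${1:-/sys}
    for mi in "$root"/devices/system/node/node[0-9]*/meminfo; do
        [ -r "$mi" ] || continue            # glob matched nothing
        # Fields: "Node" <id> "MemTotal:" <value> "kB"
        total=$(awk '$3 == "MemTotal:" { print $4 }' "$mi")
        if [ "$total" = 0 ]; then
            n=${mi%/meminfo}
            echo "${n##*/node}"
        fi
    done
)
```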

This feature is flexible. As far as I know, some OSVs already online hot-added
memory automatically; if the mainline kernel decides to do the same, we will
change the related code too.

>
> --
> Regards,
> Ankita Garg (ankita(a)in.ibm.com)
> Linux Technology Center
> IBM India Systems & Technology Labs,
> Bangalore, India

--
Thanks & Regards,
Shaohui
