From: Andi Kleen
On Fri, Mar 05, 2010 at 02:47:04PM +0200, Anca Emanuel wrote:
> Dumb question: is it possible to hot-remove the (bad) memory? And add
> a good one?

Not the complete DIMM, but since 2.6.33, yes: a specific page containing
a stuck bit or similar can be removed.

In theory you could add new memory to replace that memory if your
hardware and your kernel support that, but typically that's
not worth it for a few K.
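
For reference, a minimal user-space sketch of offlining a single page
through the hwpoison interface (assuming CONFIG_MEMORY_FAILURE and root;
the madvise() hook is really the self-test interface, mcelog itself
drives offlining through /sys/devices/system/memory/soft_offline_page):

/*
 * Hedged sketch, not mcelog's actual code: mark one page of this
 * process hw-poisoned so the kernel unmaps it and any later access
 * gets SIGBUS instead of bad data.
 */
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_HWPOISON
#define MADV_HWPOISON 100	/* value from the 2.6.33 headers */
#endif

int main(void)
{
	long pagesize = sysconf(_SC_PAGESIZE);
	void *page;

	if (posix_memalign(&page, pagesize, pagesize)) {
		perror("posix_memalign");
		return 1;
	}
	*(volatile char *)page = 1;	/* fault the page in */

	if (madvise(page, pagesize, MADV_HWPOISON)) {
		perror("madvise(MADV_HWPOISON)");
		return 1;
	}
	printf("page at %p poisoned\n", page);
	return 0;
}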

> Where is the detection code for the bad module?

Part of the code is in the kernel, part in mcelog.
It only works with ECC memory and supported systems ATM (currently
Nehalem-class Intel Xeon systems).

-Andi


--
ak(a)linux.intel.com -- Speaking for myself only.
From: Andi Kleen
> Under certain conditions this is possible. If the bad memory was modified
> then you have a condition that requires termination of all processes that
> are using the memory. If it's the kernel then you need to reboot.
>
> If the memory contains a page from disk then the memory can be moved
> elsewhere.
>
> If you can clean up a whole range like that then it's possible to replace
> the memory.

Typically that's not possible because of the way DIMMs are interleaved --
the to-be-freed areas would be very large, and at that size there are
always kernel or unmovable user areas in the way.

In general, on Linux hot DIMM replacement only works if the underlying
platform handles it transparently (e.g. supports memory RAID and chipkill)
and you have enough redundant memory for it.

-Andi
--
ak(a)linux.intel.com -- Speaking for myself only.
From: David Rientjes
On Fri, 5 Mar 2010, Nick Piggin wrote:

> > +#if defined(CONFIG_NUMA) && defined(CONFIG_MEMORY_HOTPLUG)
> > +/*
> > + * Drains and frees nodelists for a node on each slab cache, used for memory
> > + * hotplug. Returns -EBUSY if all objects cannot be drained on memory
> > + * hot-remove so that the node is not removed. When used because memory
> > + * hot-add is canceled, the only result is the freed kmem_list3.
> > + *
> > + * Must hold cache_chain_mutex.
> > + */
> > +static int __meminit free_cache_nodelists_node(int node)
> > +{
> > +	struct kmem_cache *cachep;
> > +	int ret = 0;
> > +
> > +	list_for_each_entry(cachep, &cache_chain, next) {
> > +		struct array_cache *shared;
> > +		struct array_cache **alien;
> > +		struct kmem_list3 *l3;
> > +
> > +		l3 = cachep->nodelists[node];
> > +		if (!l3)
> > +			continue;
> > +
> > +		spin_lock_irq(&l3->list_lock);
> > +		shared = l3->shared;
> > +		if (shared) {
> > +			free_block(cachep, shared->entry, shared->avail, node);
> > +			l3->shared = NULL;
> > +		}
> > +		alien = l3->alien;
> > +		l3->alien = NULL;
> > +		spin_unlock_irq(&l3->list_lock);
> > +
> > +		if (alien) {
> > +			drain_alien_cache(cachep, alien);
> > +			free_alien_cache(alien);
> > +		}
> > +		kfree(shared);
> > +
> > +		drain_freelist(cachep, l3, l3->free_objects);
> > +		if (!list_empty(&l3->slabs_full) ||
> > +		    !list_empty(&l3->slabs_partial)) {
> > +			/*
> > +			 * Continue to iterate through each slab cache to free
> > +			 * as many nodelists as possible even though the
> > +			 * offline will be canceled.
> > +			 */
> > +			ret = -EBUSY;
> > +			continue;
> > +		}
> > +		kfree(l3);
> > +		cachep->nodelists[node] = NULL;
>
> What's stopping races of other CPUs trying to access l3 and array
> caches while they're being freed?
>

numa_node_id() will not return an offlined nodeid and cache_alloc_node()
already does a fallback to other onlined nodes in case a nodeid is passed
to kmalloc_node() that does not have a nodelist. l3->shared and l3->alien
cannot be accessed without l3->list_lock (drain, cache_alloc_refill,
cache_flusharray) or cache_chain_mutex (kmem_cache_destroy, cache_reap).
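
To illustrate the convention (a simplified sketch of what's described
above, not the exact mm/slab.c code): every path that touches l3->shared
or l3->alien takes l3->list_lock first, so once free_cache_nodelists_node()
holds the lock and clears the pointers, concurrent users see NULL rather
than a freed array_cache.

static void drain_shared_sketch(struct kmem_cache *cachep, int node)
{
	struct kmem_list3 *l3 = cachep->nodelists[node];
	struct array_cache *shared;

	if (!l3)
		return;
	spin_lock_irq(&l3->list_lock);
	shared = l3->shared;	/* only dereferenced under list_lock */
	if (shared && shared->avail) {
		free_block(cachep, shared->entry, shared->avail, node);
		shared->avail = 0;
	}
	spin_unlock_irq(&l3->list_lock);
}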

> > +	}
> > +	return ret;
> > +}
> > +
> > +/*
> > + * Onlines nid either as the result of memory hot-add or canceled hot-remove.
> > + */
> > +static int __meminit slab_node_online(int nid)
> > +{
> > +	int ret;
> > +	mutex_lock(&cache_chain_mutex);
> > +	ret = init_cache_nodelists_node(nid);
> > +	mutex_unlock(&cache_chain_mutex);
> > +	return ret;
> > +}
> > +
> > +/*
> > + * Offlines nid either as the result of memory hot-remove or canceled hot-add.
> > + */
> > +static int __meminit slab_node_offline(int nid)
> > +{
> > +	int ret;
> > +	mutex_lock(&cache_chain_mutex);
> > +	ret = free_cache_nodelists_node(nid);
> > +	mutex_unlock(&cache_chain_mutex);
> > +	return ret;
> > +}
> > +
> > +static int __meminit slab_memory_callback(struct notifier_block *self,
> > +					unsigned long action, void *arg)
> > +{
> > +	struct memory_notify *mnb = arg;
> > +	int ret = 0;
> > +	int nid;
> > +
> > +	nid = mnb->status_change_nid;
> > +	if (nid < 0)
> > +		goto out;
> > +
> > +	switch (action) {
> > +	case MEM_GOING_ONLINE:
> > +	case MEM_CANCEL_OFFLINE:
> > +		ret = slab_node_online(nid);
> > +		break;
>
> This would explode if CANCEL_OFFLINE fails. Call it theoretical and
> put a panic() in here and I don't mind. Otherwise you get corruption
> somewhere in the slab code.
>

MEM_CANCEL_ONLINE would only fail here if a struct kmem_list3 couldn't be
allocated anywhere on the system and if that happens then the node simply
couldn't be allocated from (numa_node_id() would never return it as the
cpu's node, so it's possible to fall back in this scenario).

Instead of doing this all at MEM_GOING_OFFLINE, we could delay freeing of
the array caches and the nodelist until MEM_OFFLINE. We're guaranteed
that all pages are freed at that point, so there are no existing objects
that we need to track; if the offline then fails from a different
callback, it would be possible to reset the cachep->nodelists[node]
pointers since they haven't been freed yet.
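
In callback form that would look something like this (an untested sketch,
not a real patch; slab_node_drain() is a hypothetical helper that drains
without freeing):

static int __meminit slab_memory_callback(struct notifier_block *self,
					unsigned long action, void *arg)
{
	struct memory_notify *mnb = arg;
	int nid = mnb->status_change_nid;
	int ret = 0;

	if (nid < 0)
		goto out;

	switch (action) {
	case MEM_GOING_ONLINE:
		ret = slab_node_online(nid);
		break;
	case MEM_GOING_OFFLINE:
		/* drain, but leave the l3s and array caches allocated */
		ret = slab_node_drain(nid);	/* hypothetical */
		break;
	case MEM_OFFLINE:
		/* all pages on nid are free: safe to kfree() the l3s
		 * and clear cachep->nodelists[nid] for each cache */
		slab_node_offline(nid);
		break;
	case MEM_CANCEL_OFFLINE:
		/* offline failed, but nothing was freed, so the
		 * existing nodelists can simply be used again */
		break;
	}
out:
	return notifier_from_errno(ret);
}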
From: Nick Piggin
On Mon, Mar 08, 2010 at 03:19:48PM -0800, David Rientjes wrote:
> On Fri, 5 Mar 2010, Nick Piggin wrote:
>
> > > +#if defined(CONFIG_NUMA) && defined(CONFIG_MEMORY_HOTPLUG)
> > > +/*
> > > + * Drains and frees nodelists for a node on each slab cache, used for memory
> > > + * hotplug. Returns -EBUSY if all objects cannot be drained on memory
> > > + * hot-remove so that the node is not removed. When used because memory
> > > + * hot-add is canceled, the only result is the freed kmem_list3.
> > > + *
> > > + * Must hold cache_chain_mutex.
> > > + */
> > > +static int __meminit free_cache_nodelists_node(int node)
> > > +{
> > > +	struct kmem_cache *cachep;
> > > +	int ret = 0;
> > > +
> > > +	list_for_each_entry(cachep, &cache_chain, next) {
> > > +		struct array_cache *shared;
> > > +		struct array_cache **alien;
> > > +		struct kmem_list3 *l3;
> > > +
> > > +		l3 = cachep->nodelists[node];
> > > +		if (!l3)
> > > +			continue;
> > > +
> > > +		spin_lock_irq(&l3->list_lock);
> > > +		shared = l3->shared;
> > > +		if (shared) {
> > > +			free_block(cachep, shared->entry, shared->avail, node);
> > > +			l3->shared = NULL;
> > > +		}
> > > +		alien = l3->alien;
> > > +		l3->alien = NULL;
> > > +		spin_unlock_irq(&l3->list_lock);
> > > +
> > > +		if (alien) {
> > > +			drain_alien_cache(cachep, alien);
> > > +			free_alien_cache(alien);
> > > +		}
> > > +		kfree(shared);
> > > +
> > > +		drain_freelist(cachep, l3, l3->free_objects);
> > > +		if (!list_empty(&l3->slabs_full) ||
> > > +		    !list_empty(&l3->slabs_partial)) {
> > > +			/*
> > > +			 * Continue to iterate through each slab cache to free
> > > +			 * as many nodelists as possible even though the
> > > +			 * offline will be canceled.
> > > +			 */
> > > +			ret = -EBUSY;
> > > +			continue;
> > > +		}
> > > +		kfree(l3);
> > > +		cachep->nodelists[node] = NULL;
> >
> > What's stopping races of other CPUs trying to access l3 and array
> > caches while they're being freed?
> >
>
> numa_node_id() will not return an offlined nodeid and cache_alloc_node()
> already does a fallback to other onlined nodes in case a nodeid is passed
> to kmalloc_node() that does not have a nodelist. l3->shared and l3->alien
> cannot be accessed without l3->list_lock (drain, cache_alloc_refill,
> cache_flusharray) or cache_chain_mutex (kmem_cache_destroy, cache_reap).

Yeah, but can't it _have_ a nodelist (i.e. before it is set to NULL here)
while it is being accessed by another CPU and concurrently being freed
on this one?


> > > +	}
> > > +	return ret;
> > > +}
> > > +
> > > +/*
> > > + * Onlines nid either as the result of memory hot-add or canceled hot-remove.
> > > + */
> > > +static int __meminit slab_node_online(int nid)
> > > +{
> > > +	int ret;
> > > +	mutex_lock(&cache_chain_mutex);
> > > +	ret = init_cache_nodelists_node(nid);
> > > +	mutex_unlock(&cache_chain_mutex);
> > > +	return ret;
> > > +}
> > > +
> > > +/*
> > > + * Offlines nid either as the result of memory hot-remove or canceled hot-add.
> > > + */
> > > +static int __meminit slab_node_offline(int nid)
> > > +{
> > > +	int ret;
> > > +	mutex_lock(&cache_chain_mutex);
> > > +	ret = free_cache_nodelists_node(nid);
> > > +	mutex_unlock(&cache_chain_mutex);
> > > +	return ret;
> > > +}
> > > +
> > > +static int __meminit slab_memory_callback(struct notifier_block *self,
> > > +					unsigned long action, void *arg)
> > > +{
> > > +	struct memory_notify *mnb = arg;
> > > +	int ret = 0;
> > > +	int nid;
> > > +
> > > +	nid = mnb->status_change_nid;
> > > +	if (nid < 0)
> > > +		goto out;
> > > +
> > > +	switch (action) {
> > > +	case MEM_GOING_ONLINE:
> > > +	case MEM_CANCEL_OFFLINE:
> > > +		ret = slab_node_online(nid);
> > > +		break;
> >
> > This would explode if CANCEL_OFFLINE fails. Call it theoretical and
> > put a panic() in here and I don't mind. Otherwise you get corruption
> > somewhere in the slab code.
> >
>
> MEM_CANCEL_ONLINE would only fail here if a struct kmem_list3 couldn't be
> allocated anywhere on the system and if that happens then the node simply
> couldn't be allocated from (numa_node_id() would never return it as the
> cpu's node, so it's possible to fall back in this scenario).

Why would it never return the CPU's node? It's CANCEL_OFFLINE that is
the problem.


> Instead of doing this all at MEM_GOING_OFFLINE, we could delay freeing of
> the array caches and the nodelist until MEM_OFFLINE. We're guaranteed
> that all pages are freed at that point, so there are no existing objects
> that we need to track; if the offline then fails from a different
> callback, it would be possible to reset the cachep->nodelists[node]
> pointers since they haven't been freed yet.

From: Pekka Enberg
Nick Piggin wrote:
> On Mon, Mar 08, 2010 at 03:19:48PM -0800, David Rientjes wrote:
>> On Fri, 5 Mar 2010, Nick Piggin wrote:
>>
>>>> +#if defined(CONFIG_NUMA) && defined(CONFIG_MEMORY_HOTPLUG)
>>>> +/*
>>>> + * Drains and frees nodelists for a node on each slab cache, used for memory
>>>> + * hotplug. Returns -EBUSY if all objects cannot be drained on memory
>>>> + * hot-remove so that the node is not removed. When used because memory
>>>> + * hot-add is canceled, the only result is the freed kmem_list3.
>>>> + *
>>>> + * Must hold cache_chain_mutex.
>>>> + */
>>>> +static int __meminit free_cache_nodelists_node(int node)
>>>> +{
>>>> +	struct kmem_cache *cachep;
>>>> +	int ret = 0;
>>>> +
>>>> +	list_for_each_entry(cachep, &cache_chain, next) {
>>>> +		struct array_cache *shared;
>>>> +		struct array_cache **alien;
>>>> +		struct kmem_list3 *l3;
>>>> +
>>>> +		l3 = cachep->nodelists[node];
>>>> +		if (!l3)
>>>> +			continue;
>>>> +
>>>> +		spin_lock_irq(&l3->list_lock);
>>>> +		shared = l3->shared;
>>>> +		if (shared) {
>>>> +			free_block(cachep, shared->entry, shared->avail, node);
>>>> +			l3->shared = NULL;
>>>> +		}
>>>> +		alien = l3->alien;
>>>> +		l3->alien = NULL;
>>>> +		spin_unlock_irq(&l3->list_lock);
>>>> +
>>>> +		if (alien) {
>>>> +			drain_alien_cache(cachep, alien);
>>>> +			free_alien_cache(alien);
>>>> +		}
>>>> +		kfree(shared);
>>>> +
>>>> +		drain_freelist(cachep, l3, l3->free_objects);
>>>> +		if (!list_empty(&l3->slabs_full) ||
>>>> +		    !list_empty(&l3->slabs_partial)) {
>>>> +			/*
>>>> +			 * Continue to iterate through each slab cache to free
>>>> +			 * as many nodelists as possible even though the
>>>> +			 * offline will be canceled.
>>>> +			 */
>>>> +			ret = -EBUSY;
>>>> +			continue;
>>>> +		}
>>>> +		kfree(l3);
>>>> +		cachep->nodelists[node] = NULL;
>>> What's stopping races of other CPUs trying to access l3 and array
>>> caches while they're being freed?
>>>
>> numa_node_id() will not return an offlined nodeid and cache_alloc_node()
>> already does a fallback to other onlined nodes in case a nodeid is passed
>> to kmalloc_node() that does not have a nodelist. l3->shared and l3->alien
>> cannot be accessed without l3->list_lock (drain, cache_alloc_refill,
>> cache_flusharray) or cache_chain_mutex (kmem_cache_destroy, cache_reap).
>
> Yeah, but can't it _have_ a nodelist (i.e. before it is set to NULL here)
> while it is being accessed by another CPU and concurrently being freed
> on this one?
>
>
>>>> +	}
>>>> +	return ret;
>>>> +}
>>>> +
>>>> +/*
>>>> + * Onlines nid either as the result of memory hot-add or canceled hot-remove.
>>>> + */
>>>> +static int __meminit slab_node_online(int nid)
>>>> +{
>>>> +	int ret;
>>>> +	mutex_lock(&cache_chain_mutex);
>>>> +	ret = init_cache_nodelists_node(nid);
>>>> +	mutex_unlock(&cache_chain_mutex);
>>>> +	return ret;
>>>> +}
>>>> +
>>>> +/*
>>>> + * Offlines nid either as the result of memory hot-remove or canceled hot-add.
>>>> + */
>>>> +static int __meminit slab_node_offline(int nid)
>>>> +{
>>>> +	int ret;
>>>> +	mutex_lock(&cache_chain_mutex);
>>>> +	ret = free_cache_nodelists_node(nid);
>>>> +	mutex_unlock(&cache_chain_mutex);
>>>> +	return ret;
>>>> +}
>>>> +
>>>> +static int __meminit slab_memory_callback(struct notifier_block *self,
>>>> +					unsigned long action, void *arg)
>>>> +{
>>>> +	struct memory_notify *mnb = arg;
>>>> +	int ret = 0;
>>>> +	int nid;
>>>> +
>>>> +	nid = mnb->status_change_nid;
>>>> +	if (nid < 0)
>>>> +		goto out;
>>>> +
>>>> +	switch (action) {
>>>> +	case MEM_GOING_ONLINE:
>>>> +	case MEM_CANCEL_OFFLINE:
>>>> +		ret = slab_node_online(nid);
>>>> +		break;
>>> This would explode if CANCEL_OFFLINE fails. Call it theoretical and
>>> put a panic() in here and I don't mind. Otherwise you get corruption
>>> somewhere in the slab code.
>>>
>> MEM_CANCEL_ONLINE would only fail here if a struct kmem_list3 couldn't be
>> allocated anywhere on the system and if that happens then the node simply
>> couldn't be allocated from (numa_node_id() would never return it as the
>> cpu's node, so it's possible to fall back in this scenario).
>
> Why would it never return the CPU's node? It's CANCEL_OFFLINE that is
> the problem.

So I was thinking of pushing this towards Linus, but I didn't see anyone
respond to Nick's concerns. I'm not that familiar with all this hotplug
stuff, so can someone also make Nick happy so we can move forward?

Pekka