From: Tejun Heo on
Hello,

On 04/26/2010 09:22 AM, Pekka Enberg wrote:
> On Mon, Apr 26, 2010 at 9:59 AM, Zhang, Yanmin
> <yanmin_zhang(a)linux.intel.com> wrote:
>>> I haven't been able to reproduce this either on my Core 2 machine.
>> Mostly, the regression exists on Nehalem machines. I suspect it's related to
>> hyper-threading machine.
>
> OK, so does anyone know why hyper-threading would change things for
> the per-CPU allocator?

My wild speculation is that previously the cpu_slub structures of two
neighboring threads ended up on the same cacheline by accident, thanks
to the back-to-back allocation. With the percpu allocator this would
no longer happen, as the allocator groups percpu data together
per-cpu.

Thanks.

--
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Pekka Enberg on
Hi,

On Mon, Apr 26, 2010 at 9:59 AM, Zhang, Yanmin
<yanmin_zhang(a)linux.intel.com> wrote:
>>>> I haven't been able to reproduce this either on my Core 2 machine.
>>> Mostly, the regression exists on Nehalem machines. I suspect it's related to
>>> hyper-threading machine.

On 04/26/2010 09:22 AM, Pekka Enberg wrote:
>> OK, so does anyone know why hyper-threading would change things for
>> the per-CPU allocator?

On Mon, Apr 26, 2010 at 1:02 PM, Tejun Heo <tj(a)kernel.org> wrote:
> My wild speculation is that previously the cpu_slub structures of two
> neighboring threads ended up on the same cacheline by accident thanks
> to the back to back allocation. W/ the percpu allocator, this no
> longer would happen as the allocator groups percpu data together
> per-cpu.

Yanmin, do we see a lot of remote frees for your hackbench run? IIRC,
it's the "deactivate_remote_frees" stat when CONFIG_SLUB_STATS is
enabled.

Pekka
From: Tejun Heo on
On 04/26/2010 12:09 PM, Pekka Enberg wrote:
>> My wild speculation is that previously the cpu_slub structures of two
>> neighboring threads ended up on the same cacheline by accident thanks
>> to the back to back allocation. W/ the percpu allocator, this no
>> longer would happen as the allocator groups percpu data together
>> per-cpu.
>
> Yanmin, do we see a lot of remote frees for your hackbench run? IIRC,
> it's the "deactivate_remote_frees" stat when CONFIG_SLUB_STATS is
> enabled.

I'm not familiar with the details or scales here, so please take
whatever I say with a grain of salt. In a hyperthreading
configuration, I think operations don't have to be remote to be
affected. If the data for cpu0 and cpu1 were on the same cache line,
and cpu0 and cpu1 occupy the same physical core and thus share all its
resources, they would benefit from the sharing whether any operation
was remote or not, as it saves the physical processor one cache line.

Thanks.

--
tejun
From: Pekka J Enberg on
On 04/26/2010 12:09 PM, Pekka Enberg wrote:
>>> My wild speculation is that previously the cpu_slub structures of two
>>> neighboring threads ended up on the same cacheline by accident thanks
>>> to the back to back allocation. W/ the percpu allocator, this no
>>> longer would happen as the allocator groups percpu data together
>>> per-cpu.
>>
>> Yanmin, do we see a lot of remote frees for your hackbench run? IIRC,
>> it's the "deactivate_remote_frees" stat when CONFIG_SLUB_STATS is
>> enabled.

On Mon, 26 Apr 2010, Tejun Heo wrote:
> I'm not familiar with the details or scales here so please take
> whatever I say with a grain of salt. For hyperthreading configuration
> I think operations don't have to be remote to be affected. If the
> data for cpu0 and cpu1 were on the same cache line, and cpu0 and cpu1
> are occupying the same physical core thus sharing all the resources it
> would benefit from the sharing whether any operation was remote or not
> as it saves the physical processor one cache line.

Even if the cacheline is dirtied, like in the struct kmem_cache_cpu case?
If so, don't we want the per-CPU allocator to support back-to-back
allocation for cores that are in the same package?

Btw, I focused on remote frees initially, before I understood what you
actually meant, and sketched the following untested patch to take advantage
of the fact that struct kmem_cache_cpu doesn't fill a whole cache line. It
tries to amortize remote free costs by "queuing" objects. It would be
interesting to see if it helps here (or with the other SLUB regressions,
like netperf and the famous Intel one).

Pekka

diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index 0249d41..b554a67 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -34,10 +34,14 @@ enum stat_item {
 	ORDER_FALLBACK,		/* Number of times fallback was necessary */
 	NR_SLUB_STAT_ITEMS };
 
+#define SLUB_MAX_NR_REMOTES 5
+
 struct kmem_cache_cpu {
 	void **freelist;	/* Pointer to first free per cpu object */
 	struct page *page;	/* The slab from which we are allocating */
 	int node;		/* The node of the page (or -1 for debug) */
+	int nr_remotes;		/* Number of remotely free'd objects */
+	void *remotelist[SLUB_MAX_NR_REMOTES]; /* List of remotely free'd objects */
 #ifdef CONFIG_SLUB_STATS
 	unsigned stat[NR_SLUB_STAT_ITEMS];
 #endif
diff --git a/mm/slub.c b/mm/slub.c
index 7d6c8b1..e8e5523 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1480,6 +1480,24 @@ static void deactivate_slab(struct kmem_cache *s, struct kmem_cache_cpu *c)
 	unfreeze_slab(s, page, tail);
 }
 
+static void __slab_free(struct kmem_cache *s, struct page *page, void *x, unsigned long addr);
+
+static void flush_remotelist(struct kmem_cache *s, struct kmem_cache_cpu *c)
+{
+	int i;
+
+	for (i = 0; i < c->nr_remotes; i++) {
+		struct page *page;
+		void *x;
+
+		x = c->remotelist[i];
+		page = virt_to_head_page(x);
+
+		__slab_free(s, page, x, _RET_IP_);
+	}
+	c->nr_remotes = 0;
+}
+
 static inline void flush_slab(struct kmem_cache *s, struct kmem_cache_cpu *c)
 {
 	stat(s, CPUSLAB_FLUSH);
@@ -1496,7 +1514,12 @@ static inline void __flush_cpu_slab(struct kmem_cache *s, int cpu)
 {
 	struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu);
 
-	if (likely(c && c->page))
+	if (unlikely(!c))
+		return;
+
+	flush_remotelist(s, c);
+
+	if (likely(c->page))
 		flush_slab(s, c);
 }
 
@@ -1709,6 +1732,8 @@ static __always_inline void *slab_alloc(struct kmem_cache *s,
 
 	local_irq_save(flags);
 	c = __this_cpu_ptr(s->cpu_slab);
+	if (unlikely(c->nr_remotes == SLUB_MAX_NR_REMOTES))
+		flush_remotelist(s, c);
 	object = c->freelist;
 	if (unlikely(!object || !node_match(c, node)))
 
@@ -1865,8 +1890,12 @@ static __always_inline void slab_free(struct kmem_cache *s,
 		set_freepointer(s, object, c->freelist);
 		c->freelist = object;
 		stat(s, FREE_FASTPATH);
-	} else
-		__slab_free(s, page, x, addr);
+	} else {
+		if (unlikely(c->nr_remotes == SLUB_MAX_NR_REMOTES))
+			flush_remotelist(s, c);
+
+		c->remotelist[c->nr_remotes++] = x;
+	}
 
 	local_irq_restore(flags);
 }
From: Tejun Heo on
Hello, Pekka.

On 04/26/2010 04:17 PM, Pekka J Enberg wrote:
> Even if the cacheline is dirtied like in the struct kmem_cache_cpu case?

If my hypothesis is correct, I don't think dirtying or not would
matter. It's about two cpus sharing a cache line, which is usually a
bad idea but happens to be a good one here because the two cpus sit on
the same cache.

> If that's the case, don't we want the per-CPU allocator to support back
> to back allocation for cores that are in the same package?

I think that would probably be over-engineering. With the percpu
allocator, the rest of the cacheline would likely be occupied by
another percpu item for the same cpu, so it's not really wasted; it's
just used differently. It would be good to have a way to better pack
small hot items (for the same cpu) into the same cachelines, but I
don't think it would be wise to interleave data from different cpus.
It's not like there's only a single way to save a cacheline, after all.

Thanks.

--
tejun