From: Eric Dumazet on
Eric Dumazet a écrit :
> Martin Schwidefsky a écrit :
>> On Mon, 28 Sep 2009 08:39:31 -0700 (PDT)
>> Linus Torvalds <torvalds(a)linux-foundation.org> wrote:
>>
>>> On Mon, 28 Sep 2009, Eric Dumazet wrote:
>>>> Linus Torvalds a écrit :
>>>>> Go wild, test it out, and let us know about any regressions you find,
>>>> Something seems wrong with process time accounting.
>>>>
>>>> Following program should consume 10*8 seconds of cpu on a 8 cpu machine, but
>>>> with 2.6.32-rc1 numbers are crazy.
>>> Ok, so the top-level process time looks sane _while_ the thing is running
>>> (sum of all threads), but the per-thread times look broken (as if they had
>>> randomly had the times of the other threads added into them - looks to me
>>> like they on average had half the other threads' time accounted into
>>> them).
>>>
>>> And then at the end, it looks like the time of the threads (which was
>>> over-accounted) get re-accounted back into the main process, so the final
>>> time is indeed wildly inflated.
>>>
>>> IOW, looks like we're adding thread times multiple times to other threads
>>> (and then finally to the parent).
>>>
>>> I'm adding the usual suspects to the cc, and leaving your results and
>>> test-program quoted here for them.. Thomas, Martin, John - if you're not
>>> the people I should blame for this, let me know.
>> Uh-oh.. usual suspects, eh?
>>
>> For starters I run the test program on an s390 with
>> VIRT_CPU_ACCOUNTING=y:
>>
>> time ./burn-cpu
>> PID TTY STAT TIME COMMAND
>> 2979 pts/0 - 0:08 ./burn-cpu
>> - - Sl+ 0:00 -
>> - - Rl+ 0:01 -
>> - - Rl+ 0:01 -
>> - - Rl+ 0:01 -
>> - - Rl+ 0:01 -
>> - - Rl+ 0:01 -
>> - - Rl+ 0:01 -
>> - - Rl+ 0:01 -
>> - - Rl+ 0:01 -
>> PID TTY STAT TIME COMMAND
>> 2979 pts/0 - 0:16 ./burn-cpu
>> - - Sl+ 0:00 -
>> - - Rl+ 0:02 -
>> - - Rl+ 0:02 -
>> - - Rl+ 0:02 -
>> - - Rl+ 0:02 -
>> - - Rl+ 0:02 -
>> - - Rl+ 0:02 -
>> - - Rl+ 0:02 -
>> - - Rl+ 0:02 -
>> PID TTY STAT TIME COMMAND
>> 2979 pts/0 - 0:25 ./burn-cpu
>> - - Sl+ 0:00 -
>> - - Rl+ 0:03 -
>> - - Rl+ 0:03 -
>> - - Rl+ 0:03 -
>> - - Rl+ 0:03 -
>> - - Rl+ 0:03 -
>> - - Rl+ 0:03 -
>> - - Rl+ 0:03 -
>> - - Rl+ 0:03 -
>> PID TTY STAT TIME COMMAND
>> 2979 pts/0 - 0:33 ./burn-cpu
>> - - Sl+ 0:00 -
>> - - Rl+ 0:04 -
>> - - Rl+ 0:04 -
>> - - Rl+ 0:04 -
>> - - Rl+ 0:04 -
>> - - Rl+ 0:04 -
>> - - Rl+ 0:04 -
>> - - Rl+ 0:04 -
>> - - Rl+ 0:04 -
>> PID TTY STAT TIME COMMAND
>> 2979 pts/0 - 0:41 ./burn-cpu
>> - - Sl+ 0:00 -
>> - - Rl+ 0:05 -
>> - - Rl+ 0:05 -
>> - - Rl+ 0:05 -
>> - - Rl+ 0:05 -
>> - - Rl+ 0:05 -
>> - - Rl+ 0:05 -
>> - - Rl+ 0:05 -
>> - - Rl+ 0:05 -
>> PID TTY STAT TIME COMMAND
>> 2979 pts/0 - 0:50 ./burn-cpu
>> - - Sl+ 0:00 -
>> - - Rl+ 0:06 -
>> - - Rl+ 0:06 -
>> - - Rl+ 0:06 -
>> - - Rl+ 0:06 -
>> - - Rl+ 0:06 -
>> - - Rl+ 0:06 -
>> - - Rl+ 0:06 -
>> - - Rl+ 0:06 -
>> PID TTY STAT TIME COMMAND
>> 2979 pts/0 - 0:58 ./burn-cpu
>> - - Sl+ 0:00 -
>> - - Rl+ 0:07 -
>> - - Rl+ 0:07 -
>> - - Rl+ 0:07 -
>> - - Rl+ 0:07 -
>> - - Rl+ 0:07 -
>> - - Rl+ 0:07 -
>> - - Rl+ 0:07 -
>> - - Rl+ 0:07 -
>> PID TTY STAT TIME COMMAND
>> 2979 pts/0 - 1:07 ./burn-cpu
>> - - Sl+ 0:00 -
>> - - Rl+ 0:08 -
>> - - Rl+ 0:08 -
>> - - Rl+ 0:08 -
>> - - Rl+ 0:08 -
>> - - Rl+ 0:08 -
>> - - Rl+ 0:08 -
>> - - Rl+ 0:08 -
>> - - Rl+ 0:08 -
>> PID TTY STAT TIME COMMAND
>> 2979 pts/0 - 1:15 ./burn-cpu
>> - - Sl+ 0:00 -
>> - - Rl+ 0:09 -
>> - - Rl+ 0:09 -
>> - - Rl+ 0:09 -
>> - - Rl+ 0:09 -
>> - - Rl+ 0:09 -
>> - - Rl+ 0:09 -
>> - - Rl+ 0:09 -
>> - - Rl+ 0:09 -
>> PID TTY STAT TIME COMMAND
>> 2979 pts/0 - 1:25 ./burn-cpu
>> - - S+ 1:25 -
>>
>> real 0m10.708s
>> user 1m25.051s
>> sys 0m0.174s
>>
>> looks better, gives an average of 10.63 seconds per thread and the per
>> thread numbers are fine. I'll see what happens with the test case on my
>> pc(a)home.
>>
>
>
> I did a bisection and found commit def0a9b2573e00ab0b486cb5382625203ab4c4a6
> was the origin of the problem on my x86_32 machine.
>
> def0a9b2573e00ab0b486cb5382625203ab4c4a6 is first bad commit
> commit def0a9b2573e00ab0b486cb5382625203ab4c4a6
> Author: Peter Zijlstra <a.p.zijlstra(a)chello.nl>
> Date: Fri Sep 18 20:14:01 2009 +0200
>
> sched_clock: Make it NMI safe
>
> Arjan complained about the suckyness of TSC on modern machines, and
> asked if we could do something about that for PERF_SAMPLE_TIME.
>
> Make cpu_clock() NMI safe by removing the spinlock and using
> cmpxchg. This also makes it smaller and more robust.
>
> Affects architectures that use HAVE_UNSTABLE_SCHED_CLOCK, i.e. IA64
> and x86.
>
> Signed-off-by: Peter Zijlstra <a.p.zijlstra(a)chello.nl>
> LKML-Reference: <new-submission>
> Signed-off-by: Ingo Molnar <mingo(a)elte.hu>
>

Pretty calm lkml these days... must be some kind of event in the states ? :)

Checking this commit, I believe problem comes from cmpxchg(), which doesnt
handle 64 bit on X86_32 (no compilation error, and null operation :( )

We have two (or three choices) :

1) Use cmpxchg64()

2) Fix cmpxchg() to handle 64bit as well (or issue a compilation error)

3) Revert Peter patch :(

Here is the patch I used to get proper operation.

[PATCH] sched_clock: Use cmpxchg64()

Commit def0a9b2573e00ab0b486cb5382625203ab4c4a6
(sched_clock: Make it NMI safe) assumed cmpxchg() of 64bit values was available on X86_32

Signed-off-by: Eric Dumazet <eric.dumazet(a)gmail.com>
---

diff --git a/kernel/sched_clock.c b/kernel/sched_clock.c
index ac2e1dc..479ce56 100644
--- a/kernel/sched_clock.c
+++ b/kernel/sched_clock.c
@@ -127,7 +127,7 @@ again:
clock = wrap_max(clock, min_clock);
clock = wrap_min(clock, max_clock);

- if (cmpxchg(&scd->clock, old_clock, clock) != old_clock)
+ if (cmpxchg64(&scd->clock, old_clock, clock) != old_clock)
goto again;

return clock;
@@ -163,7 +163,7 @@ again:
val = remote_clock;
}

- if (cmpxchg(ptr, old_val, val) != old_val)
+ if (cmpxchg64(ptr, old_val, val) != old_val)
goto again;

return val;
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Linus Torvalds on


On Tue, 29 Sep 2009, Eric Dumazet wrote:
>
> Checking this commit, I believe problem comes from cmpxchg(), which doesnt
> handle 64 bit on X86_32 (no compilation error, and null operation :( )

Wow. That's broken. Very nasty silent failure.

> 1) Use cmpxchg64()

Clearly better than what we have now, although cmpxchg64 does result is
some really disgusting code. We will use a bare CMPXCHG64 only if you
compile for PAE right now - so even if you tell Kconfig that you want to
compile for a modern CPU, we won't be doing that whole cmpxchg64b thing
directly, we'll inline some really disgusting code.

So we really need to fix that up before it would be acceptable to use
cmpxchg64().

And regardless, we should fix the silent cmpxchg failure, even if it's
just a link-time failure or something. 64-bit things in the kernel used to
be rare and special, but they aren't any more.

Ingo, Peter, comments?

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Arjan van de Ven on
On Tue, 29 Sep 2009 14:17:58 -0700 (PDT)
Linus Torvalds <torvalds(a)linux-foundation.org> wrote:

>
>
> On Tue, 29 Sep 2009, Eric Dumazet wrote:
> >
> > Checking this commit, I believe problem comes from cmpxchg(), which
> > doesnt handle 64 bit on X86_32 (no compilation error, and null
> > operation :( )
>
> Wow. That's broken. Very nasty silent failure.
>
> > 1) Use cmpxchg64()
>
> Clearly better than what we have now, although cmpxchg64 does result
> is some really disgusting code. We will use a bare CMPXCHG64 only if
> you compile for PAE right now - so even if you tell Kconfig that you
> want to compile for a modern CPU, we won't be doing that whole
> cmpxchg64b thing directly, we'll inline some really disgusting code.

can't we use alternatives() for this, to patch cmpxchg64 in ?
I mean.. it'll be commonly supported nowadays.. the fallback to it not
being supported could be a bit slower by now...


--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Linus Torvalds on


On Tue, 29 Sep 2009, Arjan van de Ven wrote:
>
> can't we use alternatives() for this, to patch cmpxchg64 in ?
> I mean.. it'll be commonly supported nowadays.. the fallback to it not
> being supported could be a bit slower by now...

Yes, we could. It would limit us to some fixed address format, probably

cmpxchg8b (%esi)

or something. Use something like this as a starting point, perhaps?

NOTE! Totally untested! And you'd actually need to implement the
"cmpxchg8b_emu" function that takes it's arguments in %eax:%edx, %ebx:%ecx
and %esi and doesn't trash any other registers..

Linus

---
arch/x86/include/asm/cmpxchg_32.h | 27 ++++++++++++++-------------
1 files changed, 14 insertions(+), 13 deletions(-)

diff --git a/arch/x86/include/asm/cmpxchg_32.h b/arch/x86/include/asm/cmpxchg_32.h
index 82ceb78..d16a486 100644
--- a/arch/x86/include/asm/cmpxchg_32.h
+++ b/arch/x86/include/asm/cmpxchg_32.h
@@ -312,19 +312,20 @@ static inline unsigned long cmpxchg_386(volatile void *ptr, unsigned long old,

extern unsigned long long cmpxchg_486_u64(volatile void *, u64, u64);

-#define cmpxchg64(ptr, o, n) \
-({ \
- __typeof__(*(ptr)) __ret; \
- if (likely(boot_cpu_data.x86 > 4)) \
- __ret = (__typeof__(*(ptr)))__cmpxchg64((ptr), \
- (unsigned long long)(o), \
- (unsigned long long)(n)); \
- else \
- __ret = (__typeof__(*(ptr)))cmpxchg_486_u64((ptr), \
- (unsigned long long)(o), \
- (unsigned long long)(n)); \
- __ret; \
-})
+#define cmpxchg64(ptr, o, n) \
+({ \
+ __typeof__(*(ptr)) __ret; \
+ __typeof__(*(ptr)) __old = (o); \
+ __typeof__(*(ptr)) __new = (n); \
+ alternative_io(LOCK_PREFIX "cmpxchg8b (%%esi)", \
+ "call cmpxchg8b_emu", \
+ X86_FEATURE_CMPXCHG8B, \
+ "=A" (__ret), \
+ "m" (*(ptr)), "0" (__old), \
+ "b" ((unsigned int)__new), \
+ "c" ((unsigned int)(__new>>32)) ); \
+ __ret; })
+
#define cmpxchg64_local(ptr, o, n) \
({ \
__typeof__(*(ptr)) __ret; \
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Arjan van de Ven on
On Tue, 29 Sep 2009 14:56:28 -0700 (PDT)
Linus Torvalds <torvalds(a)linux-foundation.org> wrote:

>
>
> On Tue, 29 Sep 2009, Arjan van de Ven wrote:
> >
> > can't we use alternatives() for this, to patch cmpxchg64 in ?
> > I mean.. it'll be commonly supported nowadays.. the fallback to it
> > not being supported could be a bit slower by now...
>
> Yes, we could. It would limit us to some fixed address format,
> probably
>
> cmpxchg8b (%esi)
>
> or something. Use something like this as a starting point, perhaps?
>
> NOTE! Totally untested! And you'd actually need to implement the
> "cmpxchg8b_emu" function that takes it's arguments in %eax:%edx,
> %ebx:%ecx and %esi and doesn't trash any other registers..

so I debugged this guy (had a few bugs ;-)

patch, including a new cmpxchg8b_emu below:

From 5a76986c5dd272ea16a9b8abb7349ff3d6791c2b Mon Sep 17 00:00:00 2001
From: Arjan van de Ven <arjan(a)linux.intel.com>
Date: Wed, 30 Sep 2009 17:04:35 +0200
Subject: [PATCH] x86: Provide an alternative() based cmpxchg64()

Based on Linus' patch, this patch provides cmpxchg64() using
the alternative() infrastructure.

Note: the fallback is NOT smp safe, just like the current fallback
is not SMP safe.

Signed-off-by: Arjan van de Ven <arjan(a)linux.intel.com>
---
arch/x86/include/asm/cmpxchg_32.h | 29 ++++++++++--------
arch/x86/kernel/i386_ksyms_32.c | 3 ++
arch/x86/lib/Makefile | 2 +-
arch/x86/lib/cmpxchg8b_emu.S | 61 +++++++++++++++++++++++++++++++++++++
4 files changed, 81 insertions(+), 14 deletions(-)
create mode 100644 arch/x86/lib/cmpxchg8b_emu.S

diff --git a/arch/x86/include/asm/cmpxchg_32.h b/arch/x86/include/asm/cmpxchg_32.h
index 82ceb78..3b21afa 100644
--- a/arch/x86/include/asm/cmpxchg_32.h
+++ b/arch/x86/include/asm/cmpxchg_32.h
@@ -312,19 +312,22 @@ static inline unsigned long cmpxchg_386(volatile void *ptr, unsigned long old,

extern unsigned long long cmpxchg_486_u64(volatile void *, u64, u64);

-#define cmpxchg64(ptr, o, n) \
-({ \
- __typeof__(*(ptr)) __ret; \
- if (likely(boot_cpu_data.x86 > 4)) \
- __ret = (__typeof__(*(ptr)))__cmpxchg64((ptr), \
- (unsigned long long)(o), \
- (unsigned long long)(n)); \
- else \
- __ret = (__typeof__(*(ptr)))cmpxchg_486_u64((ptr), \
- (unsigned long long)(o), \
- (unsigned long long)(n)); \
- __ret; \
-})
+#define cmpxchg64(ptr, o, n) \
+({ \
+ __typeof__(*(ptr)) __ret; \
+ __typeof__(*(ptr)) __old = (o); \
+ __typeof__(*(ptr)) __new = (n); \
+ alternative_io("call cmpxchg8b_emu", \
+ "lock; cmpxchg8b (%%esi)" , \
+ X86_FEATURE_CX8, \
+ "=A" (__ret), \
+ "S" ((ptr)), "0" (__old), \
+ "b" ((unsigned int)__new), \
+ "c" ((unsigned int)(__new>>32))); \
+ __ret; })
+
+
+
#define cmpxchg64_local(ptr, o, n) \
({ \
__typeof__(*(ptr)) __ret; \
diff --git a/arch/x86/kernel/i386_ksyms_32.c b/arch/x86/kernel/i386_ksyms_32.c
index 43cec6b..f17930e 100644
--- a/arch/x86/kernel/i386_ksyms_32.c
+++ b/arch/x86/kernel/i386_ksyms_32.c
@@ -10,6 +10,9 @@
EXPORT_SYMBOL(mcount);
#endif

+extern void cmpxchg8b_emu(void); /* dummy proto */
+EXPORT_SYMBOL(cmpxchg8b_emu);
+
/* Networking helper routines. */
EXPORT_SYMBOL(csum_partial_copy_generic);

diff --git a/arch/x86/lib/Makefile b/arch/x86/lib/Makefile
index 9e60920..3e549b8 100644
--- a/arch/x86/lib/Makefile
+++ b/arch/x86/lib/Makefile
@@ -15,7 +15,7 @@ ifeq ($(CONFIG_X86_32),y)
obj-y += atomic64_32.o
lib-y += checksum_32.o
lib-y += strstr_32.o
- lib-y += semaphore_32.o string_32.o
+ lib-y += semaphore_32.o string_32.o cmpxchg8b_emu.o

lib-$(CONFIG_X86_USE_3DNOW) += mmx_32.o
else
diff --git a/arch/x86/lib/cmpxchg8b_emu.S b/arch/x86/lib/cmpxchg8b_emu.S
new file mode 100644
index 0000000..b8af4c7
--- /dev/null
+++ b/arch/x86/lib/cmpxchg8b_emu.S
@@ -0,0 +1,61 @@
+/*
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; version 2
+ * of the License.
+ *
+ */
+
+#include <linux/linkage.h>
+#include <asm/alternative-asm.h>
+#include <asm/frame.h>
+#include <asm/dwarf2.h>
+
+
+.text
+
+/*
+ * Inputs:
+ * %esi : memory location to compare
+ * %eax : low 32 bits of old value
+ * %edx : high 32 bits of old value
+ * %ebx : low 32 bits of new value
+ * %ecx : high 32 bits of new value
+ */
+ENTRY(cmpxchg8b_emu)
+ CFI_STARTPROC
+
+ push %edi
+ push %ebx
+ push %ecx
+ /* disable interrupts */
+ pushf
+ pop %edi
+ cli
+
+ cmpl %edx, 4(%esi)
+ jne 1f
+ cmpl %eax, (%esi)
+ jne 1f
+
+ xchg (%esi), %ebx
+ xchg 4(%esi), %ecx
+ mov %ebx, %eax
+ mov %ecx, %edx
+
+2:
+ /* restore interrupts */
+ push %edi
+ popf
+
+ pop %ecx
+ pop %ebx
+ pop %edi
+ ret
+1:
+ mov (%esi), %eax
+ mov 4(%esi), %edx
+ jmp 2b
+ CFI_ENDPROC
+ENDPROC(cmpxchg8b_emu)
+
--
1.6.2.5

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/