From: Borislav Petkov on
On Mon, Feb 08, 2010 at 01:35:41AM -0800, H. Peter Anvin wrote:
> On 02/08/2010 01:28 AM, Borislav Petkov wrote:
> >
> >Well, in the second version I did replace a 'call _hweightXX' with
> >the actual popcnt opcode so the alternatives is only needed to do the
> >replacement during boot. We might just as well do
> >
> >if (X86_FEATURE_POPCNT)
> > __hw_popcnt()
> >else
> > __software_hweight()
> >
> >The only advantage of the alternatives is that it would save us the
> >if-else test above each time we do cpumask_weight. However, the if-else
> >approach is much more readable and obviates the need for all that macro
> >magic and taking special care of calling c function from within asm. And
> >since we do not call cpumask_weight all that often I'll honestly opt for
> >alternative-less solution...
> >
>
> The highest performance will be gotten by alternatives, but it only
> make sense if they are inlined at the point of use... otherwise it's
> basically pointless.

The popcnt-replacement part of the alternative would be as fast as
possible since we're adding the opcode there but the slow version would
add the additional overhead of saving/restoring the registers before
calling the software hweight implementation. I'll do some tracing to see
what a change like that would cost on machines which don't have popcnt.

Let me prep another version when I get back on Wed. (currently
travelling) with all the stuff we discussed to see how it would turn.

Thanks,
Boris.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Borislav Petkov on
On Mon, Feb 08, 2010 at 10:59:45AM +0100, Borislav Petkov wrote:
> Let me prep another version when I get back on Wed. (currently
> travelling) with all the stuff we discussed to see how it would turn.

Ok, here's another version ontop of PeterZ's patch at
http://lkml.org/lkml/2010/2/4/119. I need to handle 32- and 64-bit
differently wrt to popcnt opcode so on 32-bit I do "popcnt %eax, %eax"
while on 64-bit I do "popcnt %rdi, %rdi".

I also did some rudimentary tracing with the function graph tracer of
all the cpumask_weight-calls in <kernel/sched.c> while doing a kernel
compile and the preliminary results show that hweight in software takes
about 9.768 usecs the longest while the hardware popcnt about 8.515
usecs. The machine is a Fam10 revB2 quadcore.

What remains to be done is see whether the saving/restoring of
callee-clobbered regs with this patch has any noticeable negative
effects on the software hweight case on machines which don't support
popcnt. Also, I'm open for better tracing ideas :).

Thanks.

---
arch/alpha/include/asm/bitops.h | 4 +
arch/ia64/include/asm/bitops.h | 1 +
arch/sparc/include/asm/bitops_64.h | 4 +
arch/x86/include/asm/bitops.h | 10 +++
arch/x86/lib/Makefile | 2 +-
arch/x86/lib/hweight.c | 96 ++++++++++++++++++++++++++++
include/asm-generic/bitops/arch_hweight.h | 8 +-
include/asm-generic/bitops/const_hweight.h | 43 +++++++++++-
lib/hweight.c | 20 +++---
9 files changed, 169 insertions(+), 19 deletions(-)
create mode 100644 arch/x86/lib/hweight.c

diff --git a/arch/alpha/include/asm/bitops.h b/arch/alpha/include/asm/bitops.h
index 296da1d..7fe6943 100644
--- a/arch/alpha/include/asm/bitops.h
+++ b/arch/alpha/include/asm/bitops.h
@@ -409,21 +409,25 @@ static inline unsigned long __arch_hweight64(unsigned long w)
{
return __kernel_ctpop(w);
}
+#define __arch_hweight64 __arch_hweight64

static inline unsigned int __arch_weight32(unsigned int w)
{
return __arch_hweight64(w);
}
+#define __arch_hweight32 __arch_hweight32

static inline unsigned int __arch_hweight16(unsigned int w)
{
return __arch_hweight64(w & 0xffff);
}
+#define __arch_hweight16 __arch_hweight16

static inline unsigned int __arch_hweight8(unsigned int w)
{
return __arch_hweight64(w & 0xff);
}
+#define __arch_hweight8 __arch_hweight8
#else
#include <asm-generic/bitops/arch_hweight.h>
#endif
diff --git a/arch/ia64/include/asm/bitops.h b/arch/ia64/include/asm/bitops.h
index 9da3df6..9b64f59 100644
--- a/arch/ia64/include/asm/bitops.h
+++ b/arch/ia64/include/asm/bitops.h
@@ -443,6 +443,7 @@ static __inline__ unsigned long __arch_hweight64(unsigned long x)
result = ia64_popcnt(x);
return result;
}
+#define __arch_hweight64 __arch_hweight64

#define __arch_hweight32(x) ((unsigned int) __arch_hweight64((x) & 0xfffffffful))
#define __arch_hweight16(x) ((unsigned int) __arch_hweight64((x) & 0xfffful))
diff --git a/arch/sparc/include/asm/bitops_64.h b/arch/sparc/include/asm/bitops_64.h
index 766121a..4982bc4 100644
--- a/arch/sparc/include/asm/bitops_64.h
+++ b/arch/sparc/include/asm/bitops_64.h
@@ -51,6 +51,7 @@ static inline unsigned int __arch_hweight64(unsigned long w)
__asm__ ("popc %1,%0" : "=r" (res) : "r" (w));
return res;
}
+#define __arch_hweight64 __arch_hweight64

static inline unsigned int __arch_hweight32(unsigned int w)
{
@@ -59,6 +60,7 @@ static inline unsigned int __arch_hweight32(unsigned int w)
__asm__ ("popc %1,%0" : "=r" (res) : "r" (w & 0xffffffff));
return res;
}
+#define __arch_hweight32 __arch_hweight32

static inline unsigned int __arch_hweight16(unsigned int w)
{
@@ -67,6 +69,7 @@ static inline unsigned int __arch_hweight16(unsigned int w)
__asm__ ("popc %1,%0" : "=r" (res) : "r" (w & 0xffff));
return res;
}
+#define __arch_hweight16 __arch_hweight16

static inline unsigned int __arch_hweight8(unsigned int w)
{
@@ -75,6 +78,7 @@ static inline unsigned int __arch_hweight8(unsigned int w)
__asm__ ("popc %1,%0" : "=r" (res) : "r" (w & 0xff));
return res;
}
+#define __arch_hweight8 __arch_hweight8

#else

diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index 02b47a6..187defe 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -444,6 +444,16 @@ static inline int fls(int x)

#define ARCH_HAS_FAST_MULTIPLIER 1

+extern unsigned int __arch_hweight32(unsigned int w);
+extern unsigned int __arch_hweight16(unsigned int w);
+extern unsigned int __arch_hweight8(unsigned int w);
+extern unsigned long __arch_hweight64(__u64 w);
+
+#define __arch_hweight32 __arch_hweight32
+#define __arch_hweight16 __arch_hweight16
+#define __arch_hweight8 __arch_hweight8
+#define __arch_hweight64 __arch_hweight64
+
#include <asm-generic/bitops/hweight.h>

#endif /* __KERNEL__ */
diff --git a/arch/x86/lib/Makefile b/arch/x86/lib/Makefile
index cffd754..e811bbd 100644
--- a/arch/x86/lib/Makefile
+++ b/arch/x86/lib/Makefile
@@ -22,7 +22,7 @@ lib-y += usercopy_$(BITS).o getuser.o putuser.o
lib-y += memcpy_$(BITS).o
lib-$(CONFIG_KPROBES) += insn.o inat.o

-obj-y += msr.o msr-reg.o msr-reg-export.o
+obj-y += msr.o msr-reg.o msr-reg-export.o hweight.o

ifeq ($(CONFIG_X86_32),y)
obj-y += atomic64_32.o
diff --git a/arch/x86/lib/hweight.c b/arch/x86/lib/hweight.c
new file mode 100644
index 0000000..4a1c5c9
--- /dev/null
+++ b/arch/x86/lib/hweight.c
@@ -0,0 +1,96 @@
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/bitops.h>
+
+/* popcnt %eax, %eax */
+#define POPCNT32 ".byte 0xf3\n\t.byte 0x0f\n\t.byte 0xb8\n\t.byte 0xc0"
+
+/* popcnt %rdi, %rdi */
+#define POPCNT64 ".byte 0xf3\n\t.byte 0x48\n\t.byte 0x0f\n\t.byte 0xb8\n\t.byte 0xff"
+
+#define PUSH_CLOBBERED_32 \
+ "pushl %%edx\n\t" \
+ "pushl %%ecx\n\t"
+
+#define POP_CLOBBERED_32 \
+ "\n\t" \
+ "popl %%ecx\n\t" \
+ "popl %%edx\n\t"
+
+#define PUSH_CLOBBERED_64 \
+ "pushq %%rax\n\t" \
+ "pushq %%rsi\n\t" \
+ "pushq %%rdx\n\t" \
+ "pushq %%rcx\n\t" \
+ "pushq %%r8\n\t" \
+ "pushq %%r9\n\t" \
+ "pushq %%r10\n\t" \
+ "pushq %%r11\n\t"
+
+#define POP_CLOBBERED_64 \
+ "\n\t" \
+ "popq %%r11\n\t" \
+ "popq %%r10\n\t" \
+ "popq %%r9\n\t" \
+ "popq %%r8\n\t" \
+ "popq %%rcx\n\t" \
+ "popq %%rdx\n\t" \
+ "popq %%rsi\n\t" \
+ "popq %%rax\n\t"
+
+#ifdef CONFIG_64BIT
+#define PUSH_CLOBBERED PUSH_CLOBBERED_64
+#define POP_CLOBBERED POP_CLOBBERED_64
+#define POPCNT POPCNT64
+#define ARG0 "D"
+#else
+#define PUSH_CLOBBERED PUSH_CLOBBERED_32
+#define POP_CLOBBERED POP_CLOBBERED_32
+#define POPCNT POPCNT32
+#define ARG0 "a"
+#endif
+
+unsigned int __arch_hweight32(unsigned int w)
+{
+ unsigned int res = 0;
+
+ asm volatile(PUSH_CLOBBERED
+ ALTERNATIVE("call __sw_hweight32", POPCNT, X86_FEATURE_POPCNT)
+ POP_CLOBBERED
+ : "="ARG0 (res)
+ : ARG0 (w));
+
+ return res;
+}
+EXPORT_SYMBOL(__arch_hweight32);
+
+unsigned int __arch_hweight16(unsigned int w)
+{
+ return __arch_hweight32(w & 0xffff);
+}
+EXPORT_SYMBOL(__arch_hweight16);
+
+unsigned int __arch_hweight8(unsigned int w)
+{
+ return __arch_hweight32(w & 0xff);
+}
+EXPORT_SYMBOL(__arch_hweight8);
+
+unsigned long __arch_hweight64(__u64 w)
+{
+ unsigned long res = 0;
+
+#ifdef CONFIG_X86_32
+ return __arch_hweight32((u32)w) +
+ __arch_hweight32((u32)(w >> 32));
+#else
+ asm volatile(PUSH_CLOBBERED
+ ALTERNATIVE("call __sw_hweight64", POPCNT, X86_FEATURE_POPCNT)
+ POP_CLOBBERED
+ : "="ARG0 (res)
+ : ARG0 (w));
+#endif /* CONFIG_X86_32 */
+
+ return res;
+}
+EXPORT_SYMBOL(__arch_hweight64);
diff --git a/include/asm-generic/bitops/arch_hweight.h b/include/asm-generic/bitops/arch_hweight.h
index 3a7be84..c72de8b 100644
--- a/include/asm-generic/bitops/arch_hweight.h
+++ b/include/asm-generic/bitops/arch_hweight.h
@@ -3,9 +3,9 @@

#include <asm/types.h>

-extern unsigned int __arch_hweight32(unsigned int w);
-extern unsigned int __arch_hweight16(unsigned int w);
-extern unsigned int __arch_hweight8(unsigned int w);
-extern unsigned long __arch_hweight64(__u64 w);
+extern unsigned int __sw_hweight32(unsigned int w);
+extern unsigned int __sw_hweight16(unsigned int w);
+extern unsigned int __sw_hweight8(unsigned int w);
+extern unsigned long __sw_hweight64(__u64 w);

#endif /* _ASM_GENERIC_BITOPS_HWEIGHT_H_ */
diff --git a/include/asm-generic/bitops/const_hweight.h b/include/asm-generic/bitops/const_hweight.h
index fa2a50b..8cf8169 100644
--- a/include/asm-generic/bitops/const_hweight.h
+++ b/include/asm-generic/bitops/const_hweight.h
@@ -18,13 +18,48 @@
#define __const_hweight32(w) (__const_hweight16(w) + __const_hweight16((w) >> 16))
#define __const_hweight64(w) (__const_hweight32(w) + __const_hweight32((w) >> 32))

+static inline unsigned int _hweight8(unsigned int w)
+{
+#ifdef __arch_hweight8
+ return __arch_hweight8(w);
+#else
+ return __sw_hweight8(w);
+#endif
+}
+
+static inline unsigned int _hweight16(unsigned int w)
+{
+#ifdef __arch_hweight16
+ return __arch_hweight16(w);
+#else
+ return __sw_hweight16(w);
+#endif
+}
+
+static inline unsigned int _hweight32(unsigned int w)
+{
+#ifdef __arch_hweight32
+ return __arch_hweight32(w);
+#else
+ return __sw_hweight32(w);
+#endif
+}
+
+static inline unsigned long _hweight64(__u64 w)
+{
+#ifdef __arch_hweight64
+ return __arch_hweight64(w);
+#else
+ return __sw_hweight64(w);
+#endif
+}
/*
* Generic interface.
*/
-#define hweight8(w) (__builtin_constant_p(w) ? __const_hweight8(w) : __arch_hweight8(w))
-#define hweight16(w) (__builtin_constant_p(w) ? __const_hweight16(w) : __arch_hweight16(w))
-#define hweight32(w) (__builtin_constant_p(w) ? __const_hweight32(w) : __arch_hweight32(w))
-#define hweight64(w) (__builtin_constant_p(w) ? __const_hweight64(w) : __arch_hweight64(w))
+#define hweight8(w) (__builtin_constant_p(w) ? __const_hweight8(w) : _hweight8(w))
+#define hweight16(w) (__builtin_constant_p(w) ? __const_hweight16(w) : _hweight16(w))
+#define hweight32(w) (__builtin_constant_p(w) ? __const_hweight32(w) : _hweight32(w))
+#define hweight64(w) (__builtin_constant_p(w) ? __const_hweight64(w) : _hweight64(w))

/*
* Interface for known constant arguments
diff --git a/lib/hweight.c b/lib/hweight.c
index 9ff86df..f9ce440 100644
--- a/lib/hweight.c
+++ b/lib/hweight.c
@@ -9,7 +9,7 @@
* The Hamming Weight of a number is the total number of bits set in it.
*/

-unsigned int __arch_hweight32(unsigned int w)
+unsigned int __sw_hweight32(unsigned int w)
{
unsigned int res = w - ((w >> 1) & 0x55555555);
res = (res & 0x33333333) + ((res >> 2) & 0x33333333);
@@ -17,30 +17,30 @@ unsigned int __arch_hweight32(unsigned int w)
res = res + (res >> 8);
return (res + (res >> 16)) & 0x000000FF;
}
-EXPORT_SYMBOL(__arch_hweight32);
+EXPORT_SYMBOL(__sw_hweight32);

-unsigned int __arch_hweight16(unsigned int w)
+unsigned int __sw_hweight16(unsigned int w)
{
unsigned int res = w - ((w >> 1) & 0x5555);
res = (res & 0x3333) + ((res >> 2) & 0x3333);
res = (res + (res >> 4)) & 0x0F0F;
return (res + (res >> 8)) & 0x00FF;
}
-EXPORT_SYMBOL(__arch_hweight16);
+EXPORT_SYMBOL(__sw_hweight16);

-unsigned int __arch_hweight8(unsigned int w)
+unsigned int __sw_hweight8(unsigned int w)
{
unsigned int res = w - ((w >> 1) & 0x55);
res = (res & 0x33) + ((res >> 2) & 0x33);
return (res + (res >> 4)) & 0x0F;
}
-EXPORT_SYMBOL(__arch_hweight8);
+EXPORT_SYMBOL(__sw_hweight8);

-unsigned long __arch_hweight64(__u64 w)
+unsigned long __sw_hweight64(__u64 w)
{
#if BITS_PER_LONG == 32
- return __arch_hweight32((unsigned int)(w >> 32)) +
- __arch_hweight32((unsigned int)w);
+ return __sw_hweight32((unsigned int)(w >> 32)) +
+ __sw_hweight32((unsigned int)w);
#elif BITS_PER_LONG == 64
#ifdef ARCH_HAS_FAST_MULTIPLIER
w -= (w >> 1) & 0x5555555555555555ul;
@@ -57,4 +57,4 @@ unsigned long __arch_hweight64(__u64 w)
#endif
#endif
}
-EXPORT_SYMBOL(__arch_hweight64);
+EXPORT_SYMBOL(__sw_hweight64);
--
1.6.6.1


--
Regards/Gruss,
Boris.

--
Advanced Micro Devices, Inc.
Operating Systems Research Center
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Borislav Petkov on
On Thu, Feb 11, 2010 at 09:33:49AM -0800, H. Peter Anvin wrote:
> You don't do the push/pop inline -- if you're going to take the hit of
> pushing this into the caller, it's better to list them as explicit
> clobbers and let the compiler figure out how to do it. The point of
> doing an explicit push/pop is that it can be pushed into the out-of-line
> subroutine.

Doh! Of course, this was wrong. See below.

> Furthermore, you're still putting "volatile" on there... this is a pure
> computation -- no side effects -- so it is exactly when you *shouldn't*
> declare your asm statement volatile.

Also gone.

> Note: given how simple and regular a popcnt actually is, it might be
> preferrable to have the out-of-line implementation either in assembly,
> or using gcc's -fcall-saved-* options to reduce the number of registers
> that is clobbered by the routine.

Yep, this is actually a very good idea. And it seems to work, I've
tested it on an F10, K8 and even on an Atom and no blow ups so far...

---
arch/alpha/include/asm/bitops.h | 4 ++
arch/ia64/include/asm/bitops.h | 1 +
arch/sparc/include/asm/bitops_64.h | 4 ++
arch/x86/include/asm/bitops.h | 10 ++++
arch/x86/lib/Makefile | 8 +++-
arch/x86/lib/hweight.c | 72 ++++++++++++++++++++++++++++
include/asm-generic/bitops/arch_hweight.h | 8 ++--
include/asm-generic/bitops/const_hweight.h | 43 +++++++++++++++--
lib/hweight.c | 20 ++++----
9 files changed, 151 insertions(+), 19 deletions(-)
create mode 100644 arch/x86/lib/hweight.c

diff --git a/arch/alpha/include/asm/bitops.h b/arch/alpha/include/asm/bitops.h
index 296da1d..7fe6943 100644
--- a/arch/alpha/include/asm/bitops.h
+++ b/arch/alpha/include/asm/bitops.h
@@ -409,21 +409,25 @@ static inline unsigned long __arch_hweight64(unsigned long w)
{
return __kernel_ctpop(w);
}
+#define __arch_hweight64 __arch_hweight64

static inline unsigned int __arch_weight32(unsigned int w)
{
return __arch_hweight64(w);
}
+#define __arch_hweight32 __arch_hweight32

static inline unsigned int __arch_hweight16(unsigned int w)
{
return __arch_hweight64(w & 0xffff);
}
+#define __arch_hweight16 __arch_hweight16

static inline unsigned int __arch_hweight8(unsigned int w)
{
return __arch_hweight64(w & 0xff);
}
+#define __arch_hweight8 __arch_hweight8
#else
#include <asm-generic/bitops/arch_hweight.h>
#endif
diff --git a/arch/ia64/include/asm/bitops.h b/arch/ia64/include/asm/bitops.h
index 9da3df6..9b64f59 100644
--- a/arch/ia64/include/asm/bitops.h
+++ b/arch/ia64/include/asm/bitops.h
@@ -443,6 +443,7 @@ static __inline__ unsigned long __arch_hweight64(unsigned long x)
result = ia64_popcnt(x);
return result;
}
+#define __arch_hweight64 __arch_hweight64

#define __arch_hweight32(x) ((unsigned int) __arch_hweight64((x) & 0xfffffffful))
#define __arch_hweight16(x) ((unsigned int) __arch_hweight64((x) & 0xfffful))
diff --git a/arch/sparc/include/asm/bitops_64.h b/arch/sparc/include/asm/bitops_64.h
index 766121a..4982bc4 100644
--- a/arch/sparc/include/asm/bitops_64.h
+++ b/arch/sparc/include/asm/bitops_64.h
@@ -51,6 +51,7 @@ static inline unsigned int __arch_hweight64(unsigned long w)
__asm__ ("popc %1,%0" : "=r" (res) : "r" (w));
return res;
}
+#define __arch_hweight64 __arch_hweight64

static inline unsigned int __arch_hweight32(unsigned int w)
{
@@ -59,6 +60,7 @@ static inline unsigned int __arch_hweight32(unsigned int w)
__asm__ ("popc %1,%0" : "=r" (res) : "r" (w & 0xffffffff));
return res;
}
+#define __arch_hweight32 __arch_hweight32

static inline unsigned int __arch_hweight16(unsigned int w)
{
@@ -67,6 +69,7 @@ static inline unsigned int __arch_hweight16(unsigned int w)
__asm__ ("popc %1,%0" : "=r" (res) : "r" (w & 0xffff));
return res;
}
+#define __arch_hweight16 __arch_hweight16

static inline unsigned int __arch_hweight8(unsigned int w)
{
@@ -75,6 +78,7 @@ static inline unsigned int __arch_hweight8(unsigned int w)
__asm__ ("popc %1,%0" : "=r" (res) : "r" (w & 0xff));
return res;
}
+#define __arch_hweight8 __arch_hweight8

#else

diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index 02b47a6..187defe 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -444,6 +444,16 @@ static inline int fls(int x)

#define ARCH_HAS_FAST_MULTIPLIER 1

+extern unsigned int __arch_hweight32(unsigned int w);
+extern unsigned int __arch_hweight16(unsigned int w);
+extern unsigned int __arch_hweight8(unsigned int w);
+extern unsigned long __arch_hweight64(__u64 w);
+
+#define __arch_hweight32 __arch_hweight32
+#define __arch_hweight16 __arch_hweight16
+#define __arch_hweight8 __arch_hweight8
+#define __arch_hweight64 __arch_hweight64
+
#include <asm-generic/bitops/hweight.h>

#endif /* __KERNEL__ */
diff --git a/arch/x86/lib/Makefile b/arch/x86/lib/Makefile
index cffd754..ddc32eb 100644
--- a/arch/x86/lib/Makefile
+++ b/arch/x86/lib/Makefile
@@ -22,9 +22,11 @@ lib-y += usercopy_$(BITS).o getuser.o putuser.o
lib-y += memcpy_$(BITS).o
lib-$(CONFIG_KPROBES) += insn.o inat.o

-obj-y += msr.o msr-reg.o msr-reg-export.o
+obj-y += msr.o msr-reg.o msr-reg-export.o hweight.o

ifeq ($(CONFIG_X86_32),y)
+ CFLAGS_hweight.o = -fcall-saved-ecx -fcall-saved-edx
+
obj-y += atomic64_32.o
lib-y += checksum_32.o
lib-y += strstr_32.o
@@ -34,6 +36,10 @@ ifneq ($(CONFIG_X86_CMPXCHG64),y)
endif
lib-$(CONFIG_X86_USE_3DNOW) += mmx_32.o
else
+ CFLAGS_hweight.o = -fcall-saved-rsi -fcall-saved-rdx -fcall-saved-rcx \
+ -fcall-saved-r8 -fcall-saved-r9 -fcall-saved-r10 \
+ -fcall-saved-r11
+
obj-y += io_64.o iomap_copy_64.o
lib-y += csum-partial_64.o csum-copy_64.o csum-wrappers_64.o
lib-y += thunk_64.o clear_page_64.o copy_page_64.o
diff --git a/arch/x86/lib/hweight.c b/arch/x86/lib/hweight.c
new file mode 100644
index 0000000..9b4bfa5
--- /dev/null
+++ b/arch/x86/lib/hweight.c
@@ -0,0 +1,72 @@
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/bitops.h>
+
+#ifdef CONFIG_64BIT
+/* popcnt %rdi, %rax */
+#define POPCNT ".byte 0xf3\n\t.byte 0x48\n\t.byte 0x0f\n\t.byte 0xb8\n\t.byte 0xc7"
+#define REG_IN "D"
+#define REG_OUT "a"
+#else
+/* popcnt %eax, %eax */
+#define POPCNT ".byte 0xf3\n\t.byte 0x0f\n\t.byte 0xb8\n\t.byte 0xc0"
+#define REG_IN "a"
+#define REG_OUT "a"
+#endif
+
+/*
+ * Those are called out-of-line in the alternative below and are added here only
+ * so that gcc is able to figure out which registers have been clobbered by
+ * __sw_hweightXX so that it could restore their values before returning from
+ * the __arch_hweightXX versions. See also <arch/x86/lib/Makefile>.
+ */
+unsigned int _sw_hweight32(unsigned int w)
+{
+ return __sw_hweight32(w);
+}
+
+unsigned long _sw_hweight64(__u64 w)
+{
+ return __sw_hweight64(w);
+}
+
+unsigned int __arch_hweight32(unsigned int w)
+{
+ unsigned int res = 0;
+
+ asm (ALTERNATIVE("call _sw_hweight32", POPCNT, X86_FEATURE_POPCNT)
+ : "="REG_OUT (res)
+ : REG_IN (w));
+
+ return res;
+}
+EXPORT_SYMBOL(__arch_hweight32);
+
+unsigned int __arch_hweight16(unsigned int w)
+{
+ return __arch_hweight32(w & 0xffff);
+}
+EXPORT_SYMBOL(__arch_hweight16);
+
+unsigned int __arch_hweight8(unsigned int w)
+{
+ return __arch_hweight32(w & 0xff);
+}
+EXPORT_SYMBOL(__arch_hweight8);
+
+unsigned long __arch_hweight64(__u64 w)
+{
+ unsigned long res = 0;
+
+#ifdef CONFIG_X86_32
+ return __arch_hweight32((u32)w) +
+ __arch_hweight32((u32)(w >> 32));
+#else
+ asm (ALTERNATIVE("call _sw_hweight64", POPCNT, X86_FEATURE_POPCNT)
+ : "="REG_OUT (res)
+ : REG_IN (w));
+#endif /* CONFIG_X86_32 */
+
+ return res;
+}
+EXPORT_SYMBOL(__arch_hweight64);
diff --git a/include/asm-generic/bitops/arch_hweight.h b/include/asm-generic/bitops/arch_hweight.h
index 3a7be84..c72de8b 100644
--- a/include/asm-generic/bitops/arch_hweight.h
+++ b/include/asm-generic/bitops/arch_hweight.h
@@ -3,9 +3,9 @@

#include <asm/types.h>

-extern unsigned int __arch_hweight32(unsigned int w);
-extern unsigned int __arch_hweight16(unsigned int w);
-extern unsigned int __arch_hweight8(unsigned int w);
-extern unsigned long __arch_hweight64(__u64 w);
+extern unsigned int __sw_hweight32(unsigned int w);
+extern unsigned int __sw_hweight16(unsigned int w);
+extern unsigned int __sw_hweight8(unsigned int w);
+extern unsigned long __sw_hweight64(__u64 w);

#endif /* _ASM_GENERIC_BITOPS_HWEIGHT_H_ */
diff --git a/include/asm-generic/bitops/const_hweight.h b/include/asm-generic/bitops/const_hweight.h
index fa2a50b..8cf8169 100644
--- a/include/asm-generic/bitops/const_hweight.h
+++ b/include/asm-generic/bitops/const_hweight.h
@@ -18,13 +18,48 @@
#define __const_hweight32(w) (__const_hweight16(w) + __const_hweight16((w) >> 16))
#define __const_hweight64(w) (__const_hweight32(w) + __const_hweight32((w) >> 32))

+static inline unsigned int _hweight8(unsigned int w)
+{
+#ifdef __arch_hweight8
+ return __arch_hweight8(w);
+#else
+ return __sw_hweight8(w);
+#endif
+}
+
+static inline unsigned int _hweight16(unsigned int w)
+{
+#ifdef __arch_hweight16
+ return __arch_hweight16(w);
+#else
+ return __sw_hweight16(w);
+#endif
+}
+
+static inline unsigned int _hweight32(unsigned int w)
+{
+#ifdef __arch_hweight32
+ return __arch_hweight32(w);
+#else
+ return __sw_hweight32(w);
+#endif
+}
+
+static inline unsigned long _hweight64(__u64 w)
+{
+#ifdef __arch_hweight64
+ return __arch_hweight64(w);
+#else
+ return __sw_hweight64(w);
+#endif
+}
/*
* Generic interface.
*/
-#define hweight8(w) (__builtin_constant_p(w) ? __const_hweight8(w) : __arch_hweight8(w))
-#define hweight16(w) (__builtin_constant_p(w) ? __const_hweight16(w) : __arch_hweight16(w))
-#define hweight32(w) (__builtin_constant_p(w) ? __const_hweight32(w) : __arch_hweight32(w))
-#define hweight64(w) (__builtin_constant_p(w) ? __const_hweight64(w) : __arch_hweight64(w))
+#define hweight8(w) (__builtin_constant_p(w) ? __const_hweight8(w) : _hweight8(w))
+#define hweight16(w) (__builtin_constant_p(w) ? __const_hweight16(w) : _hweight16(w))
+#define hweight32(w) (__builtin_constant_p(w) ? __const_hweight32(w) : _hweight32(w))
+#define hweight64(w) (__builtin_constant_p(w) ? __const_hweight64(w) : _hweight64(w))

/*
* Interface for known constant arguments
diff --git a/lib/hweight.c b/lib/hweight.c
index 9ff86df..f9ce440 100644
--- a/lib/hweight.c
+++ b/lib/hweight.c
@@ -9,7 +9,7 @@
* The Hamming Weight of a number is the total number of bits set in it.
*/

-unsigned int __arch_hweight32(unsigned int w)
+unsigned int __sw_hweight32(unsigned int w)
{
unsigned int res = w - ((w >> 1) & 0x55555555);
res = (res & 0x33333333) + ((res >> 2) & 0x33333333);
@@ -17,30 +17,30 @@ unsigned int __arch_hweight32(unsigned int w)
res = res + (res >> 8);
return (res + (res >> 16)) & 0x000000FF;
}
-EXPORT_SYMBOL(__arch_hweight32);
+EXPORT_SYMBOL(__sw_hweight32);

-unsigned int __arch_hweight16(unsigned int w)
+unsigned int __sw_hweight16(unsigned int w)
{
unsigned int res = w - ((w >> 1) & 0x5555);
res = (res & 0x3333) + ((res >> 2) & 0x3333);
res = (res + (res >> 4)) & 0x0F0F;
return (res + (res >> 8)) & 0x00FF;
}
-EXPORT_SYMBOL(__arch_hweight16);
+EXPORT_SYMBOL(__sw_hweight16);

-unsigned int __arch_hweight8(unsigned int w)
+unsigned int __sw_hweight8(unsigned int w)
{
unsigned int res = w - ((w >> 1) & 0x55);
res = (res & 0x33) + ((res >> 2) & 0x33);
return (res + (res >> 4)) & 0x0F;
}
-EXPORT_SYMBOL(__arch_hweight8);
+EXPORT_SYMBOL(__sw_hweight8);

-unsigned long __arch_hweight64(__u64 w)
+unsigned long __sw_hweight64(__u64 w)
{
#if BITS_PER_LONG == 32
- return __arch_hweight32((unsigned int)(w >> 32)) +
- __arch_hweight32((unsigned int)w);
+ return __sw_hweight32((unsigned int)(w >> 32)) +
+ __sw_hweight32((unsigned int)w);
#elif BITS_PER_LONG == 64
#ifdef ARCH_HAS_FAST_MULTIPLIER
w -= (w >> 1) & 0x5555555555555555ul;
@@ -57,4 +57,4 @@ unsigned long __arch_hweight64(__u64 w)
#endif
#endif
}
-EXPORT_SYMBOL(__arch_hweight64);
+EXPORT_SYMBOL(__sw_hweight64);
--
1.6.6.1


--
Regards/Gruss,
Boris.

--
Advanced Micro Devices, Inc.
Operating Systems Research Center
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Borislav Petkov on
On Fri, Feb 12, 2010 at 09:28:32AM -0800, H. Peter Anvin wrote:
> On 02/12/2010 09:06 AM, Borislav Petkov wrote:
> > On Thu, Feb 11, 2010 at 09:33:49AM -0800, H. Peter Anvin wrote:
> >> You don't do the push/pop inline -- if you're going to take the hit of
> >> pushing this into the caller, it's better to list them as explicit
> >> clobbers and let the compiler figure out how to do it. The point of
> >> doing an explicit push/pop is that it can be pushed into the out-of-line
> >> subroutine.
> >
> > Doh! Of course, this was wrong. See below.
> >
> > diff --git a/arch/x86/lib/hweight.c b/arch/x86/lib/hweight.c
> > new file mode 100644
> > index 0000000..9b4bfa5
> > --- /dev/null
> > +++ b/arch/x86/lib/hweight.c
> > @@ -0,0 +1,72 @@
> > +#include <linux/kernel.h>
> > +#include <linux/module.h>
> > +#include <linux/bitops.h>
> > +
> > +#ifdef CONFIG_64BIT
> > +/* popcnt %rdi, %rax */
> > +#define POPCNT ".byte 0xf3\n\t.byte 0x48\n\t.byte 0x0f\n\t.byte 0xb8\n\t.byte 0xc7"
> > +#define REG_IN "D"
> > +#define REG_OUT "a"
> > +#else
> > +/* popcnt %eax, %eax */
> > +#define POPCNT ".byte 0xf3\n\t.byte 0x0f\n\t.byte 0xb8\n\t.byte 0xc0"
> > +#define REG_IN "a"
> > +#define REG_OUT "a"
> > +#endif
> > +
> > +/*
> > + * Those are called out-of-line in the alternative below and are added here only
> > + * so that gcc is able to figure out which registers have been clobbered by
> > + * __sw_hweightXX so that it could restore their values before returning from
> > + * the __arch_hweightXX versions. See also <arch/x86/lib/Makefile>.
> > + */
> > +unsigned int _sw_hweight32(unsigned int w)
> > +{
> > + return __sw_hweight32(w);
> > +}
> > +
> > +unsigned long _sw_hweight64(__u64 w)
> > +{
> > + return __sw_hweight64(w);
> > +}
> > +
> > +unsigned int __arch_hweight32(unsigned int w)
> > +{
> > + unsigned int res = 0;
> > +
> > + asm (ALTERNATIVE("call _sw_hweight32", POPCNT, X86_FEATURE_POPCNT)
> > + : "="REG_OUT (res)
> > + : REG_IN (w));
> > +
>
> Now you have no clobbers and no assembly wrapper. That really doesn't work.

I see, this means we have to build lib/hweight.c with -fcall-saved*. I
just did a test and it uses %rcx and %rdx temporarily by saving their
values on the stack and restoring them before it returns.

However, this is generic code and for the above to work we have to
enforce x86-specific CFLAGS for it. What is the preferred way to do
that?

--
Regards/Gruss,
Boris.

--
Advanced Micro Devices, Inc.
Operating Systems Research Center
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Peter Zijlstra on
On Thu, 2010-02-11 at 18:24 +0100, Borislav Petkov wrote:
> On Mon, Feb 08, 2010 at 10:59:45AM +0100, Borislav Petkov wrote:
> > Let me prep another version when I get back on Wed. (currently
> > travelling) with all the stuff we discussed to see how it would turn.
>
> Ok, here's another version ontop of PeterZ's patch at
> http://lkml.org/lkml/2010/2/4/119. I need to handle 32- and 64-bit
> differently wrt to popcnt opcode so on 32-bit I do "popcnt %eax, %eax"
> while on 64-bit I do "popcnt %rdi, %rdi".

Right, so I don't like how you need to touch !x86 for this, and I think
that is easily avoidable by not making x86 include
asm-generic/bitops/arch_hweight.h.

If you then add __sw_hweightN() -> __arch_hweightN() wrappers in
arch_hweight.h, then you can leave const_hweight.h use __arch_hweightN()
and simply provide __arch_hweightN() from x86/include/asm/bitops.h


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/