[PATCH 1/3] lib: vsprintf: optimised put_dec_trunc() and put_dec_full()

From: Michal Nazarewicz on 5 Aug 2010 18:40

The put_dec_trunc() and put_dec_full() functions were based on
code optimised for processors with an 8-bit ALU, but even then
they did not actually fit that constraint and in fact required
at least a 16-bit ALU (because at least one number they operate
on can take 9 bits).

The version of those functions proposed by this patch goes
further and uses the full capacity of a 32-bit ALU. Instead of
splitting the number into nibbles and operating on them, it
performs the obvious algorithm for base conversion, except that
it uses optimised code for dividing by ten (i.e. no division is
actually performed).

Signed-off-by: Michal Nazarewicz <mina86(a)mina86.com>
---
lib/vsprintf.c | 150 +++++++++++++++++++++++++++----------------------------
1 files changed, 74 insertions(+), 76 deletions(-)

I ran some benchmarks on the following three processors:

Phenom: AMD Phenom(tm) II X3 710 Processor (64-bit)
Atom: Intel(R) Atom(TM) CPU N270 @ 1.60GHz (32-bit)
ARM: ARMv7 Processor rev 2 (v7l) (32-bit)

Here are the results (normalised to the fastest/smallest):

                   :      ARM   Phenom     Atom
-- Speed -------------------------------------------------------
orig_put_dec_full  : 1.078600 1.777800 1.356917  Original
mod1_put_dec_full  : 1.000000 1.117665 1.017742
mod3_put_dec_full  : 1.032507 1.000000 1.000000  Proposed

orig_put_dec_trunc : 1.092177 1.657014 1.215658  Original
mod1_put_dec_trunc : 1.006836 1.395088 1.078385
mod3_put_dec_trunc : 1.000000 1.000000 1.000000  Proposed
-- Size --------------------------------------------------------
orig_put_dec_full  : 1.212766 1.355372 1.310345  Original
mod1_put_dec_full  : 1.021277 1.000000 1.000000
mod3_put_dec_full  : 1.000000 1.049587 1.172414  Proposed

orig_put_dec_trunc : 1.363636 1.784000 1.317365  Original
mod1_put_dec_trunc : 1.181818 1.400000 1.275449
mod3_put_dec_trunc : 1.000000 1.000000 1.000000  Proposed

The source of the benchmark, as well as the code of all the modified
versions of the functions, is included in the third patch of this patchset.

As can be seen from the table, the "mod3" version (proposed by
this patch) is the fastest, with the sole exception of
"mod3_put_dec_full" on ARM, which is slightly slower than the
"mod1_put_dec_full" version.

It is also smaller, in terms of code size, than the original version,
although "mod1_put_dec_full" is smaller still on Phenom and Atom.

In the end, I'm proposing "mod3" because the size is not that
important (those are mere bytes), and as for speed, for ARM I have
proposed another solution in the next patch of this patchset.

The function is also shorter in terms of lines of code. ;)

I'm currently running 2.6.35 with this patch applied. It applies just
fine on -next as well, but I haven't tested that kernel, and I've run it
with -next on ARM.

diff --git a/lib/vsprintf.c b/lib/vsprintf.c
index b8a2f54..d63fb15 100644
--- a/lib/vsprintf.c
+++ b/lib/vsprintf.c
@@ -278,96 +278,94 @@ int skip_atoi(const char **s)
return i;
}

-/* Decimal conversion is by far the most typical, and is used
- * for /proc and /sys data. This directly impacts e.g. top performance
- * with many processes running. We optimize it for speed
- * using code from
- * http://www.cs.uiowa.edu/~jones/bcd/decimal.html
- * (with permission from the author, Douglas W. Jones). */
-
-/* Formats correctly any integer in [0,99999].
- * Outputs from one to five digits depending on input.
- * On i386 gcc 4.1.2 -O2: ~250 bytes of code. */
+/*
+ * Decimal conversion is by far the most typical, and is used for
+ * /proc and /sys data. This directly impacts e.g. top performance
+ * with many processes running.
+ *
+ * We optimize it for speed using ideas described at
+ * <http://www.cs.uiowa.edu/~jones/bcd/divide.html>.
+ *
+ * '(num * 0xcccd) >> 19' is an approximation of 'num / 10' that gives
+ * correct results for num < 81920. Because of this, we check at the
+ * beginning if we are dealing with a number that may cause trouble
+ * and if so, we make it smaller.
+ *
+ * (As a minor note, all operands are always 16 bit so this function
+ * should work well on hardware that cannot multiply 32 bit numbers).
+ *
+ * (Previously, code based on
+ * <http://www.cs.uiowa.edu/~jones/bcd/decimal.html> was used here,
+ * with permission from the author, Douglas W. Jones.)
+ *
+ * Other, possible ways to approx. divide by 10
+ * (x * 0xcd) >> 11: 11001101 - shorter code than * 0x67 (on i386)
+ * (x * 0x67) >> 10: 1100111
+ * (x * 0x34) >> 9: 110100 - same
+ * (x * 0x1a) >> 8: 11010 - same
+ * (x * 0x0d) >> 7: 1101 - same, shortest code (on i386)
+ */
static noinline_for_stack
-char *put_dec_trunc(char *buf, unsigned q)
+char *put_dec_full(char *buf, unsigned q)
{
- unsigned d3, d2, d1, d0;
- d1 = (q>>4) & 0xf;
- d2 = (q>>8) & 0xf;
- d3 = (q>>12);
-
- d0 = 6*(d3 + d2 + d1) + (q & 0xf);
- q = (d0 * 0xcd) >> 11;
- d0 = d0 - 10*q;
- *buf++ = d0 + '0'; /* least significant digit */
- d1 = q + 9*d3 + 5*d2 + d1;
- if (d1 != 0) {
- q = (d1 * 0xcd) >> 11;
- d1 = d1 - 10*q;
- *buf++ = d1 + '0'; /* next digit */
-
- d2 = q + 2*d2;
- if ((d2 != 0) || (d3 != 0)) {
- q = (d2 * 0xd) >> 7;
- d2 = d2 - 10*q;
- *buf++ = d2 + '0'; /* next digit */
-
- d3 = q + 4*d3;
- if (d3 != 0) {
- q = (d3 * 0xcd) >> 11;
- d3 = d3 - 10*q;
- *buf++ = d3 + '0'; /* next digit */
- if (q != 0)
- *buf++ = q + '0'; /* most sign. digit */
- }
- }
+	unsigned r;
+	char a = '0';
+
+	if (q > 0xffff) {
+		a = '6';
+		q -= 60000;
 	}

+	r = (q * 0xcccd) >> 19;
+	*buf++ = (q - 10 * r) + '0';
+
+	q = (r * 0x199a) >> 16;
+	*buf++ = (r - 10 * q) + '0';
+
+	r = (q * 0xcd) >> 11;
+	*buf++ = (q - 10 * r) + '0';
+
+	q = (r * 0xd) >> 7;
+	*buf++ = (r - 10 * q) + '0';
+
+	*buf++ = q + a;
+
 	return buf;
 }
-/* Same with if's removed. Always emits five digits */
+
+/* Same as above but do not pad with zeros. */
static noinline_for_stack
-char *put_dec_full(char *buf, unsigned q)
+char *put_dec_trunc(char *buf, unsigned q)
{
- /* BTW, if q is in [0,9999], 8-bit ints will be enough, */
- /* but anyway, gcc produces better code with full-sized ints */
- unsigned d3, d2, d1, d0;
- d1 = (q>>4) & 0xf;
- d2 = (q>>8) & 0xf;
- d3 = (q>>12);
+ unsigned r;

/*
- * Possible ways to approx. divide by 10
- * gcc -O2 replaces multiply with shifts and adds
- * (x * 0xcd) >> 11: 11001101 - shorter code than * 0x67 (on i386)
- * (x * 0x67) >> 10: 1100111
- * (x * 0x34) >> 9: 110100 - same
- * (x * 0x1a) >> 8: 11010 - same
- * (x * 0x0d) >> 7: 1101 - same, shortest code (on i386)
+ * We need to check if num is < 81920 so we might as well
+ * check if we can just call the _full version of this
+ * function.
*/
- d0 = 6*(d3 + d2 + d1) + (q & 0xf);
- q = (d0 * 0xcd) >> 11;
- d0 = d0 - 10*q;
- *buf++ = d0 + '0';
- d1 = q + 9*d3 + 5*d2 + d1;
- q = (d1 * 0xcd) >> 11;
- d1 = d1 - 10*q;
- *buf++ = d1 + '0';
-
- d2 = q + 2*d2;
- q = (d2 * 0xd) >> 7;
- d2 = d2 - 10*q;
- *buf++ = d2 + '0';
-
- d3 = q + 4*d3;
- q = (d3 * 0xcd) >> 11; /* - shorter code */
- /* q = (d3 * 0x67) >> 10; - would also work */
- d3 = d3 - 10*q;
- *buf++ = d3 + '0';
- *buf++ = q + '0';
+	if (q > 9999)
+		return put_dec_full(buf, q);
+
+	r = (q * 0xcccd) >> 19;
+	*buf++ = (q - 10 * r) + '0';
+
+	if (r) {
+		q = (r * 0x199a) >> 16;
+		*buf++ = (r - 10 * q) + '0';
+
+		if (q) {
+			r = (q * 0xcd) >> 11;
+			*buf++ = (q - 10 * r) + '0';
+
+			if (r)
+				*buf++ = r + '0';
+		}
+	}

 	return buf;
 }
+
/* No inlining helps gcc to use registers better */
static noinline_for_stack
char *put_dec(char *buf, unsigned long long num)
--
1.7.1
