From: Denys Vlasenko on
On Friday 06 August 2010 00:38, Michal Nazarewicz wrote:
> The put_dec_trunc() and put_dec_full() functions were based on
> a code optimised for processors with 8-bit ALU but even then
> they failed to satisfy the same constraints

"Failed"? Interesting wording. Yes, the code won't map easily
onto 8-bit ALU, for the simple reason Linux kernel
does not support any 8-bit CPUs, and by going to wider register
I was able to process 5 decimal digits at once, not 4.
It was done deliberately. It is not a "failure".

Your code isn't 8-bit ALU optimized either.

Do you think a bit of smear of previous code
would help your to be accepted?

> and in fact
> required at least 16-bit ALU (because at least one number they
> operate in can take 9 bits).

Yes, as explained above.

> This version of those functions proposed by this patch goes
> further and uses the full capacity of a 32-bit ALU and instead
> of splitting the number into nibbles and operating on them it
> performs the obvious algorithm for base conversion expect it
> uses optimised code for dividing by ten (ie. no division is
> actually performed).

(1) "expect" is a typo
(2) No, _this_ patch does not eliminate division. Next one does.
Move this part of changelong to the next patch, where it belongs.


> + * Decimal conversion is by far the most typical, and is used for
> + * /proc and /sys data. This directly impacts e.g. top performance
> + * with many processes running.
> + *
> + * We optimize it for speed using ideas described at
> + * <http://www.cs.uiowa.edu/~jones/bcd/divide.html>.

Do you have author's permission to do it?
Document it in the comment please.


> + * '(num * 0xcccd) >> 19' is an approximation of 'num / 10' that gives
> + * correct results for num < 81920. Because of this, we check at the
> + * beginning if we are dealing with a number that may cause trouble
> + * and if so, we make it smaller.

This comment needs to be moved to the code line where the opration
is performed.


> + * (As a minor note, all operands are always 16 bit so this function
> + * should work well on hardware that cannot multiply 32 bit numbers).
> + *
> + * (Previous a code based on

English is a bit broken in the line above.


--
vda
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Michal Nazarewicz on
Denys Vlasenko <vda.linux(a)googlemail.com> writes:

> On Friday 06 August 2010 00:38, Michal Nazarewicz wrote:
>> The put_dec_trunc() and put_dec_full() functions were based on
>> a code optimised for processors with 8-bit ALU but even then
>> they failed to satisfy the same constraints
>
> "Failed"? Interesting wording. Yes, the code won't map easily
> onto 8-bit ALU, for the simple reason Linux kernel
> does not support any 8-bit CPUs, and by going to wider register
> I was able to process 5 decimal digits at once, not 4.
> It was done deliberately. It is not a "failure".

You're probably right. ;) I'll change the wording in the next patch.

> Your code isn't 8-bit ALU optimized either.

Yep, that was not my intention.

> Do you think a bit of smear of previous code
> would help your to be accepted?

I'm sorry if I sounded offencive, I just wanted to point out that we
don't have to and we in fact do not limit ourselves to 8-bit ALUs.

>> and in fact
>> required at least 16-bit ALU (because at least one number they
>> operate in can take 9 bits).
>
> Yes, as explained above.
>
>> This version of those functions proposed by this patch goes
>> further and uses the full capacity of a 32-bit ALU and instead
>> of splitting the number into nibbles and operating on them it
>> performs the obvious algorithm for base conversion expect it
>> uses optimised code for dividing by ten (ie. no division is
>> actually performed).
>
> (1) "expect" is a typo

Fixed.

> (2) No, _this_ patch does not eliminate division. Next one does.
> Move this part of changelong to the next patch, where it belongs.

Probably right, the part in parens can be misleading. Will fix.

>> + * Decimal conversion is by far the most typical, and is used for
>> + * /proc and /sys data. This directly impacts e.g. top performance
>> + * with many processes running.
>> + *
>> + * We optimize it for speed using ideas described at
>> + * <http://www.cs.uiowa.edu/~jones/bcd/divide.html>.
>
> Do you have author's permission to do it?
> Document it in the comment please.

OK, I'll send an email to Mr. Jones directly (unless he's actually
reading it, in which case, could you please give permission to use the
code?) In case of this patch I assumed the right to use because: (i) we
had permission for the previous code and (ii) I've used only one line of
code that is from the "Binary to Decimal Conversion in Limited
Precision" website (namely division). In case of the next patch, only
(i) applies.

>> + * '(num * 0xcccd) >> 19' is an approximation of 'num / 10' that gives
>> + * correct results for num < 81920. Because of this, we check at the
>> + * beginning if we are dealing with a number that may cause trouble
>> + * and if so, we make it smaller.
>
> This comment needs to be moved to the code line where the opration
> is performed.

Fixed.

>> + * (As a minor note, all operands are always 16 bit so this function
>> + * should work well on hardware that cannot multiply 32 bit numbers).
>> + *
>> + * (Previous a code based on
>
> English is a bit broken in the line above.

Fixed.

--
Best regards, _ _
.o. | Liege of Serenly Enlightened Majesty of o' \,=./ `o
..o | Computer Science, Michal "mina86" Nazarewicz (o o)
ooo +--<mina86-tlen.pl>--<jid:mina86-jabber.org>--ooO--(_)--Ooo--
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Denys Vlasenko on
On Sunday 08 August 2010 21:29, Michal Nazarewicz wrote:
> -- Speed --------------------------------------------------------
> orig_put_dec_full : 1.054570 1.356214 1.732636 1.725760 Original
> mod1_put_dec_full : 1.000000 1.017216 1.255518 1.116559
> mod3_put_dec_full : 1.018222 1.000000 1.000000 1.000000 Proposed
>
> orig_put_dec_trunc : 1.137903 1.216017 1.850478 1.662370 Original
> mod1_put_dec_trunc : 1.000000 1.078154 1.355635 1.400637
> mod3_put_dec_trunc : 1.025989 1.000000 1.000000 1.000000 Proposed
> -- Size ---------------------------------------------------------
> orig_put_dec_full : 1.212766 1.310345 1.355372 1.355372 Original
> mod1_put_dec_full : 1.021277 1.000000 1.000000 1.000000
> mod3_put_dec_full : 1.000000 1.172414 1.049587 1.049587 Proposed
>
> orig_put_dec_trunc : 1.363636 1.317365 1.784000 1.784000 Original
> mod1_put_dec_trunc : 1.181818 1.275449 1.400000 1.400000
> mod3_put_dec_trunc : 1.000000 1.000000 1.000000 1.000000 Proposed

In my testing on Phenom II the speed gain is smaller,
but it is indeed faster. And smaller!


> + /*
> + * '(x * 0xcccd) >> 19' is an approximation of 'x / 10' that
> + * gives correct results for all x < 81920. However, because
> + * intermediate result can be at most 32-bit we limit x to be
> + * 16-bit.
> + *
> + * Because of those, we check if we are dealing with a "big"
> + * number and if so, we make it smaller remembering to add to
> + * the most significant digit.
> + */
> + if (q >= 50000) {
> + a = '5';
> + q -= 50000;
....
> + /*
> + * We need to check if q is < 65536 so we might as well check

You meant "need to check if q is < 81920"?

> + * if we can just call the _full version of this function.
> + */
> + if (q > 9999)
> + return put_dec_full(buf, q);

--
vda
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/