From: Wolfgang Kern on

Hi Guga,


> "SHL al,4 ;mul by 16 (entry size) and get rid of 030"
>
> I suppose that using:
> shl al 4
> shr al 4
>
> is faster than
>
> sub al 030
>
> right?

Not at all. It is just a coincidence that we can skip the
sub al 030 (AND al 0F would do it as well)
when we shift the four upper bits into nowhere :)
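A small C sketch (C rather than RosAsm, so it can be compiled and checked anywhere) of why the subtraction disappears: for ASCII digits 0x30..0x39 the upper nibble is always 3, so an 8-bit left shift by 4 discards it. The function names are hypothetical, and the sketch only shows that the three results agree; it says nothing about speed.

```c
#include <assert.h>
#include <stdint.h>

/* For an ASCII digit '0'..'9' (0x30..0x39) the upper nibble is always 3.
   Shifting left by 4 inside an 8-bit register (like SHL AL,4) pushes that
   nibble out, leaving digit*16 without a separate SUB AL,030. */
static uint8_t digit_times16_shl(uint8_t c) {
    assert(c >= '0' && c <= '9');   /* the trick only holds for ASCII digits */
    return (uint8_t)(c << 4);       /* cast truncates to 8 bits, like AL */
}

/* The explicit alternatives: strip the 0x30 first, then multiply by 16. */
static uint8_t digit_times16_sub(uint8_t c) { return (uint8_t)((c - 0x30) << 4); } /* SUB AL,030 */
static uint8_t digit_times16_and(uint8_t c) { return (uint8_t)((c & 0x0F) << 4); } /* AND AL,0F  */
```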

__
wolfgang


From: Wolfgang Kern on

Wannabee wrote:


>> But... how do I multiply a value by 10 without using MUL?
>>
>> Is there a way to use LEA to multiply a number by 10? The use of
>> LEA is only to speed things up a little.

> lea ecx D$ecx*4+ecx
> lea ecx D$ecx*2

Yes, this leaves all flags untouched,
but Intel CPUs use the slow shift mechanism for SIB scaling.

I'd replace the *2 with LEA ecx D$ecx+ecx,
but you won't see much of a difference anyway
due to address (LEA) and register (dependency) stall penalties.
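The two-LEA sequence computes x*5 and then doubles it. A minimal C sketch of the same arithmetic, assuming 32-bit unsigned wraparound like ECX; `mul10` and `append_digit` are hypothetical names, and neither flag behaviour nor timing is modelled.

```c
#include <assert.h>
#include <stdint.h>

/* x*10 without a multiply, in the two steps of the LEA pair:
   t = x*4 + x   (LEA ecx D$ecx*4+ecx)
   t = t + t     (LEA ecx D$ecx+ecx, the suggested replacement for *2)
   uint32_t arithmetic wraps modulo 2^32, as the 32-bit register does. */
static uint32_t mul10(uint32_t x) {
    uint32_t t = (x << 2) + x;  /* x*5 */
    return t + t;               /* x*10 */
}

/* Hypothetical helper: append one ASCII digit to a running value,
   acc*10 + digit, the inner step of a decimal-string conversion. */
static uint32_t append_digit(uint32_t acc, char c) {
    assert(c >= '0' && c <= '9');
    return mul10(acc) + (uint32_t)(c & 0x0F);
}
```

Chaining `append_digit` over "123" starting from 0 yields 123.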

__
wolfgang



From: /o//annabee on
On Thu, 29 Mar 2007 16:09:38 +0200, Wolfgang Kern <nowhere(a)never.at> wrote:

>
> Wannabee wrote:
>
>
>>> But... how do I multiply a value by 10 without using MUL?
>>>
>>> Is there a way to use LEA to multiply a number by 10? The use of
>>> LEA is only to speed things up a little.
>
>> lea ecx D$ecx*4+ecx
>> lea ecx D$ecx*2
>
> Yes, this keeps all flags alive,

I understood that.

> But Intel-CPUs use the slow shift mechanism for SIB-factorising.

OK, I understood (slow) and (Intel).

> I'd replace the *2 with LEA ecx D$ecx+ecx,
> but you won't see much of a difference anyway
> due to address (LEA) and register (dependency) stall penalties.

I saw you used "add ecx ecx" for it above.
It did not occur to me that there
would be a difference between

LEA ecx D$ecx+ecx
and
lea ecx D$ecx*2

but I see they end up using different
parts of the CPU logic, controller, gates? (words I never use much)

I would have guessed they ended up the same internally,
since I thought multiplication was always
performed via addition.


From: Wolfgang Kern on

Wannabee wrote:
....
>> But Intel-CPUs use the slow shift mechanism for SIB-factorising.
> OK, I understood (slow) and (Intel).

AMD's got faster shift hardware.

> > I'd replace the *2 with LEA ecx D$ecx+ecx,
> > but you won't see much of a difference anyway
> > due to address (LEA) and register (dependency) stall penalties.
>
> I saw you used "add ecx ecx" for it above.
> It did not occur to me that there
> would be a difference between
>
> LEA ecx D$ecx+ecx
> and
> lea ecx D$ecx*2
>
> but I see they end up using different
> parts of the CPU logic, controller, gates? (words I never use much)

These CPU-internal job queues can work almost in parallel if they
can use different 'PIPES' (without dependencies, of course).

> I would have guessed they ended up the same internally,
> since I thought multiplication was always
> performed via addition.

Yes, integer MUL works with shift-and-add, but SIB scaling (*2^0..3) uses just a shift.
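To make the distinction concrete, here is a C sketch (hypothetical function names) contrasting a general shift-and-add multiply with SIB scaling, which only ever needs one shift by 0..3 bit positions. The shift-and-add loop is a textbook model; hardware multipliers add the shifted partial products in parallel rather than one per step.

```c
#include <assert.h>
#include <stdint.h>

/* Shift-and-add multiplication: for every set bit i of b, add a << i. */
static uint32_t mul_shift_add(uint32_t a, uint32_t b) {
    uint32_t result = 0;
    while (b != 0) {
        if (b & 1u)
            result += a;   /* add the current shifted multiplicand */
        a <<= 1;           /* move a to the next power of two */
        b >>= 1;           /* examine the next bit of b */
    }
    return result;
}

/* SIB scaling only multiplies the index by 1, 2, 4 or 8,
   so a single shift by the 2-bit scale field (0..3) is enough. */
static uint32_t sib_address(uint32_t base, uint32_t index, unsigned scale) {
    assert(scale <= 3);
    return base + (index << scale);
}
```

For example, `sib_address(x, x, 2)` is the C counterpart of `D$ecx*4+ecx` with ecx = x.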

__
wolfgang



From: Wolfgang Kern on

Hello Guga,

< For example, if I have a decimal string like:
< Decimal:
< 48148617815154186478618618618258218 ....
< 687878489746514564564897848745174861831717821729

< The correct values i found are:
< [Value:
< Value.Conv32Bit: D$ 0A9549D21
< Value.Conv64Bit: D$ 056372845
< Value.Conv96Bit: D$ 0334CD7E6
< Value.Conv128Bit: D$ 0F2D05D75]

???

A 128-bit binary value is limited to:

integer(log10(2) * MSbitNr) decimal digits

0.301029 * 128 = 38 decimal digits

2^128 = 3.4028236692093846346337460743177... e+38

So you should limit the input to 38 digits (all of them are < 10^38 < 2^128),
or at most 39 digits whose value is still < 2^128.
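A sketch of such a bounded conversion in C, assuming the result is kept as four 32-bit dwords, low dword first (similar in spirit to Guga's four Conv fields, though their exact layout is an assumption here). Each digit performs value*10 + digit across all limbs, and a carry out of the top limb signals that the value is >= 2^128. `dec_to_u128` and its return codes are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LIMBS 4   /* 4 x 32 bits = 128 bits, low dword first */

/* Convert an ASCII decimal string to a 128-bit value held in four dwords.
   Each digit does value = value*10 + digit across all limbs, carrying
   the high half of every 64-bit partial product into the next limb.
   Returns 0 on success, -1 for a non-digit, -2 if the value is >= 2^128. */
static int dec_to_u128(const char *s, uint32_t limbs[LIMBS]) {
    assert(s != NULL && limbs != NULL);
    for (size_t i = 0; i < LIMBS; i++)
        limbs[i] = 0;
    for (; *s != '\0'; s++) {
        if (*s < '0' || *s > '9')
            return -1;
        uint64_t carry = (uint64_t)(*s - '0');
        for (size_t i = 0; i < LIMBS; i++) {
            uint64_t t = (uint64_t)limbs[i] * 10u + carry;
            limbs[i] = (uint32_t)t;   /* low 32 bits stay in this limb */
            carry = t >> 32;          /* high bits (at most 9) move up one limb */
        }
        if (carry != 0)
            return -2;                /* result no longer fits in 128 bits */
    }
    return 0;
}
```

With this check, "340282366920938463463374607431768211455" (2^128 - 1) is accepted and one more is rejected, so no separate digit count is needed.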

I didn't count how many digits you posted here,
but for this ASCII number you might need a 'very large' result buffer.

For example, you need a 1024-bit result for 308 decimal digits.
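The digit rule above can be computed without floating point. A C sketch, assuming log10(2) is approximated by the integer ratio 3010299956 / 10^10 (accurate enough for bit counts up to 65536, the range asserted below); `digits_for_bits` is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/* Number of decimal digits that always fit in an n-bit unsigned integer:
   floor(n * log10(2)), with log10(2) = 0.30102999566... approximated by
   3010299956 / 10^10 and evaluated in 64-bit integer arithmetic. */
static unsigned digits_for_bits(unsigned nbits) {
    assert(nbits <= 65536u);   /* range in which the approximation is safe */
    return (unsigned)(((uint64_t)nbits * 3010299956ull) / 10000000000ull);
}
```

For example, `digits_for_bits(128)` gives 38 and `digits_for_bits(1024)` gives 308, matching the figures above.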

__
wolfgang


