From: tomstdenis@gmail.com on
[quick blurb]. TomsFastMath is my public domain port of LibTomMath
specifically for fast RSA/DH/ECC math. It has been ported to x86_32,
x86_64, SSE2 and ARMv4 in previous releases. URL:
http://libtomcrypt.org/tfm/index.html

Today [just got back from France on Saturday ... hehehe] I've sped up
my Squaring code by removing redundant "doubling" steps ... Essentially
I was doing n^2 doublings when only n doublings were required.
Doubling isn't hard [three adds] but it adds up [excuse the pun] and
we're about speed here ;-)

Some numbers on two Athlons [64 and XP] in cycles per operation [groups
of 10000 operations take the min time]

OLD 64-bit...

Squaring:
256-bit: 89
512-bit: 234
1024-bit: 815
2048-bit: 2851

NEW 64-bit ...

Squaring:
256-bit: 89
512-bit: 228
1024-bit: 691
2048-bit: 2228

OLD 32-bit

Squaring:

256-bit: 327
512-bit: 1044
1024-bit: 3646
2048-bit: 17055

NEW 32-bit

Squaring:

256-bit: 332
512-bit: 894
1024-bit: 2983
2048-bit: 15197

Amongst other things I want to speed up the generic and Karat code...
but generally the 32/64-bit performance is good (1024-bit squaring for
instance is used in CRT based RSA-2048).

In particular I may add a 64x routine since that would speed up
2048-bit mul/sqr operations BIG TIME. The code expansion is huge but
new cpus have huge caches for a reason [and TomsFastMath is tunable].

On the AMD64 I was *already* faster than OpenSSL by [iirc] around 38%
in exptmod timings. Since squaring is the second most dominant step in
exptmod (you do n-squarings for a n-bit exptmod) this will speed it up.
on my AMD64 I just measured

NEW
Exptmod:
512-bit: 616,647 [~40% faster than OpenSSL]
1024-bit: 3,075,211
2048-bit: 19,613,160

Whereas with TFM 0.02 I got
Exptmod:
512: 641,743
1024: 3,167,406
2048: 20,158,609

I can likely get better when I tune the generic routines a bit more.
Basically if you move outside of the nice powers of two the comba
routines slow down significantly [e.g. 320-bit squaring is much slower
than a 256-bit squaring and not much faster than a 512-bit squaring if
at all...]

What I'm looking for at the moment is a shell account on a P4 running a
recent distro of linux + GCC [preferably GCC 3.4.3 ;-)]. I can test
the code on my AMD64 with SSE2 but it's obviously not the same as
testing on a real P4 [for speed I mean]. Similarly if anyone has
access to an ARM box I can use that would be great. Otherwise I'll try
setting up a test on my gameboy again [which is annoying cuz I have to
go into windows...]

Right now I have to write and test the SSE2 and ARMv4 ports of the new
squaring code. Then I'll clean up the code a bit and release TFM 0.03
;-)

Tom

From: Jean-Luc Cooke on
tomstdenis(a)gmail.com <tomstdenis(a)gmail.com> wrote:
> [quick blurb]. TomsFastMath is my public domain port of LibTomMath
> specifically for fast RSA/DH/ECC math. It has been ported to x86_32,
> x86_64, SSE2 and ARMv4 in previous releases. URL:
> http://libtomcrypt.org/tfm/index.html

> Today [just got back from France on Saturday ... hehehe] I've sped up
> my Squaring code by removing redundant "doubling" steps ... Essentially
> I was doing n^2 doublings when only n doublings were required.
> Doubling isn't hard [three adds] but it adds up [excuse the pun] and
> we're about speed here ;-)

Shouldn't it be only 2 adds? Or a shift? Or am I thinking too hard
about this?

> Some numbers on two Athlons [64 and XP] in cycles per operation [groups
> of 10000 operations take the min time]

> OLD 64-bit...

> Squaring:
> 256-bit: 89
> 512-bit: 234
> 1024-bit: 815
> 2048-bit: 2851

> NEW 64-bit ...

> Squaring:
> 256-bit: 89
> 512-bit: 228
> 1024-bit: 691
> 2048-bit: 2228

> OLD 32-bit

> Squaring:

> 256-bit: 327
> 512-bit: 1044
> 1024-bit: 3646
> 2048-bit: 17055

> NEW 32-bit

> Squaring:

> 256-bit: 332
> 512-bit: 894
> 1024-bit: 2983
> 2048-bit: 15197

> Amongst other things I want to speed up the generic and Karat code...
> but generally the 32/64-bit performance is good (1024-bit squaring for
> instance is used in CRT based RSA-2048).

> In particular I may add a 64x routine since that would speed up
> 2048-bit mul/sqr operations BIG TIME. The code expansion is huge but
> new cpus have huge caches for a reason [and TomsFastMath is tunable].

> On the AMD64 I was *already* faster than OpenSSL by [iirc] around 38%
> in exptmod timings. Since squaring is the second most dominant step in
> exptmod (you do n-squarings for a n-bit exptmod) this will speed it up.
> on my AMD64 I just measured

> NEW
> Exptmod:
> 512-bit: 616,647 [~40% faster than OpenSSL]
> 1024-bit: 3,075,211
> 2048-bit: 19,613,160

> Whereas with TFM 0.02 I got
> Exptmod:
> 512: 641,743
> 1024: 3,167,406
> 2048: 20,158,609

> I can likely get better when I tune the generic routines a bit more.
> Basically if you move outside of the nice powers of two the comba
> routines slow down significantly [e.g. 320-bit squaring is much slower
> than a 256-bit squaring and not much faster than a 512-bit squaring if
> at all...]

> What I'm looking for at the moment is a shell account on a P4 running a
> recent distro of linux + GCC [preferably GCC 3.4.3 ;-)]. I can test
> the code on my AMD64 with SSE2 but it's obviously not the same as
> testing on a real P4 [for speed I mean]. Similarly if anyone has
> access to an ARM box I can use that would be great. Otherwise I'll try
> setting up a test on my gameboy again [which is annoying cuz I have to
> go into windows...]

> Right now I have to write and test the SSE2 and ARMv4 ports of the new
> squaring code. Then I'll clean up the code a bit and release TFM 0.03
> ;-)

> Tom


--
From: tomstdenis@gmail.com on

Jean-Luc Cooke wrote:
> tomstdenis(a)gmail.com <tomstdenis(a)gmail.com> wrote:
> > [quick blurb]. TomsFastMath is my public domain port of LibTomMath
> > specifically for fast RSA/DH/ECC math. It has been ported to
x86_32,
> > x86_64, SSE2 and ARMv4 in previous releases. URL:
> > http://libtomcrypt.org/tfm/index.html
>
> > Today [just got back from France on Saturday ... hehehe] I've sped
up
> > my Squaring code by removing redundant "doubling" steps ...
Essentially
> > I was doing n^2 doublings when only n doublings were required.
> > Doubling isn't hard [three adds] but it adds up [excuse the pun]
and
> > we're about speed here ;-)
>
> Shouldn't it be only 2 adds? Or a shift? Or am I thinking too hard
> about this?

It's a three-word carry that acts as an open ended shift register
throughout the product.

In LTM i do two-word ops but I also don't fill the variable [e.g.
28-bit digits]. In TFM I use asm inlines so I can easily use carries
and hence three word carries.

Tom

From: Kevin Drapel on
< Otherwise I'll try
> setting up a test on my gameboy again [which is annoying cuz I have to
> go into windows...]
>

GBA or old Gameboy ? There are flash tools for the GBA on Linux :

http://www.gameboy-advance.net/fal_soft/gba_flash_2_advance_linux.htm
From: tomstdenis@gmail.com on

Kevin Drapel wrote:
> < Otherwise I'll try
> > setting up a test on my gameboy again [which is annoying cuz I have
to
> > go into windows...]
> >
>
> GBA or old Gameboy ? There are flash tools for the GBA on Linux :
>
> http://www.gameboy-advance.net/fal_soft/gba_flash_2_advance_linux.htm

I found most linux tools don't work well out of the box [mostly the
crt/libc isn't built for the memory model of an actual GBA].

So I use an older win32 build of devkitadv for my GBA dev.

Tom