From: Thomas Pornin on
Hello all,

sphlib-2.1 has been released:

http://www.saphir2.com/sphlib/

sphlib is a library of implementations of hash functions, in both C and
Java. The C version includes a variant optimized for small architectures
(those with about 8 kB of L1 cache). The C code also includes a
command-line tool which can act as a drop-in replacement for the md5sum
/ sha1sum / etc tools commonly found on Linux systems. The Java code is
compatible with J2ME (the "reduced Java" for mobile phones). A flexible
HMAC implementation is provided with the Java code.
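
As a rough illustration of what a hash-agnostic HMAC looks like, here is a minimal sketch of the RFC 2104 construction on top of the JDK's java.security.MessageDigest. It does not use the sphlib classes or their API; the class name HmacSketch, the hmac() and hex() helpers, and the explicit blockLen parameter are all hypothetical names chosen for this example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Hypothetical sketch: HMAC (RFC 2104) over any JDK MessageDigest.
// This is NOT the sphlib HMAC class; names and signature are assumptions.
final class HmacSketch {

    // blockLen is the hash's internal block size in bytes
    // (64 for MD5, SHA-1 and SHA-256; 128 for SHA-384 and SHA-512).
    static byte[] hmac(String algorithm, int blockLen, byte[] key, byte[] message) {
        MessageDigest md;
        try {
            md = MessageDigest.getInstance(algorithm);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalArgumentException("unknown hash: " + algorithm, e);
        }
        // Keys longer than one block are first hashed down.
        if (key.length > blockLen) {
            key = md.digest(key);
        }
        byte[] ipad = new byte[blockLen];
        byte[] opad = new byte[blockLen];
        Arrays.fill(ipad, (byte) 0x36);
        Arrays.fill(opad, (byte) 0x5c);
        for (int i = 0; i < key.length; i++) {
            ipad[i] ^= key[i];
            opad[i] ^= key[i];
        }
        // inner = H((K ^ ipad) || message); digest() also resets md.
        md.update(ipad);
        byte[] inner = md.digest(message);
        // result = H((K ^ opad) || inner)
        md.update(opad);
        return md.digest(inner);
    }

    static String hex(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] key = "key".getBytes(StandardCharsets.US_ASCII);
        byte[] msg = "The quick brown fox jumps over the lazy dog"
                .getBytes(StandardCharsets.US_ASCII);
        System.out.println(hex(hmac("SHA-256", 64, key, msg)));
    }
}
```

Passing the block length explicitly is what keeps the sketch independent of the hash function; a library that knows its own hash implementations can of course look that value up itself.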

sphlib-2.1 includes implementations for the fourteen second-round SHA-3
candidates, as well as a bunch of pre-SHA-3 functions (including SHA-1
and SHA-2). The archive contains (in its 'doc/' subdirectory) a report
on sphlib speed, as measured on a variety of architectures.

sphlib-2.1 is open source (MIT-like license).


--Thomas Pornin
From: Maaartin on
Nice work! I managed to achieve a substantial speedup for CubeHash in
Java in a very trivial but surprising way: I extracted two methods from
the method sixteenRounds() by simply putting the first and second
halves each into a separate method. It seems that the JIT can't deal
with very large methods. I ran both versions several times on my
Phenom II X4 and always gained a factor of about 1.25.

original:
long messages -> 44.27 MBytes/s

my version:
long messages -> 56.13 MBytes/s
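
The kind of refactoring described here can be sketched as follows. This is only a toy: the round function is a small add-rotate-xor mix loosely modelled on CubeHash, not the unrolled sphlib code, and the class and method names (SplitRoundsSketch, sixteenRoundsInline, sixteenRoundsSplit, evenRound, oddRound) are made up for the example. The point is the structure (one big loop body versus the same body moved into two methods) and the check that both variants compute the same state.

```java
import java.util.Arrays;

// Hypothetical toy, NOT the sphlib CubeHash code: it only shows the shape
// of the "extract the loop body into methods" refactoring.
final class SplitRoundsSketch {
    private final int[] x = new int[32];

    SplitRoundsSketch() {
        // Arbitrary non-zero starting state (assumption, for the demo only).
        for (int i = 0; i < 32; i++) {
            x[i] = 0x9E3779B9 * (i + 1);
        }
    }

    // Variant A: an even and an odd round written out inside one loop body.
    void sixteenRoundsInline() {
        for (int r = 0; r < 8; r++) {
            for (int i = 0; i < 16; i++) x[i + 16] += x[i];
            for (int i = 0; i < 16; i++) x[i] = Integer.rotateLeft(x[i], 7) ^ x[(i ^ 8) + 16];
            for (int i = 0; i < 16; i++) x[i + 16] += x[i];
            for (int i = 0; i < 16; i++) x[i] = Integer.rotateLeft(x[i], 11) ^ x[(i ^ 4) + 16];
        }
    }

    // Variant B: the same work, with each round in its own small method.
    void sixteenRoundsSplit() {
        for (int r = 0; r < 8; r++) {
            evenRound();
            oddRound();
        }
    }

    private void evenRound() {
        for (int i = 0; i < 16; i++) x[i + 16] += x[i];
        for (int i = 0; i < 16; i++) x[i] = Integer.rotateLeft(x[i], 7) ^ x[(i ^ 8) + 16];
    }

    private void oddRound() {
        for (int i = 0; i < 16; i++) x[i + 16] += x[i];
        for (int i = 0; i < 16; i++) x[i] = Integer.rotateLeft(x[i], 11) ^ x[(i ^ 4) + 16];
    }

    int[] state() {
        return x.clone();
    }

    public static void main(String[] args) {
        SplitRoundsSketch a = new SplitRoundsSketch();
        SplitRoundsSketch b = new SplitRoundsSketch();
        a.sixteenRoundsInline();
        b.sixteenRoundsSplit();
        System.out.println("same state: " + Arrays.equals(a.state(), b.state()));
    }
}
```

Checking that both variants end in the same state before timing them is cheap insurance: a speedup obtained from a refactoring that silently changed the result would be meaningless.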
From: Maaartin on
I was a bit imprecise in my last post. Of course I didn't split the
whole method sixteenRounds(), only the loop body (so I created one
method for a single even round and one for a single odd round). Then
I continued the method extraction and this time got a big slowdown. By
splitting the loop body into four parts (each corresponding to half
a round), I got

long messages -> 35.65 MBytes/s

Could somebody confirm it? This behavior of Java is suitable both for
a bug report and for the DailyWTF.

In case it matters, I'm using
AMD Phenom(tm) II X4 920 Processor, 2800 MHz
Win Professional XP64 Version 2003, SP2
Java(TM) SE Runtime Environment (build 1.6.0_13-b03)
Java HotSpot(TM) Client VM (build 11.3-b02, mixed mode, sharing)
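
For anyone who wants to try to reproduce figures of this kind, a minimal throughput harness might look like the sketch below. It is not the sphlib speed test: it hashes with the JDK's own SHA-256 as a stand-in, the names ThroughputSketch, sha256() and measure() are invented for the example, and it assumes that 1 MByte means 10^6 bytes. The one thing it tries to get right is running a warm-up pass first, so that the JIT has compiled the hot code before anything is timed.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical benchmark sketch; SHA-256 from the JDK stands in for the
// hash under test, and 1 MByte is assumed to be 10^6 bytes.
final class ThroughputSketch {

    static MessageDigest sha256() {
        try {
            return MessageDigest.getInstance("SHA-256");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Hashes at least totalBytes by feeding buf repeatedly; returns MBytes/s.
    static double measure(MessageDigest md, byte[] buf, long totalBytes) {
        long done = 0;
        long start = System.nanoTime();
        while (done < totalBytes) {
            md.update(buf);
            done += buf.length;
        }
        md.digest(); // finish the computation (and reset md for the next run)
        long elapsed = Math.max(1L, System.nanoTime() - start);
        return (done / 1e6) / (elapsed / 1e9);
    }

    public static void main(String[] args) {
        MessageDigest md = sha256();
        byte[] buf = new byte[4096]; // small enough to stay in L1 cache

        // Warm-up: give the JIT a chance to compile the hot loop.
        measure(md, buf, 20000000L);

        // Several timed runs, since single measurements are noisy.
        for (int run = 0; run < 5; run++) {
            System.out.printf("long messages -> %.2f MBytes/s%n",
                    measure(md, buf, 50000000L));
        }
    }
}
```

Several runs are printed rather than one average, because a JIT-related effect tends to show up as a consistent gap across runs, whereas noise does not.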
From: Maaartin on
Now I extracted the whole loop body, so I have

    private final void sixteenRounds() {
        for (int i = 0; i < 8; i++)
            p();
    }

and again I get

long messages -> 55.78 MBytes/s

Maybe this extraction prevented the variable i from needlessly
occupying a register, and this additionally available register allowed
for much better pipeline utilization.

I think some optimizations are possible by manually reordering the
instructions, which could be useful for C as well. I haven't had a
look at the code produced by the compiler, but I'd guess that it
doesn't do the reordering well enough. As you wrote, there's a lot of
parallelism available. I would add that, after the reordering, many
operations can be done on a subset of the registers before that subset
has to be spilled to memory. But I haven't tried this yet.
From: Thomas Pornin on
According to Maaartin <grajcar1(a)seznam.cz>:
> In case it matters, I'm using
> AMD Phenom(tm) II X4 920 Processor, 2800 MHz
> Win Professional XP64 Version 2003, SP2
> Java(TM) SE Runtime Environment (build 1.6.0_13-b03)
> Java HotSpot(TM) Client VM (build 11.3-b02, mixed mode, sharing)

It would be interesting to bench the C code, too. "Morally", the
Java version cannot be faster than the C code, but CubeHash is one
of the functions where Java is closest to the C performance: on
my Intel Q6600 (2.4 GHz), in 64-bit mode, I get 60 MB/s with the C
code, and 45 MB/s with the Java implementation. That Java achieves
75% of the C speed is quite rare; for most functions (hash functions
and other computation-heavy codes I have tried), the speed of Java
is more typically between 30% and 50% of the speed achieved with
optimized C code.

(Note that I am talking here about computations occurring entirely in
the L1 cache; in "normal" code, memory, network or disk bandwidth
dominates, and Java fares as well as C or just any other language.)

Anyway, your Phenom should run faster than my Q6600 and the 44 MB/s you
get with my code is a bit too low. This looks like a misoptimization
from the JVM.

By the way, you may want to update your JVM. The current version from
Sun (Oracle) appears to be 1.6.0_20, and it fixes some bugs; it may also
include code generation improvements. Version 1.6.0_14 introduced
"extensive performance updates to the HotSpot JIT compiler" (or at least
so says Wikipedia). If you want to submit a bug report to Sun, the first
thing they will ask is whether you use the latest version (that is, if
they respond at all).
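
A quick way to see which JVM a benchmark is actually running on is to print the standard system properties from inside the program, since the "java" found on the PATH is not always the one you expect. A minimal sketch (the class name JvmInfoSketch and the describe() helper are made up for this example; the property keys themselves are standard):

```java
// Hypothetical helper: reports the running JVM and platform using
// standard system property keys.
final class JvmInfoSketch {

    static String describe() {
        String[] keys = {
            "java.version", "java.vm.name", "java.vm.version",
            "os.name", "os.arch"
        };
        StringBuilder sb = new StringBuilder();
        for (String key : keys) {
            sb.append(key).append('=')
              .append(System.getProperty(key))
              .append(System.lineSeparator());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(describe());
    }
}
```

Pasting this output next to the timings makes it easier for others to tell whether a difference comes from the code or from the JVM build.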


--Thomas Pornin