From: Thomas Pornin on
According to Maaartin <grajcar1(a)seznam.cz>:
> Btw., did you try a native Java compiler? There is a couple of them
> available...

I tried the one I wrote -- which translates JVM bytecode to C, and then
runs a C compiler. Performance was somewhat lower than what C code
achieves, mainly due to the array bounds checks.


> That's really bad news for Java.

That's not news, and that's not really bad either. A factor of 2.5 or 3
is a worst case, when crunching data which is all in L1 cache. Most
applications are limited by I/O (you cannot hash data faster than you
actually get it, and 100baseT network tops at about 10 MB/s -- so even
if CubeHash is not the fastest of hash functions, its performance is
still adequate in many situations). In the few applications where pure
processing speed matters, it actually matters for a very small portion
of the code only. Java strives on the 99.9% of the code where you do
not need cycle-to-cycle records.


> Cubehash is very simple, but what normal algorithm would profit from
> 32+ registers?

CubeHash is an extreme case with regards to register usage.

In the early 90's, when DEC developped the first Alpha processor, they
ran some simulations, and found out that while having 32 registers instead
of 16 was offered significant gains, going from 32 to a whooping 256
registers increased performance only marginally. Thus the Alpha got 32
registers.


> Could you provide me the cubehash.s created by something like
> gcc -S -O1 -m64 cubehash.c
> ? I'm an unfortunate winblow$ user and I can't get -m64 using cygwin.

You could try Visual C. If you have a 64-bit Windows, you can get the
"Windows SDK" (for free) and it comes with the command-line versions
of Visual C for 64-bit architectures.

I will try to produce your file.


> What do you think about the success chances of the manual instruction
> reordering?

Anything can happen. The problem of producing optimal code is hard
so compilers tend to rely on heuristics which produce "good" code
"most of the time". Those heuristics are rarely documented, so
manual optimization looks like a hit-and-miss game until you happen
to find a situation where the compiler does not do anything stupid.


--Thomas Pornin
From: Maaartin on
Ignoring the swaps, there are 6*32 operations in one round of
CubeHash, which means at least 64 cycles on a 3-way superscalar
processor. There are 16 rounds for each 32 input bytes, which makes 32
cycles per byte. On your 64-bit system, your C version runs with 39.84
cycles per byte and your Java version runs with 53.05 cycles per byte.
On my computer it was slower before I applied the trivial
optimization, now it's the same. I we managed to make the compiler do
it right, it could run with about the same speed as in C. This should
be doable, but there's the 32 cycles/byte limit, so you're quite close
already. This limit makes me much less interested in the
optimizations...

Using XMM, it should run at about 12.5 cycles/byte (not exactly 4x as
fast because of missing rotation instructions), but that's not what
you're doing. This is valid only for CPUs capable of executing 3 XMM
instructions per second, which is AFAIK not the case for AMD
processors.
First  |  Prev  | 
Pages: 1 2
Prev: RSA Proof using CRTs
Next: learning again