Generic Crypto APIs ? [Cryptography]

Prev: Does entropy ever fall out of a good hash function?
Next: Introducing dynamics into block encryptions

From: Boon on 1 Mar 2010 10:43

Carsten Krueger wrote:

> If it's AES, compare it with Diskcryptor 0.9 (beta)
> It's the fastest AES version I'm aware of.

http://diskcryptor.net/index.php/DiskCryptor_en#Performance

104 MB/s @ 2.4 GHz = 23 cycles per byte

Intel's AES-specific instructions (AES-NI) are 5-10 times faster.
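
For readers who want to check the arithmetic, here is a minimal sketch of the conversion from throughput to cycles per byte. It assumes "MB" means 10^6 bytes; the function name is only illustrative, and the last two lines simply apply the claimed 5-10x factor to the same figure.

```python
def cycles_per_byte(clock_hz: float, bytes_per_second: float) -> float:
    """Clock cycles spent per byte processed."""
    return clock_hz / bytes_per_second

# Figures quoted above; MB is assumed to mean 10**6 bytes.
cpb = cycles_per_byte(2.4e9, 104e6)
print(f"DiskCryptor: {cpb:.1f} cycles/byte")      # 23.1

# What the claimed 5-10x AES-NI speedup would correspond to.
for speedup in (5, 10):
    print(f"{speedup}x faster: {cpb / speedup:.1f} cycles/byte")
```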

http://en.wikipedia.org/wiki/AES_instruction_set
http://software.intel.com/en-us/articles/intel-advanced-encryption-standard-aes-instructions-set/

Regards.

From: Harold Johanssen on 1 Mar 2010 18:59

On Mon, 01 Mar 2010 16:43:00 +0100, Boon wrote:

> Carsten Krueger wrote:
>
>> If it's AES, compare it with Diskcryptor 0.9 (beta) It's the fastest
>> AES version I'm aware of.
>
> http://diskcryptor.net/index.php/DiskCryptor_en#Performance
>
> 104 MB/s @ 2.4 GHz = 23 cycles per byte
>
> Intel's AES-specific instructions (AES-NI) are 5-10 times faster.
>
> http://en.wikipedia.org/wiki/AES_instruction_set
> http://software.intel.com/en-us/articles/intel-advanced-encryption-standard-aes-instructions-set/
>
> Regards.

I have heard of an implementation on Core 2, by Kasper and
Schwabe, getting 7.59 cycles per byte. That's nowhere near 5 times
faster. Are they not using the AES-NI instructions?

From: Paul Rubin on 1 Mar 2010 20:10

Harold Johanssen <noemail(a)please.net> writes:
> I have heard of an implementation on Core 2, by Kasper and
> Schwabe, getting 7.59 cycles per byte. That's nowhere near 5 times
> faster. Are they not using the AES-NI instructions?

That might be a bit-slice implementation; I've never heard of a
conventional one that fast. AES-NI is a Clarkdale+ feature (i.e. very
recent) and Core 2 doesn't have it.

From: Harold Johanssen on 2 Mar 2010 16:27

On Tue, 02 Mar 2010 16:43:15 +0100, Carsten Krueger wrote:

> On Mon, 1 Mar 2010 23:59:40 +0000 (UTC), Harold Johanssen wrote:
>
>> I have heard of an implementation on Core 2, by Kasper and
>> Schwabe, getting 7.59 cycles per byte.
>
> AES-128 (mode ?)

Counter mode.

>
> DiskCryptor does AES-256 XTS
>
> greetings
> Carsten

From: Thomas Pornin on 3 Mar 2010 08:52

According to Paul Rubin <no.email(a)nospam.invalid>:
> That might be a bit-slice implementation

It is -- but with only eight parallel instances; i.e. it encrypts data
by blocks of 128 bytes, whereas a "naive" bitslice implementation would
use 128 parallel AES instances (since XMM registers are 128 bits long),
i.e. 2048-byte blocks. Moreover, the announced cost of 7.59 cycles per
byte includes the data orthogonalization (bitslicing requires data elements
to be interleaved in registers). See the paper here:

http://www.cryptojedi.org/papers/aesbs-20090616.pdf
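
As a rough illustration of what "data orthogonalization" means, here is a minimal sketch of a generic bit transposition. It is not the register layout used in the paper, and the function names are hypothetical. Slice j collects bit j of every block, so one bitwise operation on a slice acts on all blocks at once. The loop at the end just reproduces the batch sizes mentioned above (16-byte AES blocks).

```python
def bitslice(blocks: list[int], width: int) -> list[int]:
    """Transpose: slice j holds bit j of every block (block i -> bit i)."""
    slices = [0] * width
    for i, block in enumerate(blocks):
        for j in range(width):
            slices[j] |= ((block >> j) & 1) << i
    return slices

def unbitslice(slices: list[int], count: int) -> list[int]:
    """Inverse transposition: rebuild `count` blocks from the slices."""
    blocks = [0] * count
    for j, s in enumerate(slices):
        for i in range(count):
            blocks[i] |= ((s >> i) & 1) << j
    return blocks

# Toy example with three 4-bit "blocks"; the round trip must be lossless.
blocks = [0b1010, 0b0110, 0b1111]
slices = bitslice(blocks, 4)
assert unbitslice(slices, len(blocks)) == blocks

# Batch size in bytes for a given number of parallel AES instances.
AES_BLOCK_BYTES = 16
for instances in (8, 128):
    print(instances, "instances ->", instances * AES_BLOCK_BYTES, "bytes")
```

The transposition itself costs cycles, which is why it matters that the quoted figure already includes it.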

Note that such cycle counts are obtained in ideal benchmark conditions,
i.e. they do not fully capture the influence of caches. In practical conditions,
the encryption code is integrated within some application, along a data
processing path, and the encryption code competes with the rest of the
application for cache usage. That AES implementation uses no table for
encryption (one of the versions uses no table for key setup either), so
it uses very little L1 cache for data, and that's good (in the
article, they present it as resistance against timing attacks, but
using very little data cache is also good for practical performance). On the
other hand, the code footprint is a bit more than 12 kB (not counting
key setup), which is bearable (a typical x86 Intel has 32 kB L1 cache
for code) but may prove to be an issue in some code-cramped situations.
By comparison, a typical "normal" AES implementation will compile to
about 2 kB of code (not counting key setup) and 4 kB of constant data
(the tables).

So, while cycle counts in micro-benchmarks are important, nothing
really beats actual measurements in a real situation.

One can still plausibly predict that AES-NI instructions should rock,
because not only do they have great cycle counts (Intel promises about
1.3 cycles/byte on long runs) but they also lead to very compact
implementations (less than a hundred bytes of code, and no table). This
should also be good for PRNGs.

This is also my cue to point out that x86 hardware of the Core2-or-more
class is the least likely to have actual performance issues with
encryption. Limited hardware, such as what is found in cheap mobile
phones and in home routers and WiFi access points, is much more starved
for CPU, and is in the position of doing crypto all day long. A $40 home
router typically has a MIPS or ARM derivative with little cache (8 kB L1
cache for code on the Linksys router I have beside me), low power (200
MHz, and not super-scalar), no special AES instruction, no big registers
(no SSE2, no MMX, no FPU, only 32-bit general purpose registers) and yet
is hooked to a high-bandwidth network (54 Mbit/s WiFi, 100 Mbit/s
Ethernet). There are millions of such beasts out there.
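
To put numbers on that, here is a small sketch of the cycle budget such a router has per byte if it must keep up with the link. It assumes the nominal link rates are actually reached (real WiFi throughput is lower) and that the whole CPU is available, so the result is an upper bound that encryption has to share with everything else; the function name is only illustrative.

```python
def cycle_budget(clock_hz: float, link_bits_per_second: float) -> float:
    """Cycles available per byte if the CPU must keep up with the link."""
    return clock_hz / (link_bits_per_second / 8)

CLOCK_HZ = 200e6  # the 200 MHz router CPU mentioned above

for name, rate in (("54 Mbit/s WiFi", 54e6), ("100 Mbit/s Ethernet", 100e6)):
    print(f"{name}: {cycle_budget(CLOCK_HZ, rate):.1f} cycles/byte")
```

This works out to roughly 30 cycles/byte for WiFi and 16 cycles/byte for Ethernet, for all processing combined.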

In my view, performance on that kind of small hardware is industrially
much more significant than whatever AES-NI provides.

The same remarks can be made about SHA-3 candidates. Many provide good
performance but only with a big code footprint (e.g. 20 kB) or with the
help of special instructions (such as SSE2 or AES-NI).

--Thomas Pornin
