From: Maaartin on
On Jun 18, 12:32 am, Tran Ngoc Duong <tranngocdu...(a)gmail.com> wrote:
> On Jun 18, 02:27, Maaartin <grajc...(a)seznam.cz> wrote:
> I'm not sure if I understand it correctly. What I tried to say was
> the case of mixing a data word with a (sub-)key word in a block /
> stream cipher, which can surely be regarded as many-times-repeated
> multiplication by an in-memory constant value, and that is quite typical.

Typical? In IDEA there's a lot of multiplication but each multiplicand
gets used only once. The same for RC6.

> > Right. But multiplication is considered to be evil since on some
> > architectures it may lead to timing attacks as well.
>
> "Evil" sounds somehow too strong to me. Multiplication is simply not
> suitable for these architectures -- legacy or low-end chips, I think.
> Just like memory lookup for architectures with non-constant memory
> access time.

Nearly everything is considered evil, and the small architectures are
weighted more heavily. That's why we get ARX while the powerful
operations get ignored. I believe they could lead to much faster
algorithms, but nobody is analyzing them.

> I want to. I think it (the quadratic polynomial) is better than 2*u*z
> + u + z, since the latter is linear in u and in z.

I thought it was a mistake, since the rest was AFAIK the same as in
IDEA. I was concerned with how you can invert the operation, but now I
think you don't need to.

However, is there any good way to invert x -> (2*x+1)*x? I know that
working from the LSB to the MSB is possible; it could probably be sped
up using some tables that process several bits at once. Iterating the
operation many times must work, too, but how many times? Any better
idea?

> > > I have never implemented it, but I guess an optimized implementation
> > > takes about 15-20 clock cycles per byte on a modern processor.
>
> > That's too slow, isn't it?
>
> Really? I don't think so. Serpent yields about 75 MB/s on my 2-core
> 2 GHz machine (under the latest version of TrueCrypt). That is 2*2E9 /
> (75*1024*1024) = 51 clock cycles / byte.

Agreed; with AES in TrueCrypt I get 450 MB/s on my 4-core AMD Phenom
II at 2.8 GHz, which means 40.1 cycles / byte, but there's some
overhead for the disk encryption. The fastest AES implementation
possible on my CPU should take only about 10 cycles / byte.
From: Tran Ngoc Duong on
On Jun 19, 2:04 am, Maaartin <grajc...(a)seznam.cz> wrote:
> On Jun 18, 12:32 am, Tran Ngoc Duong <tranngocdu...(a)gmail.com> wrote:
>
> > On Jun 18, 02:27, Maaartin <grajc...(a)seznam.cz> wrote:
> > I'm not sure if I understand it correctly. What I tried to say was
> > the case of mixing a data word with a (sub-)key word in a block /
> > stream cipher, which can surely be regarded as many-times-repeated
> > multiplication by an in-memory constant value, and that is quite typical.
>
> Typical? In IDEA there's a lot of multiplication but each multiplicand
> gets used only once. The same for RC6.
>
Well, a subkey is used once to encrypt one plaintext block. It is used
repeatedly since there are many plaintext blocks to encrypt under it.

Anyway, you're right when you said that precomputation need not
necessarily save clock cycles since a memory read also costs
something.

> > I want to. I think it (the quadratic polynomial) is better than 2*u*z
> > + u + z, since the latter is linear in u and in z.
>
> I thought it was a mistake, since the rest was AFAIK the same as in
> IDEA. I was concerned with how you can invert the operation, but now I
> think you don't need to.
>
You're right. IDEA works similarly to Feistel ciphers in that the MA
block is computed only in the forward direction, both for encryption
and for decryption. So the cipher works even if the MA block is not a
multipermutation (indeed, it isn't) and even if it is not a
permutation (actually it is). So we never need to invert anything in
it, notably the squaring operation.

> However, is there any good way to invert x -> (2*x+1)*x? I know that
> working from the LSB to the MSB is possible; it could probably be sped
> up using some tables that process several bits at once. Iterating the
> operation many times must work, too, but how many times? Any better
> idea?
>
I have no idea.

> > > > I have never implemented it, but I guess an optimized implementation
> > > > takes about 15-20 clock cycles per byte on a modern processor.
>
> > > That's too slow, isn't it?
>
> > Really? I don't think so. Serpent yields about 75 MB/s on my 2-core
> > 2 GHz machine (under the latest version of TrueCrypt). That is 2*2E9 /
> > (75*1024*1024) = 51 clock cycles / byte.
>
> Agreed; with AES in TrueCrypt I get 450 MB/s on my 4-core AMD Phenom
> II at 2.8 GHz, which means 40.1 cycles / byte, but there's some
> overhead for the disk encryption. The fastest AES implementation
> possible on my CPU should take only about 10 cycles / byte.
>
Would you please give me a reason why you think that? There must be
some new AES implementation techniques -- new to me, at least -- that
I'd be glad to learn about.
From: Maaartin on
On Jun 19, 12:05 am, Tran Ngoc Duong <tranngocdu...(a)gmail.com> wrote:
> On Jun 19, 2:04 am, Maaartin <grajc...(a)seznam.cz> wrote:
> > On Jun 18, 12:32 am, Tran Ngoc Duong <tranngocdu...(a)gmail.com> wrote:
>
> Well, a subkey is used once to encrypt one plaintext block. It is used
> repeatedly since there are many plaintext blocks to encrypt under it.
>
> Anyway, you're right when you said that precomputation need not
> necessarily save clock cycles since a memory read also costs
> something.

Somehow, I didn't think past one block. Anyway, the precomputation may
or may not help. In your algorithm you need Z[i][0] and Z[i][3] to be
odd, and there's no reason not to ensure that in the key schedule.

> You're right. IDEA works similarly to Feistel ciphers in that the MA
> block is computed only in the forward direction, both for encryption
> and for decryption. So the cipher works even if the MA block is not a
> multipermutation (indeed, it isn't)

Multipermutations are quite rare, the only ones I know are based on
GF.

> and even if it is not a
> permutation (actually it is). So we never need to invert anything in
> it, notably the squaring operation.

OK. But the two multiplications in the group layer need to be
inverted, which requires division in the key schedule.

> > Agreed; with AES in TrueCrypt I get 450 MB/s on my 4-core AMD Phenom
> > II at 2.8 GHz, which means 40.1 cycles / byte, but there's some
> > overhead for the disk encryption. The fastest AES implementation
> > possible on my CPU should take only about 10 cycles / byte.
>
> Would you please give me a reason why you think that? There must be
> some new AES implementation techniques -- new to me, at least -- that
> I'd be glad to learn about.

There is a paper by Bernstein and Schwabe which claims about 169 clock
cycles for an AES block. The fastest AES I've found is a bit-sliced
implementation (by Kaesper and Schwabe) using SSE3 and achieving 7.08
cycles per byte on Intel Core i7 920 when processing 4KiB.

By switching to 64 bit you can get an algorithm working on 256-bit
blocks which takes on AMD64 about the same time per block. I wonder if
there's a cipher using the lower and upper parts of the product
separately.
From: Tran Ngoc Duong on
On Jun 19, 1:01 pm, Maaartin <grajc...(a)seznam.cz> wrote:
> Anyway, the precomputation may or may not help.
> In your algorithm you need Z[i][0] and Z[i][3] to be
> odd, and there's no reason not to ensure that in the key schedule.
>
No, I needn't. Since x.y is a group operation, every 32-bit number, no
matter whether odd or even, is invertible. For example, the inverse of
0 is 0 and the inverse of 2 is 1717986918.
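
(A rough C sketch, assuming x.y = 2*x*y + x + y mod 2^32 -- i.e.
ordinary multiplication of the odd residues 2*x+1 mod 2^33, mapped
back to 32 bits -- which is consistent with both examples:)

  #include <stdint.h>

  /* Assumed group operation: x.y = 2*x*y + x + y mod 2^32.
     The identity is 0, and every 32-bit value has an inverse. */
  static uint32_t dot(uint32_t x, uint32_t y)
  {
      return 2u * x * y + x + y;             /* wraps mod 2^32 */
  }

  /* Inverse of x: invert the odd residue 2*x+1 mod 2^33, then map the
     result t back via (t-1)/2. Each Newton/Hensel step doubles the
     number of correct low bits (3 -> 6 -> 12 -> 24 -> 48), so 4 steps
     suffice for 33 bits. */
  static uint32_t dot_inverse(uint32_t x)
  {
      uint64_t a = 2ull * x + 1ull;          /* odd, hence invertible */
      uint64_t t = a;                        /* a*a == 1 mod 8        */
      for (int i = 0; i < 4; i++)
          t = (t * (2ull - a * t)) & 0x1FFFFFFFFull;   /* mod 2^33 */
      return (uint32_t)((t - 1ull) >> 1);
  }

With these definitions dot(2, dot_inverse(2)) indeed comes out as 0,
the identity, matching the example above.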

My pseudo-code was meant to describe my algorithm so the key schedule
contains only 100 subkeys, no extra precomputed subkeys. And every
subkey has 32 bits of uncertainty.

> Multipermutations are quite rare, the only ones I know are based on
> GF.
>
The only ones I know are based on non-singular matrices over GF.

> > and even if it is not a
> > permutation (actually it is). So we don't care to invert anything in
> > it, notably, the squaring operation.
>
> OK. But the two multiplications in the group layer need to be
> inverted, which requires division in the key schedule.
>
Of course. Just like IDEA.

> There is a paper by Bernstein and Schwabe which claims about 169 clock
> cycles for an AES block. The fastest AES I've found is a bit-sliced
> implementation (by Kaesper and Schwabe) using SSE3 and achieving 7.08
> cycles per byte on Intel Core i7 920 when processing 4KiB.
>
Thank you. I've just read them. I may have found a newer version of
the Kaesper and Schwabe paper which claims to be good when processing
only 128 bytes, which is a big improvement in latency. I had never
thought that efficient bit slicing was possible at all for AES. Thanks
again.

> By switching to 64 bit you can get an algorithm working on 256-bit
> blocks which takes on AMD64 about the same time per block.
>
For 256-bit blocks we must reconsider key length and number of rounds.
It is very likely that we'll get double key length, double number of
rounds, double time per block, which is the same clock cycles per
byte, at best.

> I wonder if there's a cipher using the lower and upper parts of the product
> separately.

I know of no such ciphers. It must be hard to handle the statistical
imperfection, especially of the upper part. Nevertheless, the UMAC hash
function managed to use the full products by summing them.
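
As far as I remember, the inner layer of UMAC is the NH hash, which
works roughly as follows (a sketch from memory, not the exact UMAC
specification): message and key words are added in pairs mod 2^32, the
two sums are multiplied into a full 64-bit product, and the products
are summed mod 2^64.

  #include <stddef.h>
  #include <stdint.h>

  /* NH-style compression (sketch): sum of full 64-bit products of
     32-bit sums of message and key words. */
  static uint64_t nh_sketch(const uint32_t *m, const uint32_t *k,
                            size_t words)
  {
      uint64_t sum = 0;                      /* wraps mod 2^64 */
      for (size_t i = 0; i + 1 < words; i += 2) {
          uint32_t a = m[i]     + k[i];      /* mod 2^32 */
          uint32_t b = m[i + 1] + k[i + 1];  /* mod 2^32 */
          sum += (uint64_t)a * b;            /* full product */
      }
      return sum;
  }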
From: Maaartin on
On Jun 19, 6:10 pm, Tran Ngoc Duong <tranngocdu...(a)gmail.com> wrote:
> On Jun 19, 1:01 pm, Maaartin <grajc...(a)seznam.cz> wrote:
> > Anyway, the precomputation may or may not help.
> > In your algorithm you need Z[i][0] and Z[i][3] to be
> > odd, and there's no reason not to ensure that in the key schedule.
>
> No, I needn't. Since x.y is a group operation, every 32-bit number, no
> matter whether odd or even, is invertible. For example, the inverse of
> 0 is 0 and the inverse of 2 is 1717986918.

You wrote
x1 += Z[i][0] * (2*x1 + 1);
and not
x1 += Z[i][0] . x1
so I assumed you replaced the IDEA operation by a standard
multiplication, didn't you?

> My pseudo-code was meant to describe my algorithm so the key schedule
> contains only 100 subkeys, no extra precomputed subkeys. And every
> subkey has 32 bits of uncertainty.
>
> > OK. But the two multiplications in the group layer need to be
> > inverted, which requires division in the key schedule.
>
> Of course. Just like IDEA.
>
> > There is a paper by Bernstein and Schwabe which claims about 169 clock
> > cycles for an AES block. The fastest AES I've found is a bit-sliced
> > implementation (by Kaesper and Schwabe) using SSE3 and achieving 7.08
> > cycles per byte on Intel Core i7 920 when processing 4KiB.
>
> Thank you. I've just read them. I may have found a newer version of
> the Kaesper and Schwabe paper which claims to be good when processing
> only 128 bytes, which is a big improvement in latency. I had never
> thought that efficient bit slicing was possible at all for AES. Thanks
> again.
>
> > By switching to 64 bit you can get an algorithm working on 256-bit
> > blocks which takes on AMD64 about the same time per block.
>
> For 256-bit blocks we must reconsider key length and number of rounds.
> It is very likely that we'll get double key length, double number of
> rounds, double time per block, which is the same clock cycles per
> byte, at best.

I'm afraid any change invalidates most of the analysis done, and we
can't redo it. I believe that for the same security level no doubling
of the number of rounds is necessary. All the operations work as well
with 64 bits as with 32 bits, and the multiplication mixes even
better. It may take longer until all bits diffuse into all others, but
IMHO each bit only needs to diffuse into as many bits as in the 32-bit
version in order to get the same security.

> > I wonder if there's a cipher using the lower and upper parts of the product
> > separately.
>
> I know of no such ciphers. It must be hard to handle the statistical
> imperfection, especially of the upper part.

Agreed, the upper part is quite non-uniform but somehow stronger than
the lower one. IDEA uses something similar to the difference between
the lower and the upper part, so I'd suppose using both parts
separately may work, too; you'd need some other operation to ensure
diffusion, and this may really get complicated.
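
By "the difference between the lower and the upper part" I mean the
usual low-high trick for implementing IDEA's multiplication mod
2^16+1; a rough sketch (0 represents 2^16, as in IDEA):

  #include <stdint.h>

  /* IDEA multiplication mod 2^16 + 1. For p = a*b < 2^32:
     p mod (2^16+1) = lo - hi, plus 2^16+1 if lo < hi,
     where lo = p mod 2^16 and hi = p >> 16. */
  static uint16_t idea_mul(uint16_t a, uint16_t b)
  {
      if (a == 0) return (uint16_t)(1u - b); /* 2^16 * b == -b */
      if (b == 0) return (uint16_t)(1u - a);
      uint32_t p  = (uint32_t)a * b;
      uint16_t lo = (uint16_t)p;
      uint16_t hi = (uint16_t)(p >> 16);
      return (uint16_t)(lo - hi + (lo < hi));
  }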

> Nevertheless, the UMAC hash
> function managed to use the full products by summing them.

That's different from what I meant.