From: Tran Ngoc Duong on
On Jun 17, 12:21, Maaartin <grajc...(a)seznam.cz> wrote:
> On Jun 17, 5:45 am, Tran Ngoc Duong <tranngocdu...(a)gmail.com> wrote:
>
> > On Jun 16, 7:59 pm, Maaartin <grajc...(a)seznam.cz> wrote:
> > > Right. Could you give me a reference in the literature?
>
> > I'm afraid I can't. Personally, I observed the group property myself
> > some time around 1994, when I was impressed by the "no S-boxes"
> > concept of the IDEA block cipher and tried to "design" something like
> > it for 32-bit machines. The group property of 2*x*y + x + y came to me
> > first. When I found the isomorphism to the multiplicative group of odd
> > integers, the generalization to (1<<m)*k*x*y + x + y followed easily. As
> > an undergraduate student without formal math-phys training, and within
> > my (very limited) knowledge, I was not aware of such references in the
> > literature. I'm still not aware of any even now.
>
> I asked only because I thought there might be more information there. I
> was impressed by IDEA, too, and I still wonder why it uses its
> pseudo-multiplication instead of something like 2*x*y + x + y, which can
> be written as x + (2*x+1)*y and executed using only 3 instructions on
> x86 architectures using LEA.
>
And, on any architecture, if memory is not much of a concern, one may
precompute 2*x + 1 to save one instruction, as M. K. Shen has pointed
out.
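In C, a minimal sketch (mix and mix_pre are only illustrative names):

#include <stdint.h>

// the group operation as key mixing: 2*x*y + x + y == x + (2*x + 1)*y (mod 2^32)
uint32_t mix(uint32_t x, uint32_t y)
{
    return x + (2*x + 1)*y;   // on x86, 2*x + 1 is a single LEA: lea, imul, add
}

// with the odd multiplier k = 2*x + 1 precomputed at key setup,
// each use costs only one multiplication and one addition
uint32_t mix_pre(uint32_t x, uint32_t k, uint32_t y)
{
    return x + k*y;
}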

I think Lai & Massey's choice of multiplication mod (2^n + 1),
rather than something like 2*x*y + x + y, was the right decision, after
balancing implementation complication against mathematical complexity
(aka, the level of "confusion").

Integer multiplication in those days was very time-consuming in
comparison to addition and jumps, so from the implementation point of
view the two operations are, in essence, equally complicated. From the
mathematical point of view, on the other hand, multiplication mod
(2^n + 1) is better since it is more "confusing". If my memory serves,
Lai has proved that the multiplication mod (2^16 + 1), unlike
2*x*y + x + y, cannot be written as a polynomial over the ring Z(2^16).
And maybe it is superior to 2*x*y + x + y with respect to differential
characteristics (under relevant definitions of the "difference"), I think.
From: Maaartin on
On Jun 17, 9:13 am, Tran Ngoc Duong <tranngocdu...(a)gmail.com> wrote:
> On Jun 17, 12:21, Maaartin <grajc...(a)seznam.cz> wrote:
> And, on any architecture, if memory is not much of a concern, one may
> precompute 2*x + 1 to save one instruction, as M. K. Shen has pointed
> out.

You regard x as a constant here, right? Then it can be embedded in the
instruction on most architectures. The memory access may be more
costly than the computation on modern architectures, even when LEA is
not available.

> I think Lai & Massey's choice of multiplication mod (2^n + 1),
> rather than something like 2*x*y + x + y, was the right decision, after
> balancing implementation complication against mathematical complexity
> (aka, the level of "confusion").

I see. Their pseudo-multiplication uses the higher half of the
product, which is much better than the lower one.

> Integer multiplication in those days was very time-consuming in
> comparison to addition and jumps, so from the implementation point of
> view the two operations are, in essence, equally complicated.

Currently the situation is very different: jumps are the most time-
consuming operations. OTOH, the jumps will be predicted correctly most
of the time.

Somehow, I'm uncomfortable with the definition "Multiplication modulo
2**16+1, where the all-zero word (0x0000) is interpreted as 2**16".
The product may be 2**16, which then has to be treated as 0, and that
is not obvious from the definition.
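For what it's worth, a common way to write it in C (a sketch only; mul16
is just an illustrative name) handles both conventions up front and folds
the high half of the product back in:

#include <stdint.h>

// IDEA-style multiplication modulo 2^16+1, with 0x0000 standing for 2^16;
// since 2^16 == -1 (mod 2^16+1), a*b == lo - hi (mod 2^16+1)
uint16_t mul16(uint16_t a, uint16_t b)
{
    if (a == 0) return (uint16_t)(1 - b);    // a stands for 2^16 == -1
    if (b == 0) return (uint16_t)(1 - a);    // b stands for 2^16 == -1
    uint32_t p  = (uint32_t)a * b;
    uint16_t lo = (uint16_t)p;
    uint16_t hi = (uint16_t)(p >> 16);
    return (uint16_t)(lo - hi + (lo < hi));  // +1 fixes the wrap when lo < hi
}

A result of 2^16 comes out as 0x0000 automatically here, but the prose
definition doesn't make that obvious.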

> From the mathematical point of view, on the other hand, multiplication
> mod (2^n + 1) is better since it is more "confusing". If my memory
> serves, Lai has proved that the multiplication mod (2^16 + 1), unlike
> 2*x*y + x + y, cannot be written as a polynomial over the ring Z(2^16).

Right, I've just found it. Actually he proved that for any fixed x > 1
it can't be written as a polynomial in y.

> And maybe it
> is superior to 2*x*y + x + y with respect to differential characteristics
> (under relevant definitions of the "difference"), I think.

In the expression 2*x*y+x+y there's no propagation to lower order
bits, which is surely bad (but makes it obviously invertible even when
using xor instead of add).
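For the add version, the inverse is cheap to write down; a minimal sketch
in C (inv32 and unmix are only illustrative names):

#include <stdint.h>

// inverse of an odd a modulo 2^32 by Newton/Hensel lifting;
// each step doubles the number of correct low bits (3 -> 6 -> 12 -> 24 -> 48)
uint32_t inv32(uint32_t a)
{
    uint32_t x = a;              // a*a == 1 (mod 8) for odd a, so a inverts itself mod 8
    for (int i = 0; i < 4; i++)
        x *= 2 - a*x;
    return x;
}

// undo c = x + (2*x + 1)*y (mod 2^32) for a known x
uint32_t unmix(uint32_t c, uint32_t x)
{
    return (c - x) * inv32(2*x + 1);
}

With xor in place of the addition it stays invertible, but y has to be
recovered bit by bit from the bottom up, since bit k of the output never
depends on higher bits of y.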
From: Tran Ngoc Duong on
On Jun 17, 15:20, Maaartin <grajc...(a)seznam.cz> wrote:
> On Jun 17, 9:13 am, Tran Ngoc Duong <tranngocdu...(a)gmail.com> wrote:
>
> > On Jun 17, 12:21, Maaartin <grajc...(a)seznam.cz> wrote:
> > And, on any architecture, if memory is not much of a concern, one may
> > precompute 2*x + 1 to save one instruction, as M. K. Shen has pointed
> > out.
>
> You regard x as a constant here, right?
>
Yes.

> Then it can be embedded in the instruction on most architectures.
> The memory access may be more costly than the computation on modern
> architectures, even when LEA is not available.
>
Even if it is an expanded key, which is quite short and is accessed
very frequently?

> > I think Lai & Massey's choice of multiplication mod (2^n + 1),
> > rather than something like 2*x*y + x + y, was the right decision, after
> > balancing implementation complication against mathematical complexity
> > (aka, the level of "confusion").
>
> I see. Their pseudo-multiplication uses the higher half of the
> product, which is much better than the lower one.
>
Right. That's exactly what gave me uncertain feelings about the group
2*x*y + x + y.

> > Integer multiplication in those days was very time-consuming in
> > comparison to addition and jumps, so from the implementation point of
> > view the two operations are, in essence, equally complicated.
>
> Currently the situation is very different: jumps are the most time-
> consuming operations. OTOH, the jumps will be predicted correctly most
> of the time.
>
And we have yet to take into account the timing variance. In 1994 most
of us were happy, as we weren't aware of timing attacks.

Given that multiplication is much cheaper now, an IDEA-like
cipher based on permutation polynomials, even with substantially more
rounds (say, 16), turns out to be quite comparable in speed with well-
known ciphers that use no memory lookup tables, e.g. Serpent or RC6. (See
the pseudo-code below.)

> Somehow, I'm uncomfortable with the definition "Multiplication modulo
> 2**16+1, where the all-zero word (0x0000) is interpreted as 2**16".
> The product may be 2**16, which then has to be treated as 0, and that
> is not obvious from the definition.
>
Maybe "multiplication modulo 2**16+1, where largest value 2**16 =
0x10000 is exceptionally represented by the all-zero word (0x0000)"
sounds better?

> > From the mathematical point of view, on the other hand, multiplication
> > mod (2^n + 1) is better since it is more "confusing". If my memory
> > serves, Lai has proved that the multiplication mod (2^16 + 1), unlike
> > 2*x*y + x + y, cannot be written as a polynomial over the ring Z(2^16).
>
> Right, I've just found it. Actually he proved that for any fixed x > 1
> it can't be written as a polynomial in y.
>
Thanks. It was so long ago that I can't remember exactly what was
proved without re-reading his monograph.

> > And maybe it
> > is superior to 2*x*y + x + y with respect to differential characteristics
> > (under relevant definitions of the "difference"), I think.
>
> In the expression 2*x*y+x+y there's no propagation to lower order
> bits, which is surely bad (but makes it obviously invertible even when
> using xor instead of add).
>
Yes. Nevertheless the usual method (I mean rotation), which propagates
it to the low bits in the next round(s), may reduce the harm.

Do you think it's worth considering something like this:

// an IDEA-like cipher for 32-bit machines,
// using the permutation polynomials 2*x*y + x + y and x*(2*x + 1).
// pseudo-code only; "<<<" denotes left rotation of a 32-bit word.

unsigned Z[17][6]; // expanded key; the final layer uses only Z[16][0..3]
unsigned x1, x2, x3, x4, u, v, z1, z2, z3, z4, z5;
int i;

(z1, z2, z3, z4, z5) = key;
expand (z1, z2, z3, z4, z5) into Z[][];

(x1, x2, x3, x4) = plaintext;
for (i = 0; i < 16; i++)
{
    // The group layer: combine with subkeys via 2*x*y + x + y (x1, x4) and xor (x2, x3)
    x1 += Z[i][0] * (2*x1 + 1);
    x2 ^= Z[i][1];
    x3 ^= Z[i][2];
    x4 += Z[i][3] * (2*x4 + 1);

    // The MA block: diffusion via the quadratic x*(2*x + 1) and rotations
    u = (x1 ^ x3) <<< 15;
    v = (x2 ^ x4) <<< 15;
    u += Z[i][4];
    u = u*(2*u + 1) <<< 15;
    v += u;
    v ^= Z[i][5];
    v = v*(2*v + 1) <<< 15;
    u += v;
    x1 ^= v;
    x2 ^= u;
    x3 ^= v;
    x4 ^= u;
    swap x2, x3;
}

// The final group layer (i == 16 here)
x1 += Z[i][0] * (2*x1 + 1);
x2 ^= Z[i][1];
x3 ^= Z[i][2];
x4 += Z[i][3] * (2*x4 + 1);
swap x2, x3;

ciphertext = (x1, x2, x3, x4);

I have never implemented it, but I guess an optimized implementation
takes about 15-20 clock cycles per byte on a modern processor.
From: Maaartin on
On Jun 17, 7:18 pm, Tran Ngoc Duong <tranngocdu...(a)gmail.com> wrote:
> On Jun 17, 15:20, Maaartin <grajc...(a)seznam.cz> wrote:
> > On Jun 17, 9:13 am, Tran Ngoc Duong <tranngocdu...(a)gmail.com> wrote:
> > Then it can be embedded in the instruction on most architectures.
> > The memory access may be more costly than the computation on modern
> > architectures, even when LEA is not available.
>
> Even if it is an expanded key, which is quite short and is accessed
> very frequently?

The memory read can take place concurrently with other operations,
within only one cycle. It may become a bottleneck if there are other
memory accesses, or it can help to free another busy resource. But I
don't think that such a many-times-repeated multiplication by a non-
constant value is frequent.

> Right. That's exactly what gave me uncertain feelings about the group
> 2*x*y + x + y.

There's an algorithm using something similar: RC6, but they choose a
sort of squaring instead.
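If I remember correctly, RC6 uses f(x) = x*(2*x + 1) rotated left by
lg(w) bits; a sketch for 32-bit words:

#include <stdint.h>

// the RC6-style quadratic, as I recall it: t = x*(2*x + 1) <<< 5
uint32_t rc6_f(uint32_t x)
{
    uint32_t t = x*(2*x + 1);
    return (t << 5) | (t >> 27);  // rotate left by log2(32) = 5
}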

> And we have yet to take into account the timing variance. In 1994 most
> of us were happy, as we weren't aware of timing attacks.

Right. But multiplication is considered to be evil, since on some
architectures it may lead to a timing attack as well.

> Given that multiplication is much cheaper now, an IDEA-like
> cipher based on permutation polynomials, even with substantially more
> rounds (say, 16), turns out to be quite comparable in speed with well-
> known ciphers that use no memory lookup tables, e.g. Serpent or RC6. (See
> the pseudo-code below.)
>
> > Somehow, I'm uncomfortable with the definition "Multiplication modulo
> > 2**16+1, where the all-zero word (0x0000) is interpreted as 2**16".
> > The product may be 2**16, which then has to be treated as 0, and that
> > is not obvious from the definition.
>
> Maybe "multiplication modulo 2**16+1, where largest value 2**16 =
> 0x10000 is exceptionally represented by the all-zero word (0x0000)"
> sounds better?

I don't think so: You need 0 -> 0x10000 before and 0x10000 -> 0 after
the multiplication. This probably can't be expressed in a really short
and clear way.

> Yes. Nevertheless the usual method (I mean rotation), which propagates
> it to the low bits in the next round(s), may reduce the harm.
>
> Do you think it's worth considering something like this:

I can't tell. But you haven't got it right: there's no data * data
multiplication in IDEA, so the line
u = u*(2*u + 1) <<< 15;
must be wrong. Unless it's what you want to do.

> I have never implemented it, but I guess an optimized implementation
> takes about 15-20 clock cycles per byte on a modern processor.

That's too slow, isn't it?
From: Tran Ngoc Duong on
On Jun 18, 02:27, Maaartin <grajc...(a)seznam.cz> wrote:
> On Jun 17, 7:18 pm, Tran Ngoc Duong <tranngocdu...(a)gmail.com> wrote:
>
> > On Jun 17, 15:20, Maaartin <grajc...(a)seznam.cz> wrote:
> > > On Jun 17, 9:13 am, Tran Ngoc Duong <tranngocdu...(a)gmail.com> wrote:
> > > Then it can be embedded in the instruction on most architectures.
> > > The memory access may be more costly than the computation on modern
> > > architectures, even when LEA is not available.
>
> > Even if it is an expanded key, which is quite short and is accessed
> > very frequently?
>
> The memory read can take place concurrently with other operations,
> within only one cycle. It may become a bottleneck if there are other
> memory accesses, or it can help to free another busy resource. But I
> don't think that such a many-times-repeated multiplication by a non-
> constant value is frequent.
>
I'm not sure if I understand it correctly. What I was trying to
describe was the case of mixing a data word with a (sub-)key word in a
block / stream cipher, which may surely be regarded as many-times-
repeated multiplication by an in-memory constant value, and it is quite
typical.

> > Right. That's exactly what gave me uncertain feelings about the group
> > 2*x*y + x + y.
>
> There's an algorithm using something similar: RC6, but they choose a
> sort of squaring instead.
>
I see. Exactly that squaring was used in my pseudo-code.

> > And we have yet to take into account the timing variance. In 1994 most
> > of us were happy, as we weren't aware of timing attacks.
>
> Right. But multiplication is considered to be evil, since on some
> architectures it may lead to a timing attack as well.
>
"Evil" sounds somehow too strong to me. Multiplication is simply not
suitable for these architectures -- legacy or low-end chips, I think.
Just like memory lookup for architectures with non-constant memory
access time.

> > Do you think it's worth considering something like this:
>
> I can't tell. But you haven't got it right: there's no data * data
> multiplication in IDEA, so the line
>   u = u*(2*u + 1) <<< 15;
> must be wrong. Unless it's what you want to do.
>
I want to. I think it (the quadratic polynomial) is better than 2*u*z
+ u + z, since the latter is linear in u and in z.

> > I have never implemented it, but I guess an optimized implementation
> > takes about 15-20 clock cycles per byte on a modern processor.
>
> That's too slow, isn't it?
>
Really? I don't think so. Serpent yields about 75 MB/s on my 2-core
2GHz machine (under the latest version of TrueCrypt). That is 2*2E9 /
(75*1024*1024) = 51 clock cycles / byte.