From: Tran Ngoc Duong on
On Jun 25, 7:04 pm, Mok-Kong Shen <mok-kong.s...(a)t-online.de> wrote:
> In the i-th round, one does the following:
>
>    W2 = fi_12(W1) ^ W2;   W1 = fi_21(W2) ^ W1;
>    W4 = fi_34(W3) ^ W4;   W3 = fi_43(W4) ^ W3;
>    W3 = fi_13(W1) ^ W3;   W1 = fi_31(W3) ^ W1;
>    W4 = fi_24(W2) ^ W4;   W2 = fi_42(W4) ^ W2;
>
The step 1

W2 = fi_12(W1) ^ W2; W1 = fi_21(W2) ^ W1;

is (in general) not a multipermutation. So the uncertainty of W1 does
not propagate completely to W3 in step 3.

This statistical weakness indicates that maybe more rounds is needed.
Thus the round structure is less efficient than it should be.
From: Mok-Kong Shen on
Tran Ngoc Duong wrote:
> Mok-Kong Shen wrote:
>> In the i-th round, one does the following:
>>
>> W2 = fi_12(W1) ^ W2; W1 = fi_21(W2) ^ W1;
>> W4 = fi_34(W3) ^ W4; W3 = fi_43(W4) ^ W3;
>> W3 = fi_13(W1) ^ W3; W1 = fi_31(W3) ^ W1;
>> W4 = fi_24(W2) ^ W4; W2 = fi_42(W4) ^ W2;
>>
> The step 1
>
> W2 = fi_12(W1) ^ W2; W1 = fi_21(W2) ^ W1;
>
> is (in general) not a multipermutation. So the uncertainty of W1 does
> not propagate completely to W3 in step 3.
>
> This statistical weakness indicates that maybe more rounds is needed.
> Thus the round structure is less efficient than it should be.

Certainly with this scheme more than one round will be required for
achieving any reasonable propagation of the influence of individual
input bits (avalanche). Do you have ideas of improving the given round
structure, while keeping the matter (coding etc.) as simple as possible?

Thanks.

M. K. Shen
From: Maaartin on
On Jun 26, 2:52 pm, Mok-Kong Shen <mok-kong.s...(a)t-online.de> wrote:
> Tran Ngoc Duong wrote:
> > Mok-Kong Shen wrote:
> >> In the i-th round, one does the following:
>
> >>     W2 = fi_12(W1) ^ W2;   W1 = fi_21(W2) ^ W1;
> >>     W4 = fi_34(W3) ^ W4;   W3 = fi_43(W4) ^ W3;
> >>     W3 = fi_13(W1) ^ W3;   W1 = fi_31(W3) ^ W1;
> >>     W4 = fi_24(W2) ^ W4;   W2 = fi_42(W4) ^ W2;
>
> > The step 1
>
> > W2 = fi_12(W1) ^ W2;   W1 = fi_21(W2) ^ W1;
>
> > is (in general) not a multipermutation. So the uncertainty of W1 does
> > not propagate completely to W3 in step 3.
>
> > This statistical weakness indicates that maybe more rounds is needed.
> > Thus the round structure is less efficient than it should be.

I'm not sure, if this matters. The operations are simple and I have no
idea how many rounds are necessary. However, better mixing should be
preferred.

> Certainly with this scheme more than one round will be required for
> achieving any reasonable propagation of the influence of individual
> input bits (avalanche). Do you have ideas of improving the given round
> structure, while keeping the matter (coding etc.) as simple as possible?

What you do (simplified; all functions are different and key
dependent):

W2 ^= f(W1); W1 ^= f(W2);
W4 ^= f(W3); W3 ^= f(W4);
W3 ^= f(W1); W1 ^= f(W3);
W4 ^= f(W2); W2 ^= f(W4);

You mixed W1 into W2 and back again. Intuitively, something like

W2 ^= f(W1);
W3 ^= f(W2);
W4 ^= f(W3);
W1 ^= f(W4);

should be more efficient. Neither of the schemata allows any
parallelism, so I'd prefer

W2 ^= f(W1); W4 ^= f(W3);
W1 ^= f(W4); W3 ^= f(W2);

which executes twice as much rounds per time unit on my computer. I
wonder if there's a theory behind.

Finally, since all the functions are bijective, you can do something
like

for (i=1; i<=4; ++i) W[i] = f(W[i]);
W = M * W

where W is the vector of all W1...4 and M is an MDS matrix. No idea
how good this can be. The function is obviously bijective, but
inverting would be inefficient (which doesn't matter here).
From: Mok-Kong Shen on
Maaartin wrote:

> What you do (simplified; all functions are different and key
> dependent):
>
> W2 ^= f(W1); W1 ^= f(W2);
> W4 ^= f(W3); W3 ^= f(W4);
> W3 ^= f(W1); W1 ^= f(W3);
> W4 ^= f(W2); W2 ^= f(W4);
>
> You mixed W1 into W2 and back again. Intuitively, something like
>
> W2 ^= f(W1);
> W3 ^= f(W2);
> W4 ^= f(W3);
> W1 ^= f(W4);
>
> should be more efficient.

I suppose you are comparing the first with 2 rounds of the second.
It would be very fine, if one could theoretically show that the second
case always result in a better 'mixing' than the first. (I was not very
sure, and thus had not used the second scheme.)

M. K. Shen

> Neither of the schemata allows any
> parallelism, so I'd prefer
>
> W2 ^= f(W1); W4 ^= f(W3);
> W1 ^= f(W4); W3 ^= f(W2);
>
> which executes twice as much rounds per time unit on my computer. I
> wonder if there's a theory behind.
>
> Finally, since all the functions are bijective, you can do something
> like
>
> for (i=1; i<=4; ++i) W[i] = f(W[i]);
> W = M * W
>
> where W is the vector of all W1...4 and M is an MDS matrix. No idea
> how good this can be. The function is obviously bijective, but
> inverting would be inefficient (which doesn't matter here).

From: Tran Ngoc Duong on
On Jun 26, 8:53 pm, Maaartin <grajc...(a)seznam.cz> wrote:
> On Jun 26, 2:52 pm, Mok-Kong Shen <mok-kong.s...(a)t-online.de> wrote:
>
> > Tran Ngoc Duong wrote:
> > > Mok-Kong Shen wrote:
> > >> In the i-th round, one does the following:
>
> > >> W2 = fi_12(W1) ^ W2; W1 = fi_21(W2) ^ W1;
> > >> W4 = fi_34(W3) ^ W4; W3 = fi_43(W4) ^ W3;
> > >> W3 = fi_13(W1) ^ W3; W1 = fi_31(W3) ^ W1;
> > >> W4 = fi_24(W2) ^ W4; W2 = fi_42(W4) ^ W2;
>
> > > The step 1
>
> > > W2 = fi_12(W1) ^ W2; W1 = fi_21(W2) ^ W1;
>
> > > is (in general) not a multipermutation. So the uncertainty of W1 does
> > > not propagate completely to W3 in step 3.
>
> > > This statistical weakness indicates that maybe more rounds is needed.
> > > Thus the round structure is less efficient than it should be.
>
> I'm not sure, if this matters. The operations are simple and I have no
> idea how many rounds are necessary.
>
So am I. There are examples and counter-examples of block ciphers
regarding the "complete diffusion" approach.

a) DES and CAST are counter-examples. The designers opted for non-
bijective f-functions, which effectively means that the uncertainty of
the first data half doesn't propagate to the second half completely.
(Note: I am silent about the propagation of the key bits.)

b) GOST and Skipjack are examples, as the f-functions used in these
Feistel structures are bijective. In Skipjack, furthermore, every data
word completely propagates to the remaining three.

c) There is also stronger examples (I don't remember reference) where
each round is a [classical, 2-block] Feistel structure whose f-
function is not only bijective but also orthomorphic. The orthomorphic
function, itself, may be implemented by a 2-round Feistel structure.
It is worth to note that this is effectively 4-block 'mixing'
structure which may fit Mok-Kong Shen's needs.

For the classical (2-block) Feistel structure there is an argument
speaking for the "complete diffusion" approach: the number of active
rounds in an iterative differential characteristics.

For an N-round cipher, an iterative differential characteristics may
consist of about N/2 active rounds if the cipher is of type a),
whereas the respective number is 2N/3 for type b), and 3N/4 for type
c).


> However, better mixing should be
> preferred.
> ...
> What you do (simplified; all functions are different and key
> dependent):
>
> W2 ^= f(W1); W1 ^= f(W2);
> W4 ^= f(W3); W3 ^= f(W4);
> W3 ^= f(W1); W1 ^= f(W3);
> W4 ^= f(W2); W2 ^= f(W4);
>
> You mixed W1 into W2 and back again. Intuitively, something like
>
> W2 ^= f(W1);
> W3 ^= f(W2);
> W4 ^= f(W3);
> W1 ^= f(W4);
>
> should be more efficient. Neither of the schemata allows any
> parallelism, so I'd prefer
>
> W2 ^= f(W1); W4 ^= f(W3);
> W1 ^= f(W4); W3 ^= f(W2);
>
> which executes twice as much rounds per time unit on my computer.
>
Mok-Kong Shen's algorithm allows the same parallelism as your
preferred algorithm, namely two parallel f-computations. If it was
slower on your computer that might be compiler's or programmer's
problem. So performance is not an issue there.

> I wonder if there's a theory behind.
>
Your non-preferred algorithm, its inverse, as well as your preferred
algorithm may be based on the model on 4-bit circular shift feedback
registers, with the generating polynomial

1 ^ x ^ x**4,
1 ^ x**3 ^ x**4, and
1 ^ x**2 ^ x**4, respectively.

In the open literature there are some articles of Knudsen, Wagner,
Robshaw and Reichardt (on the structure of Skipjack cipher). I believe
the articles give enough information to deduce the optimal number of
rounds, i.e. the maximum number of rounds that still prohibits the
existence of iterative (truncated) differential characteristics.