From: Unruh on
"Sergei" <silentser(a)gmail.com> writes:


>Joseph Ashwood wrote:
>> "Sergei" <silentser(a)gmail.com> wrote in message
>> news:1165502590.529789.195440(a)n67g2000cwd.googlegroups.com...
>> > Sure. But what random data has to do with "few doesns of __terabytes__
>> > of data", which should be partishioned in blocks and hashed?
>>
>> My estimate went to the level of 1024 TB in 512KB pieces, and scales very
>> well up to 2^275 bytes for which we don't even have designations.
>>
>> > As I
>> > understood, all the estimations here were made by assuming that the
>> > data was random.
>>
>> Actually they assume a much weaker constraint that the data wasn't chosen to
>> either collide on the block size or to break the hash you choose. My
>> proposal included encryption only because that is generally desirable in
>> such large situations.
>>
>> > But why do you think this "few doesns of __terabytes__
>> > of data" contain a random data?
>>
>> We don't.
>> Joe

>Ok, but aren't such estimations as "approximately one collision among
>2^64 (resp. 2^80) different inputs" can be made only for the ideal
>case when the hashes produce truly random outputs for any kinds of
>inputs? Basically, I want to know, what would say if I claim a slightly
>different estimation for MD5 (resp. SHA-1): approximately one collision
>among 2^62 (resp. 2^77) different inputs? Would you object?

I would ask you to justify that number.
So yes, I would object.

It is of course true that if I take the hash which is say text mod 2^128
and my input is the consecutive numbers from 1 to 2^127, I will get no
collisions for that input. Ie, although the output is uniform, the input is
not and that is reflected in the output. So there is an assumption being
made about both the hash and the input.




>I undestand, this is a little bit off topic, but there is no crypto
>specialists in my department and I'm interested in this topic.

>Sergei

From: Unruh on
Peter Fairbrother <zenadsl6186(a)zen.co.uk> writes:

>Unruh wrote:

>> Peter Fairbrother <zenadsl6186(a)zen.co.uk> writes:
>>
>>> Unruh wrote:
>>
>>>> "Sergei" <silentser(a)gmail.com> writes:

>>>>> Sorry for another dilettantish question, but what is "a uniformly
>>>>> distributed function"? I thought this is simply another name for a
>>>>> random oracle function. If this is a function that simpy gives
>>>>> uniformly distributed outputs when given a uniformly distributed inputs
>>>>> then, for example, F(x)=x is also uniformly distributed. But what does
>>>>> it have to do with the collision probabilities in the case described
>>>>> here?
>>>>
>>>> It is precisely relevant. He wants to define collisions as collisions
>>>> between different inputs producing the same output. Obviously if the inputs
>>>> are not uniform, then the outputs are not since the function is
>>>> deterministic ( same input always gives the same output.)
>>
>>> No, they aren't. In fact the precise opposite would be true of a good hash
>>> function. A good hash function should give randomly-distributed outputs
>>> whether or not the inputs were randomly distributed (as long as the inputs
>>> are different).
>>
>> Of course it would not. The outputs are deterministically dependent on the
>> input. Thus if the input is biased, so is the output.

>Eh?

>> Of course the biases
>> may not be obvious. For example, lets say that there are only two inputs,
>> then there are only two, definite, outputs.

>How does only having two samples bias the distribution?

>>>> Then assuming that the input is uniform, so is the output (or at least that
>>>> is the assumption)
>>
>>> No, that isn't the assumption.

>(There is no assumption that the input is uniformly distributed - a good
>hash function will provide uniformly distributed output from a skewed input
>distribution, it's one of it's properties[1].

>The assumption that a uniformly distributed input will give a uniformly
>distributed output is not one that I have used - but for the rest of this
>post let's assume it's true, and that the inputs are randomly chosen in a
>flat distribution.)
>
>>>> If that is true, then given 2^x different inputs, the
>>>> probability
>>>> of having a collision is 1-exp(-2^(2x-N)) for x<N.
>>
>>> No, it isn't. It wouldn't be even if that was true (which it sort-of is, but
>>
>> No what isn't? I am afraid your sentence is all tangled up in its
>> referents.

>An example - say x = 1 and N = 2, ie there are two inputs and four possible
>output values. At least that's what I take you to mean by x and N.

>The chance of a collision is then 0.25 [2], not 0.632120559... as your
>formula predicts.

Agreed. This whole discussion was about large N and x, and I shared that
assumption. For small n and x, the proper formula is
1-((2^N)!/(2^N-2^x)!(2^Nx)

Which is extremely well approximated by my formula if N is large, and x is
<N-2



>[1] strictly, if the output distribution is skewed it increases the
>probability of finding different pre-images which collide - and thus the
>hash would not be good.

>[2] the first token cannot collide, there is nothing to collide with - the
>second token then has a 1 in 4 chance of having the same value as (colliding
>with) the first token.


>--
>Peter Fairbrother

From: Sergei on
Peter Fairbrother wrote:
> Sergei wrote:
>
>

> (asuming that a good hash is actually possible ... but that's another story
> entirely)
>
> --
> Peter Fairbrother
>
> always get your retaliation in beforehand

But, actually, this was my question. By providing number of operations
required to get collision with such precision we assume that MD5 or
SHA-1 are "good hash" functions, which, as you've mentioned, is not
even known to exist. Wouldn't it be a more correct way to say it:
"Well, we would like to believe and hope that it will take
approximately 2^64 (2^80) operations, but everything is possible..."

It sounds dreary, but this is the subtlety I'm trying to grasp.

Sergei

From: Unruh on
"Sergei" <silentser(a)gmail.com> writes:

>Peter Fairbrother wrote:
>> Sergei wrote:
>>
>>

>> (asuming that a good hash is actually possible ... but that's another story
>> entirely)

>But, actually, this was my question. By providing number of operations
>required to get collision with such precision we assume that MD5 or
>SHA-1 are "good hash" functions, which, as you've mentioned, is not
>even known to exist. Wouldn't it be a more correct way to say it:
>"Well, we would like to believe and hope that it will take
>approximately 2^64 (2^80) operations, but everything is possible..."

>It sounds dreary, but this is the subtlety I'm trying to grasp.

Yes, some assumptions enter into the calculation. And everything is not
possible. It has been sufficiently tested that many possibilities have been
ruled out.