From: Unruh on
Peter Fairbrother <zenadsl6186(a)zen.co.uk> writes:

>Dmitry Chumack wrote:

>> Hi *
>>
>> I have a question. There are a few dozen __terabytes__ of data. I need
>> to split this data into blocks of fixed size. Then I need to calculate a
>> sha1 hash for each block and store it. The question is: which minimal
>> block size do I have to choose to minimize the probability of two equal
>> hashes existing? What would the probability of such a coincidence be if
>> I used an md5, crc32 or sha2 algorithm instead of sha1? Or if I used a
>> combination of these algorithms (e.g. sha1 and crc32, or sha2 and
>> md5 and crc32)?
>>
>> Thanks in advance.
>>

>For n << b a good approximation of the probability of a collision is
>(n^2)/(2^b) where n is the number of blocks and b is the size of the hash in
>bits.

I think you mean if n < 2^(b/2), not n << b.
A better way of writing this is 1 - exp(-n^2/2^b), which is good for n << 2^b.
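
If anyone wants to plug in their own numbers, here is a quick sketch of both
estimates in Python (the function name is just mine, nothing standard):

from math import expm1

def collision_probability(n, b):
    # Rough chance that n random b-bit hashes contain at least one collision.
    crude = n * n / 2.0 ** b              # the n^2 / 2^b estimate used in this thread
    better = -expm1(-n * n / 2.0 ** b)    # 1 - exp(-n^2/2^b); same thing for small values
    # (The textbook birthday form uses n(n-1)/2^(b+1), about a factor of 2 smaller,
    # which is irrelevant at these scales.)
    return crude, better

# Peter's example below: 1,000,000 blocks hashed with SHA-1 (b = 160)
print(collision_probability(10**6, 160))  # both about 7e-37, i.e. roughly 1 in 10^36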



>That's assuming no-one is trying to cheat and create blocks to be hashed
>which hash to the same hash.

>Of course any two blocks which are the same will have the same hash.

And for most data you hash, two identical blocks are far more likely than
two different blocks with the same hash, because the content of almost all
large files is not uniformly distributed.


>Combinations could be useful if you record and use the results of both
>hashes, so e.g. a 128-bit hash and a 256-bit hash would give 384 bits of
>result, and b would be 384 in this case (assuming the hash functions are
>independent). But there isn't much point.

Agreed -- and this does not solve the same input -> same output problem (if
indeed it is a problem).
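
If you did want the combined tag anyway, it is just the two digests
concatenated; a quick Python sketch with hashlib, using MD5 + SHA-256 to get
the 384 bits mentioned above (purely illustrative):

import hashlib

def combined_digest(block):
    # Concatenate an MD5 and a SHA-256 digest: 128 + 256 = 384 bits total.
    # Identical blocks of course still give identical combined tags.
    return hashlib.md5(block).digest() + hashlib.sha256(block).digest()

print(len(combined_digest(b"example block contents")) * 8)   # 384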




>To put this into real numbers - using a SHA-1 hash with 160 bits of output,
>and supposing 1,000,000 blocks (each a few dozen MB in size) the chance of a
>collision would be about 1 in 10^36. That ain't gonna happen.

>For 1,000,000,000 blocks (each a few dozen kB in size) the chance of a
>collision would be about 1 in 10^30. That ain't gonna happen either.


>--
>Peter Fairbrother

From: Unruh on
"Sergei" <silentser(a)gmail.com> writes:

>Joseph Ashwood wrote:
>> "Dmitry Chumack" <saint.d.a(a)gmail.com> wrote in message
>> news:1165339092.911172.123630(a)79g2000cws.googlegroups.com...
>> > Hi *
>> >
>> > I have a question. There are a few dozen __terabytes__ of data. I need
>> > to split this data into blocks of fixed size. Then I need to calculate a
>> > sha1 hash for each block and store it. The question is: which minimal
>> > block size do I have to choose to minimize the probability of two equal
>> > hashes existing? What would the probability of such a coincidence be if
>> > I used an md5, crc32 or sha2 algorithm instead of sha1? Or if I used a
>> > combination of these algorithms (e.g. sha1 and crc32, or sha2 and
>> > md5 and crc32)?
>>
>> Since you're talking about 2 generated values colliding, in a cryptographic
>> hash it takes on average sqrt(2^output size) hashes; for MD5 you'll need about
>> 2^64, and for SHA-1 about 2^80 hashes generated before you collide. Considering
>> the size of your data pile, you'll probably get collisions sooner because
>> there is almost certainly duplicate data.
>>
>> Here's how I'd do it, and I'd include encryption as well:
>> Encrypt everything with AES in CTR mode, the odds of a collision in the data
>> are now very small, about 1/16777216 for a petabyte of data. Then I'd use
>> SHA512 with 512KB chunks and build up a Merkle tree, this will save you
>> enormous time later on in figuring out what was changed or damaged. Assuming
>> 1PB this will give you 2*10^9 leaf blocks, then holding to the 512KB chunk,
>> you can store ~8000 of them per chunk with proper indexing (this will be shy
>> of the 8192 apparent maximum if you implement the indexing that is necessary
>> for security). This gives 268435 chunks, building a third level of 34
>> chunks, and a root/top/fourth level that is tiny. For a total of about
>> 2*10^9 chunks, and 2*10^9 << 2^256 (~10^77), which is the limit for SHA512.
>> Each of these chunks will fit
>> conveniently on a floppy disk, 1200 of them to a CD, and I don't even want
>> to do the math on the other options. On a system with modern cpu you should
>> be able to achieve this in about a month of compute time (assuming a quad
>> core, scale appropriately), Tom's benchmark website
>> http://libtomcrypt.com:80/ltc113.html has more information to help you run
>> the numbers. The overall odds of a collision in this are in the neighborhood
>> of 1/10^68, which is quite frankly not gonna happen.
>>
>> It will be reasonably secure though, so don't lose the AES key. Without the
>> AES encryption the odds of collision rise sharply, but assuming your data
>> cannot be compressed beyond a reasonable amount they are still negligible,
>> and if two blocks do collide you don't have to ship one of them because
>> they are identical.
>> Joe
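
For what it's worth, the Merkle-tree part of that scheme is easy to sketch; a
minimal Python illustration using hashlib.sha512, with a binary fan-out rather
than Joe's ~8000-way chunks, and ignoring the indexing and the AES step:

import hashlib

def sha512(data):
    return hashlib.sha512(data).digest()

def merkle_root(leaf_hashes):
    # Fold the leaf digests up to a single root, hashing nodes pairwise.
    level = list(leaf_hashes)
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])   # duplicate the last node on odd-sized levels
        level = [sha512(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

# Leaves are the SHA-512 digests of the fixed-size chunks (512KB in Joe's layout);
# tiny stand-in chunks here just to show the shape of the computation.
chunks = [b"chunk-0", b"chunk-1", b"chunk-2"]
print(merkle_root([sha512(c) for c in chunks]).hex())

Comparing roots, or any interior node, quickly pins down which chunk changed,
which is where the time saving Joe mentions comes from.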

>How do you get such estimations for MD5 (2^64) and SHA-1 (2^80)? As I
>understand, such numbers can be given for truly random hash functions,
>and MD5 and SHA are just algorithms that try to emulate such things.

Because as far as is known, both are uniformly distributed functions.



>Sergei

From: Carlos Moreno on
Sergei wrote:

> How do you get such estimations for MD5 (2^64) and SHA-1 (2^80)? As I
> understand, such numbers can be given for truly random hash functions,
> and MD5 and SHA are just algorithms that try to emulate such things.

Let f(x) = e^x + 5

If X is a random variable, then f(X) is also a random variable; notice
that f is a perfectly defined deterministic function --- if its argument
is a random variable, the "mapped" result of applying the function is
also a random variable.

MD5 and SHA-1 are functions --- it makes sense to talk about the
probability of collision for two randomly generated input messages.
If these two input messages are independent instances of a uniformly
distributed random variable, then the probabilities of collision are
2^64 and 2^80 in the ideal case (that is, if those hash functions
indeed exhibit the right properties --- at the very least, it makes
sense to make that assumption to simplify the reasoning)
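
The "function of a random variable" point is also easy to see empirically:
feed uniformly random inputs to SHA-1 and the outputs look uniform as well.
A quick Python sketch, nothing rigorous:

import hashlib, os
from collections import Counter

# Hash 100,000 uniformly random 32-byte inputs and look at the first output
# byte: if the hash behaves like a uniform function, every value 0..255
# should show up roughly 100000/256 ~ 391 times.
counts = Counter(hashlib.sha1(os.urandom(32)).digest()[0] for _ in range(100000))
print(min(counts.values()), max(counts.values()))   # both in the neighborhood of 391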

Carlos
--
From: Sergei on

Unruh wrote:
> "Sergei" <silentser(a)gmail.com> writes:
>
> >Joseph Ashwood wrote:
> >> "Dmitry Chumack" <saint.d.a(a)gmail.com> wrote in message
> >> news:1165339092.911172.123630(a)79g2000cws.googlegroups.com...
> >> > Hi *
> >> >
> >> > I have a question. There is few doesns of __terabytes__ of data. I need
> >> > to split this data to blocks of fixed size. Than I need to calculate a
> >> > sha1 hash for each block and store it. The question is which minimal
> >> > size of block I have to choose to minimize the probability of existence
> >> > of two equal hashes? What probability of such a coincidence would be if
> >> > I'll use an md5, crc32 or sha2 algorithms instead of sha1? Or if I'll
> >> > use a combination of this algorithms (e.g. sha1 and crc32 or sha2 and
> >> > md5 and crc32)?
> >>
> >> Since you're talking about 2 generated values colliding, in a cryptographic
> >> hash it takes on average 1/sqrt(2^output size), for MD5 you'll need about
> >> 2^64, and SHA about 2^80 hashes generated before you collide. Considering
> >> the size of your data pile, you'll probably get collisions sooner because
> >> there is almost certainly duplicate data.
> >>
> >> Here's how I'd do it, and I'd include encryption as well:
> >> Encrypt everything with AES in CTR mode, the odds of a collision in the data
> >> are now very small, about 1/16777216 for a petabyte of data. Then I'd use
> >> SHA512 with 512KB chunks and build up a Merkle tree, this will save you
> >> enormous time later on in figuring out what was changed or damaged. Assuming
> >> 1PB this will give you 2*10^9 leaf blocks, then holding to the 512KB chunk,
> >> you can store ~8000 of them per chunk with proper indexing (this will be shy
> >> of the 8192 apparent maximum if you implement the indexing that is necessary
> >> for security). This gives 268435 chunks, building a third level of 34
> >> chunks, and a root/top/fourth level that is tiny. For a total of about
> >> 2*10^9 chunks, 2*10^9
> >> <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
> >> 2^256 (~10^77) which is the limit for SHA512. Each of these chunks will fit
> >> conveniently on a floppy disk, 1200 of them to a CD, and I don't even want
> >> to do the math on the other options. On a system with modern cpu you should
> >> be able to achieve this in about a month of compute time (assuming a quad
> >> core, scale appropriately), Tom's benchmark website
> >> http://libtomcrypt.com:80/ltc113.html has more information to help you run
> >> the numbers. The overall odds of a collision in this are in the neighborhood
> >> of 1/10^68, which is quite frankly not gonna happen.
> >>
> >> It will be reasonably secure though, so don't lose the AES key. Without the
> >> AES encryption the odds of collision rise sharply, but assuming your data
> >> cannot be compressed beyond a reasonable amount still ignorable, and if they
> >> do collide you don't have to ship one of the blocks because they are
> >> identical.
> >> Joe
>
> >How do you get such estimations for MD5 (2^64) and SHA-1 (2^80)? As I
> >understand, such numbers can be given for truly random hash functions,
> >and MD5 and SHA are just algorithms that try to emulate such things.
>
> Because as far as is known, both are uniformly distributed functions.
>
>
>
> >Sergei

But if SHA-1 is a uniformly distributed function, how could the Chinese
researchers claim that it is possible to find a collision in 2^63 hash
operations?

Sergei

From: Mike Amling on
Carlos Moreno wrote:
>
> MD5 and SHA-1 are functions --- it makes sense to talk about the
> probability of collision for two randomly generated input messages.
> If these two input messages are independent instances of a uniformly
> distributed random variable, then the probabilities of collision are
> 2^64 and 2^80 in the ideal case

Those are pretty high probabilities. I presume you intended to type
2^-64 and 2^-80, but in the scenario you describe, a pair of random
inputs producing the same output, the probabilities are 2^-128 and
2^-160.
You can expect approximately one collision among 2^64 (resp. 2^80)
different inputs, because that many messages generate roughly 2^128
(resp. 2^160) pairs.
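
With a deliberately tiny hash both statements are easy to check empirically;
a rough Python sketch, truncating SHA-1 to 16 bits so collisions actually
show up:

import hashlib, os

def h(msg, bits=16):
    # SHA-1 truncated to a tiny output, just so collisions become observable.
    return int.from_bytes(hashlib.sha1(msg).digest(), "big") >> (160 - bits)

# Two random inputs collide with probability about 2^-16.
trials = 200000
pair_hits = sum(h(os.urandom(16)) == h(os.urandom(16)) for _ in range(trials))
print(pair_hits / trials)   # around 1.5e-5, i.e. roughly 2^-16

# Among about 2^(16/2) = 256 random inputs you expect on the order of one
# collision, since n inputs give about n^2/2 pairs, each colliding w.p. 2^-16.
seen, count = set(), 0
while True:
    count += 1
    v = h(os.urandom(16))
    if v in seen:
        break
    seen.add(v)
print(count)   # typically a few hundred inputs before the first collision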

> (that is, if those hash functions
> indeed exhibit the right properties --- at the very least, it makes
> sense to make that assumption to simplify the reasoning)

--Mike Amling