From: Dmitry Chumack on
Hi *

I have a question. There is few doesns of __terabytes__ of data. I need
to split this data to blocks of fixed size. Than I need to calculate a
sha1 hash for each block and store it. The question is which minimal
size of block I have to choose to minimize the probability of existence
of two equal hashes? What probability of such a coincidence would be if
I'll use an md5, crc32 or sha2 algorithms instead of sha1? Or if I'll
use a combination of this algorithms (e.g. sha1 and crc32 or sha2 and
md5 and crc32)?

Thanks in advance.

From: Christian Siebert on
> I have a question.
well, you've given three questions:

1)
> There is few doesns of __terabytes__ of data.
> I need to split this data to blocks of fixed size. Than I need to calculate a
> sha1 hash for each block and store it. The question is which minimal
> size of block I have to choose to minimize the probability of existence
> of two equal hashes?
A single block gives you a 0% collision probability (everything else
will give you more than 0%).

2)
> What probability of such a coincidence would be if I'll use an md5,
> crc32 or sha2 algorithms instead of sha1?
If we assume that each hash is equally likely this probability depends
only on the length of your hash. The md5 hash, for example, has a
length of 128 bit. Therefore the probability that your data gives a
certain hash value is 1/(2^128). It gets a bit more complicated if you
have several blocks => read about the birthday paradoxon.

3)
> Given that the cryptographic hashes are much longer than e.g. crc32,
> Or if I'll use a combination of this algorithms (e.g. sha1 and crc32
> or sha2 and md5 and crc32)?
Cryptographic hashes are constructed to avoid collisions and therefore
they are much longer. It is easy to find collisions with checksum like
crc-32 (because they are not meant to avoid them).

Giving the above hints, you should try to estimate the collision
probability by yourself and think again about your questions. ;-)

Christian

From: Mike Amling on
Dmitry Chumack wrote:
> Hi *
>
> I have a question. There is few doesns of __terabytes__ of data. I need
> to split this data to blocks of fixed size. Than I need to calculate a
> sha1 hash for each block and store it. The question is which minimal
> size of block I have to choose to minimize the probability of existence
> of two equal hashes? What probability of such a coincidence would be if
> I'll use an md5, crc32 or sha2 algorithms instead of sha1? Or if I'll
> use a combination of this algorithms (e.g. sha1 and crc32 or sha2 and
> md5 and crc32)?

If you use SHA-1, you won't have any collisions. If you use MD5, you
won't have any collisions unless the data has been specially constructed
to produce a collision.

--Mike Amling
From: Unruh on
Mike Amling <spamonly(a)allspam.com> writes:

>Dmitry Chumack wrote:
>> Hi *
>>
>> I have a question. There is few doesns of __terabytes__ of data. I need
>> to split this data to blocks of fixed size. Than I need to calculate a
>> sha1 hash for each block and store it. The question is which minimal
>> size of block I have to choose to minimize the probability of existence
>> of two equal hashes? What probability of such a coincidence would be if
>> I'll use an md5, crc32 or sha2 algorithms instead of sha1? Or if I'll
>> use a combination of this algorithms (e.g. sha1 and crc32 or sha2 and
>> md5 and crc32)?

Unless the block size is less than 16 bytes do not worry about collisions.
(If it is less than 16 bytes you will get collisions because the two blocks
are identical)



> If you use SHA-1, you won't have any collisions. If you use MD5, you
>won't have any collisions unless the data has been specially constructed
>to produce a collision.

>--Mike Amling
From: James Stroud on
Dmitry Chumack wrote:
> Hi *
>
> I have a question. There is few doesns of __terabytes__ of data. I need
> to split this data to blocks of fixed size. Than I need to calculate a
> sha1 hash for each block and store it. The question is which minimal
> size of block I have to choose to minimize the probability of existence
> of two equal hashes? What probability of such a coincidence would be if
> I'll use an md5, crc32 or sha2 algorithms instead of sha1? Or if I'll
> use a combination of this algorithms (e.g. sha1 and crc32 or sha2 and
> md5 and crc32)?
>
> Thanks in advance.
>

Christian Siebert alludes to this probability:

N!
p(n) = 1 - ------------
(N**n)(N-n)!


Where n is the number of blocks and N is the the number of possible
hashes in the hash algorithm (i.e. 2**L) where L is the length, in bits,
of the hash function.

A user-friendly derrivation is here:

http://en.wikipedia.org/wiki/Birthday_paradox

It has been suggested that one way you can double the length of your has
if your blocks get too large is

f(f(block) + block) + f(block)

where f is a hashing function--but of course the computational
efficiency of the hash decreases by >3.

James

--
James Stroud
UCLA-DOE Institute for Genomics and Proteomics
Box 951570
Los Angeles, CA 90095

http://www.jamesstroud.com/