From: Joseph Ashwood on
"Dmitry Chumack" <saint.d.a(a)gmail.com> wrote in message
news:1165339092.911172.123630(a)79g2000cws.googlegroups.com...
> Hi *
>
> I have a question. There are a few dozen __terabytes__ of data. I need
> to split this data into blocks of fixed size. Then I need to calculate a
> sha1 hash for each block and store it. The question is which minimal
> size of block I have to choose to minimize the probability of existence
> of two equal hashes? What probability of such a coincidence would be if
> I'll use an md5, crc32 or sha2 algorithms instead of sha1? Or if I'll
> use a combination of these algorithms (e.g. sha1 and crc32 or sha2 and
> md5 and crc32)?

Since you're talking about 2 generated values colliding, for a cryptographic
hash it takes on average about sqrt(2^(output size)) = 2^(output size / 2)
hashes before you collide: for MD5 that's about 2^64, and for SHA-1 about
2^80. Considering the size of your data pile, you'll probably see equal
hashes sooner than that, simply because there is almost certainly duplicate data.
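
(If you want to sanity-check those birthday-bound figures yourself, a tiny
Python sketch like the one below will do it. None of this is from any
particular library, and the block count of 2*10^9 is just an example,
roughly 1PB split into 512KB blocks.)

    # Birthday bound: among k random n-bit hash values the collision
    # probability is roughly k^2 / 2^(n+1), so the first collision is
    # expected around k ~ 2^(n/2) values.
    def collision_probability(k, n_bits):
        # Values above 1 simply mean a collision is essentially certain.
        return k * k / 2.0 ** (n_bits + 1)

    k = 2 * 10**9   # ~1 PB split into 512 KB blocks (assumed figure)
    for name, bits in [("CRC32", 32), ("MD5", 128), ("SHA-1", 160), ("SHA-512", 512)]:
        print("%-8s %g" % (name, collision_probability(k, bits)))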

Here's how I'd do it, and I'd include encryption as well:
Encrypt everything with AES in CTR mode; the odds of a collision in the data
are then very small, about 1/16777216 for a petabyte of data. Then I'd use
SHA512 with 512KB chunks and build up a Merkle tree, which will save you
enormous time later on in figuring out what was changed or damaged. Assuming
1PB, this gives you 2*10^9 leaf blocks; then, holding to the 512KB chunk size,
you can store ~8000 of their hashes per chunk with proper indexing (this will be shy
of the 8192 apparent maximum if you implement the indexing that is necessary
for security). This gives 268435 chunks, building a third level of 34
chunks, and a root/top/fourth level that is tiny. In total that's about
2*10^9 chunks, and 2*10^9 << 2^256 (~10^77), which is the birthday limit for
SHA512. Each of these chunks will fit
conveniently on a floppy disk, 1200 of them to a CD, and I don't even want
to do the math on the other options. On a system with a modern CPU you should
be able to achieve this in about a month of compute time (assuming a quad
core; scale appropriately). Tom's benchmark website
http://libtomcrypt.com:80/ltc113.html has more information to help you run
the numbers. The overall odds of a collision in this are in the neighborhood
of 1/10^68, which is quite frankly not gonna happen.
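
(For what it's worth, here is a minimal Python sketch of such a tree.
hashlib.sha512 is the real thing; the 512KB chunk size, the 8000-way
fan-out and the way child digests are simply concatenated are assumptions
for illustration, not necessarily the exact indexing Joe has in mind.)

    import hashlib

    CHUNK = 512 * 1024      # 512 KB leaf blocks (assumed size)
    FANOUT = 8000           # child digests per interior node (Joe's figure)

    def leaf_hashes(path):
        # SHA-512 of every 512 KB block of the file.
        with open(path, "rb") as f:
            while True:
                block = f.read(CHUNK)
                if not block:
                    break
                yield hashlib.sha512(block).digest()

    def next_level(digests):
        # Hash groups of FANOUT child digests into one parent digest.
        group = []
        for d in digests:
            group.append(d)
            if len(group) == FANOUT:
                yield hashlib.sha512(b"".join(group)).digest()
                group = []
        if group:
            yield hashlib.sha512(b"".join(group)).digest()

    def merkle_root(path):
        level = list(leaf_hashes(path))
        while len(level) > 1:
            level = list(next_level(level))
        return level[0] if level else hashlib.sha512(b"").digest()

The root digest then vouches for the whole pile, and finding out what changed
or was damaged only means re-walking the affected paths instead of rehashing
everything.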

It will be reasonably secure though, so don't lose the AES key. Without the
AES encryption the odds of collision rise sharply, but assuming your data
cannot be compressed beyond a reasonable amount they are still ignorable; and
if two blocks do collide you don't have to ship one of them, because they are
identical.
Joe


From: Mike Amling on
Unruh wrote:
> Mike Amling <spamonly(a)allspam.com> writes:
>
>> Dmitry Chumack wrote:
>>> Hi *
>>>
>>> I have a question. There are a few dozen __terabytes__ of data. I need
>>> to split this data into blocks of fixed size. Then I need to calculate a
>>> sha1 hash for each block and store it. The question is which minimal
>>> size of block I have to choose to minimize the probability of existence
>>> of two equal hashes? What probability of such a coincidence would be if
>>> I'll use an md5, crc32 or sha2 algorithms instead of sha1? Or if I'll
>>> use a combination of these algorithms (e.g. sha1 and crc32 or sha2 and
>>> md5 and crc32)?
>
> Unless the block size is less than 16 bytes do not worry about collisions.
> (If it is less than 16 bytes you will get collisions because the two blocks
> are identical)

?? Two identical inputs producing identical outputs is not regarded
as a collision. Collisions are harder to find than that.

And identical inputs can occur regardless of whether the block size
is less than 16 bytes.

--Mike Amling
From: Luc The Perverse on
"Mike Amling" <spamonly(a)allspam.com> wrote in message
news:el55do$3rn(a)dispatch.concentric.net...
> Unruh wrote:
>> Mike Amling <spamonly(a)allspam.com> writes:
>>
>>> Dmitry Chumack wrote:
>>>> Hi *
>>>>
>>>> I have a question. There are a few dozen __terabytes__ of data. I need
>>>> to split this data into blocks of fixed size. Then I need to calculate a
>>>> sha1 hash for each block and store it. The question is which minimal
>>>> size of block I have to choose to minimize the probability of existence
>>>> of two equal hashes? What probability of such a coincidence would be if
>>>> I'll use an md5, crc32 or sha2 algorithms instead of sha1? Or if I'll
>>>> use a combination of these algorithms (e.g. sha1 and crc32 or sha2 and
>>>> md5 and crc32)?
>>
>> Unless the block size is less than 16 bytes do not worry about
>> collisions. (If it is less than 16 bytes you will get collisions because
>> the two blocks
>> are identical)
>
> ?? Two identical inputs producing identical outputs is not regarded as a
> collision. Collisions are harder to find than that.
>
> And identical inputs can occur regardless of whether the block size is
> less than 16 bytes.

I interpreted it as meaning he didn't want any non-equal blocks to produce
collisions. Obviously, if you have a spot with all zeros, there is going to
be a collision (or rather, two equal blocks).

--
LTP

:)


From: Sergei on
Joseph Ashwood wrote:
> "Dmitry Chumack" <saint.d.a(a)gmail.com> wrote in message
> news:1165339092.911172.123630(a)79g2000cws.googlegroups.com...
> > Hi *
> >
> > I have a question. There are a few dozen __terabytes__ of data. I need
> > to split this data into blocks of fixed size. Then I need to calculate a
> > sha1 hash for each block and store it. The question is which minimal
> > size of block I have to choose to minimize the probability of existence
> > of two equal hashes? What probability of such a coincidence would be if
> > I'll use an md5, crc32 or sha2 algorithms instead of sha1? Or if I'll
> > use a combination of these algorithms (e.g. sha1 and crc32 or sha2 and
> > md5 and crc32)?
>
> Since you're talking about 2 generated values colliding, for a cryptographic
> hash it takes on average about sqrt(2^(output size)) = 2^(output size / 2)
> hashes before you collide: for MD5 that's about 2^64, and for SHA-1 about
> 2^80. Considering the size of your data pile, you'll probably see equal
> hashes sooner than that, simply because there is almost certainly duplicate data.
[...]

How do you get such estimates for MD5 (2^64) and SHA-1 (2^80)? As I
understand it, such numbers can be given for truly random hash functions,
while MD5 and SHA are just algorithms that try to emulate such functions.

Sergei

From: Volker Hetzer on
Sergei schrieb:
> Joseph Ashwood wrote:
>> "Dmitry Chumack" <saint.d.a(a)gmail.com> wrote in message
>> news:1165339092.911172.123630(a)79g2000cws.googlegroups.com...
>>> Hi *
>>>
>>> I have a question. There are a few dozen __terabytes__ of data. I need
>>> to split this data into blocks of fixed size.
How big are the blocks?

>>> Then I need to calculate a
>>> sha1 hash for each block and store it. The question is which minimal
>>> size of block I have to choose to minimize the probability of existence
>>> of two equal hashes?
The probability is minimal when the hash is equal (in value) to the block.
Other than that, if the hash is statistically good, the longer the hash, the
lower the collision probability.
Of course, CPU time for search increases linearly with the hash length.

Btw, what do you want to do with the whole thing?
Fast matches? Fast searches? Anything cryptographic?
If it's just for matches and the blocks aren't very large, it's probably
cheaper in terms of CPU time to do a crc32 and compare the data directly in
the few cases where the hashes match.
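
(Roughly like this in Python, say; zlib.crc32 is the real call, and the rest
is just an illustration of "use the hash as a cheap filter, compare bytes to
confirm", not a specific recommendation.)

    import zlib

    def find_equal_blocks(blocks):
        # Group blocks by crc32; a shared CRC only means "possibly equal".
        index = {}
        for i, block in enumerate(blocks):
            index.setdefault(zlib.crc32(block), []).append(i)
        # Confirm candidates by comparing the actual bytes.
        equal = []
        for ids in index.values():
            for a in range(len(ids)):
                for b in range(a + 1, len(ids)):
                    if blocks[ids[a]] == blocks[ids[b]]:
                        equal.append((ids[a], ids[b]))
        return equal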


>>> What probability of such a coincidence would be if
>>> I'll use an md5, crc32 or sha2 algorithms instead of sha1? Or if I'll
>>> use a combination of these algorithms (e.g. sha1 and crc32 or sha2 and
>>> md5 and crc32)?
>> Since you're talking about 2 generated values colliding, for a cryptographic
>> hash it takes on average about sqrt(2^(output size)) = 2^(output size / 2)
>> hashes before you collide: for MD5 that's about 2^64, and for SHA-1 about
>> 2^80. Considering the size of your data pile, you'll probably see equal
>> hashes sooner than that, simply because there is almost certainly duplicate data.
[...]
> How do you get such estimates for MD5 (2^64) and SHA-1 (2^80)? As I
> understand it, such numbers can be given for truly random hash functions,
> while MD5 and SHA are just algorithms that try to emulate such functions.
If we knew that they behaved differently, we'd be designing different
hashes.
Consider MD5 and SHA-1 close enough to the ideal that such subtleties
don't matter.
Btw, MD5 is "cryptographically broken" in the sense that you can construct 2
blocks that have equal hashes, but it doesn't change the statistics very
much. However, since you are seriously considering crc32, the "broken" state of
MD5 is of no concern to you.
The things that matter for you (IMHO) are:
- How much (cpu cycles) does it cost to compute the hash?
- How much does it cost to compare two hashes?
- How much does it cost to compare two blocks? (Is I/O time relevant too?)
- How often do two blocks have to be compared?
- How often do two hashes have to be compared?
- How often do you have to compute hashes (i.e. how often does your data
change)?
- Where do you get the hash from that you use to match the data blocks
against? Does it cost extra?
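
(For the first of those questions, a throwaway timing sketch along these
lines gives rough numbers on your own hardware; the 512KB test block is
only an example, not a recommendation.)

    import hashlib, os, time, zlib

    def seconds(fn, data, reps=50):
        # Wall-clock time to hash `data` `reps` times with `fn`.
        start = time.perf_counter()
        for _ in range(reps):
            fn(data)
        return time.perf_counter() - start

    data = os.urandom(512 * 1024)   # one 512 KB test block (assumed size)
    print("crc32 ", seconds(lambda d: zlib.crc32(d), data))
    print("md5   ", seconds(lambda d: hashlib.md5(d).digest(), data))
    print("sha1  ", seconds(lambda d: hashlib.sha1(d).digest(), data))
    print("sha512", seconds(lambda d: hashlib.sha512(d).digest(), data))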

With a couple of days of collecting performance data and playing with
Excel, you should be able to choose the optimum hash. By the way, it's
also possible to calculate the full MD5 hash and then store only a few of
its bits, if that speeds up your searches. That makes sense only for hash
sizes larger than the crc hashes.
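
(For example; the 4-byte truncation below is arbitrary, purely to show
keeping part of the digest as the stored lookup key.)

    import hashlib

    def short_md5(block, keep=4):
        # Compute the full MD5 but store only the first `keep` bytes as the
        # key; on a key match, fall back to comparing the full blocks.
        return hashlib.md5(block).digest()[:keep]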

Lots of Greetings!
Volker
--
For email replies, please substitute the obvious.