From: Paavo Helde

I am looking for a decent algorithm for calculating some kind of hash of
a container, making use of the hashes of its components, i.e. calculating
a composite hash from the hashes of sub-parts. We are using it just for
detecting changed data parts, so crypto strength is not actually needed.
So far I have thought of the following approaches:

* CRC32. There is a crc32_combine() function helpfully exposed from the
zlib library. However, I am a bit worried about the collisions. The CRC32
algorithm was designed basically for detecting rare bit changes AFAIK, but
in our case most of the data content changes each time, and the data sets
are quite large (up to hundreds of megabytes), meaning that the
"information compression rate" by reducing it to 32 bits is also quite
large. Am I wrong to worry about the collisions?
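
For reference, the kind of combination I have in mind looks roughly like
this (a minimal sketch against zlib's crc32() and crc32_combine(); the
buffer names are just for illustration):

    #include <zlib.h>
    #include <cstdio>
    #include <cstring>

    int main() {
        const char *part1 = "first data part";
        const char *part2 = "second data part";

        /* CRC32 of each part separately. */
        uLong crc1 = crc32(0L, (const Bytef *)part1, (uInt)strlen(part1));
        uLong crc2 = crc32(0L, (const Bytef *)part2, (uInt)strlen(part2));

        /* Combine; crc32_combine() takes the length of the second block. */
        uLong combined = crc32_combine(crc1, crc2, (z_off_t)strlen(part2));

        /* Sanity check: must equal the CRC32 of the concatenated data. */
        uLong whole = crc32(crc1, (const Bytef *)part2, (uInt)strlen(part2));
        printf("combined=%08lx whole=%08lx\n", combined, whole);
        return 0;
    }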

* MD5, re-hashing the MD5 hashes of the components to obtain the hash of
the container. If the MD5 hashes are uniformly distributed, the outcome
should be no worse than calculating the whole MD5 in one go, right? Aside
from that, maybe MD5 is overkill for our purposes (we are doing all this
to increase performance in the first place).
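
And the re-hashing idea in code, roughly (a sketch only; it assumes
OpenSSL's one-shot MD5() function, and the helper names are mine):

    #include <openssl/md5.h>
    #include <string>
    #include <vector>

    /* MD5 of a byte string, returned as 16 raw digest bytes. */
    static std::string md5_of(const std::string &data)
    {
        unsigned char digest[MD5_DIGEST_LENGTH];
        MD5((const unsigned char *)data.data(), data.size(), digest);
        return std::string((const char *)digest, MD5_DIGEST_LENGTH);
    }

    /* Composite hash: MD5 over the concatenated component hashes. */
    static std::string composite_md5(const std::vector<std::string> &parts)
    {
        std::string concat;
        for (size_t i = 0; i < parts.size(); ++i)
            concat += md5_of(parts[i]);
        return md5_of(concat);
    }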

Are there any alternatives? What would you suggest?

Thanks,
Paavo
From: David Eather
Paavo Helde wrote:
> I am looking for a decent algorithm for calculating some kind of hash of
> a container, making use of the hashes of its components, i.e. calculating
> a composite hash from the hashes of sub-parts. We are using it just for
> detecting changed data parts, so crypto strength is not actually needed.
> So far I have thought of the following approaches:
>
> * CRC32. There is a crc32_combine() function helpfully exposed from the
> zlib library. However, I am a bit worried about the collisions. The CRC32
> algorithm was designed basically for detecting rare bit changes AFAIK, but
> in our case most of the data content changes each time, and the data sets
> are quite large (up to hundreds of megabytes), meaning that the
> "information compression rate" by reducing it to 32 bits is also quite
> large. Am I wrong to worry about the collisions?

You would expect a collision once you have hashed around 64 thousand
items with CRC-32: by the birthday bound, the 50% collision point for a
32-bit hash lies near 2^16 inputs.
>
> * MD5, re-hashing the MD5 hashes of the components to obtain the hash of
> the container. If the MD5 hashes are uniformly distributed, the outcome
> should be no worse than calculating the whole MD5 in one go, right?

Right.

> Aside
> from that, maybe MD5 is overkill for our purposes (we are doing all this
> to increase performance in the first place).
>
> Are there any alternatives? What would you suggest?
>
> Thanks,
> Paavo

Use MD4: it is faster than MD5 and in some cases faster than CRC-32. The
128-bit output is much more collision resistant than the 32 bits of
CRC-32.
From: unruh
On 2010-01-26, Paavo Helde <myfirstname(a)osa.pri.ee> wrote:
>
> I am looking for a decent algorithm for calculating some kind of hash of
> a container, making use of the hashes of its components, i.e. calculating
> a composite hash from the hashes of sub-parts. We are using it just for
> detecting changed data parts, so crypto strength is not actually needed.
> So far I have thought of the following approaches:
>
> * CRC32. There is a crc32_combine() function helpfully exposed from the
> zlib library. However, I am a bit worried about the collisions. The CRC32
> algorithm was designed basically for detecting rare bit changes AFAIK, but
> in our case most of the data content changes each time, and the data sets
> are quite large (up to hundreds of megabytes), meaning that the
> "information compression rate" by reducing it to 32 bits is also quite
> large. Am I wrong to worry about the collisions?
>
> * MD5, re-hashing the MD5 hashes of the components to obtain the hash of
> the container. If the MD5 hashes are uniformly distributed, the outcome
> should be no worse than calculating the whole MD5 in one go, right? Aside
> from that, maybe MD5 is overkill for our purposes (we are doing all this
> to increase performance in the first place).

MD4 is apparently faster, but it is cryptographically weak (stronger than
CRC, of course). Look at rsync and BitTorrent, which do what you want to
do; they have spent a lot of time thinking about the issue.


>
> Are there any alternatives? What would you suggest?
>
> Thanks,
> Paavo
From: Joseph Ashwood
"Paavo Helde" <myfirstname(a)osa.pri.ee> wrote in message
news:Xns9D0CCFDC13C1Epaavo256(a)216.196.109.131...
> I am looking for a decent algorithm for calculating some kind of hash of
> a container, making use of the hashes of its components, i.e. calculating
> a composite hash from the hashes of sub-parts. We are using it just for
> detecting changed data parts, so crypto strength is not actually needed.
> So far I have thought of the following approaches:

> Are there any alternatives? What would you suggest?

The structure you are looking for is called a Merkle tree; using it will
save you enormous amounts of time. The construction is also known as
library signing, and the proof behind it is quite solid. As for speed,
don't worry about it: the Merkle tree will speed up updates
significantly. Worry instead about the fact that someone will eventually
attack the system, and when that happens you'll have to hire someone like
me to dig through all the data, sort out the problems, find solutions,
and send you a very large bill for my time. So my first recommendation
is: if you are targeting only 32-bit systems, use SHA-256; otherwise use
SHA-512.

Strongly formatted input is one of the most important things; if in
doubt, include the data. Assuming you have multiple files in the system,
include the file name and path in the metadata that gets hashed, include
the file length, include the update date and time, include who updated it
and who approved the update, include everything.

Building a Merkle tree is actually quite easy. I'll use the standard
library metaphor (obviously, substitute your own structure); a sketch in
code follows the steps below.

Hash all the pages individually, giving the pages[] table of hashes.
Hash all the pages in each book into the books[] table of hashes.
Hash all the books on each shelf into the shelves[] table of hashes.
Hash all the shelves in each bookcase into the cases[] table of hashes.
Hash all the cases on each aisle into the aisles[] table of hashes.
Hash all the aisles in a room into the rooms[] table of hashes.
Hash all the rooms in a library into the final library hash.
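
As a rough sketch in C++ (the Node layout and helper names here are
illustrative, and I'm using OpenSSL's one-shot SHA256() as the hash;
substitute whatever fits your system):

    #include <openssl/sha.h>
    #include <string>
    #include <vector>

    /* One node per page/book/shelf/case/aisle/room; leaf data should
       include the metadata described above (name, length, dates, ...). */
    struct Node {
        std::string data;            /* leaf payload; empty for inner nodes */
        std::vector<Node> children;  /* sub-parts */
        std::string hash;            /* cached digest */
    };

    static std::string sha256_of(const std::string &bytes)
    {
        unsigned char d[SHA256_DIGEST_LENGTH];
        SHA256((const unsigned char *)bytes.data(), bytes.size(), d);
        return std::string((const char *)d, SHA256_DIGEST_LENGTH);
    }

    /* Recursively compute and cache every hash: a node's hash is the
       hash of its children's hashes concatenated in order. */
    static std::string build(Node &n)
    {
        if (n.children.empty()) {
            n.hash = sha256_of(n.data);
        } else {
            std::string concat;
            for (size_t i = 0; i < n.children.size(); ++i)
                concat += build(n.children[i]);
            n.hash = sha256_of(concat);
        }
        return n.hash;
    }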

Store all the hashes for later use. When a page is replaced:
generate the new pages[] entry by hashing the new page, replacing the old;
generate the new books[] entry by hashing only the pages in the updated
book, replacing the old;
generate the new shelves[] entry by hashing only the books on the updated
shelf, replacing the old;
generate the new cases[] entry by hashing only the shelves in the updated
case, replacing the old;
generate the new aisles[] entry by hashing only the cases on the updated
aisle, replacing the old;
generate the new rooms[] entry by hashing only the aisles in the updated
room, replacing the old;
hash all the rooms in the library into the final library hash.

The important change is that only a small portion of the hashes has to be
recomputed: instead of hashing GBs of data, the system can be designed to
hash only a few KB.
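
In code the update path looks roughly like this (continuing the Node and
sha256_of() sketch above; 'path' is a hypothetical list of child indices
leading from the root to the changed leaf):

    /* Replace one leaf's data and re-hash only its ancestors. */
    static void update(Node &root, const std::vector<size_t> &path,
                       const std::string &new_data)
    {
        /* Walk down to the leaf, remembering the nodes visited. */
        std::vector<Node *> trail;
        Node *n = &root;
        trail.push_back(n);
        for (size_t i = 0; i < path.size(); ++i) {
            n = &n->children[path[i]];
            trail.push_back(n);
        }

        n->data = new_data;
        n->hash = sha256_of(new_data);

        /* Re-hash each ancestor bottom-up from its children's cached
           hashes; everything off the path keeps its old hash. */
        for (size_t i = trail.size() - 1; i-- > 0; ) {
            std::string concat;
            for (size_t c = 0; c < trail[i]->children.size(); ++c)
                concat += trail[i]->children[c].hash;
            trail[i]->hash = sha256_of(concat);
        }
    }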

To add data:
hash the new page and add it to the list of hashes for the updated book;
then follow the update method from there.

You'll want to actually implement this as a tree structure in memory, and
use recursion in the implementation; it will save you a lot of time and
effort. Store the entire Merkle tree and just extract the hashes you need
afterward. Walking down the tree, the system will easily be able to
determine exactly what needs to be downloaded.
Joe

From: Paavo Helde
"Joseph Ashwood" <ashwood(a)msn.com> wrote in
news:vbL7n.20613$mT7.4828(a)newsfe05.iad:

> "Paavo Helde" <myfirstname(a)osa.pri.ee> wrote in message
> news:Xns9D0CCFDC13C1Epaavo256(a)216.196.109.131...
>> I am looking for a decent algorithm for calculating some kind of hash
>> of a container, making use of the hashes of its components, i.e.
>> calculating a composite hash from the hashes of sub-parts. We are using
>> it just for detecting changed data parts, so crypto strength is not
>> actually needed. So far I have thought of the following approaches:
>
>> Are there any alternatives? What would you suggest?
>
> The structure you are looking for is called a Merkle tree, using it
> will save you enormous amounts of time. Also known as library signing,

It seems a Merkle tree is a kind of hierarchical structure, which we
already have, so the only missing parts are choosing the hash type and
deciding how to combine the hashes. As already said, cryptographic
resistance is not required, so we will probably go with MD4, as suggested
by other respondents.

> the proof is quite solid. As for the speed, don't worry about it, the
> Merkle tree will increase speeds significantly for updates. Instead
> worry about the fact that someone will eventually attack the system,

The most a would-be attacker could achieve would be to damage his own
data analysis results, so I would not worry about that too much.

[...]
> Walking down the tree the system will be easily able to determine
> exactly what needs to be downloaded.

There is no download involved here; you are assuming too much.

Paavo