From: Johannes Baagoe on
Some time ago, I proposed a hash function for javascript, with the
warning that it was not extensively tested, and that you might succeed
in proving me wrong in the claim that "it distributes *any* data over
*any* number of buckets in a suitably pseudo-random way".

It doesn't, I'm sorry to say. A minimal sanity test is to hash all
the words in a largish dictionary, and count collisions. Take for
example the /usr/share/dict/british-english-insane wordlist (it is
the `wbritish-insane` package on Ubuntu). Version 6.3 has 638 286
words. For 32-bit hash values, one would expect in the neighbourhood of
47 collisions, say, plus or minus a dozen or so. (The formula is here:
http://en.wikipedia.org/wiki/Birthday_problem#Collision_counting)
Anything above 70 is definitely too large. The function I proposed
fails miserably - it has 1079 collisions. Don't use it.
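
For what it's worth, here is the back-of-the-envelope computation
behind that figure - a quick sketch using the usual approximation of
the collision-counting formula, with the numbers above:

var n = 638286;            // words in the wordlist
var d = Math.pow(2, 32);   // possible 32-bit hash values

// Expected number of collisions when n is much smaller than d:
// roughly n * n / (2 * d), per the collision-counting formula.
var expected = n * n / (2 * d);
console.log(Math.round(expected)); // 47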

How do I know? Because I wrote a simple function that counts collisions
in any kind of disk file, for whatever hash function one supplies:


var FileIO = require('./FileIO').FileIO; // defined below, in FileIO.js

function collisions(file, hash) {
    var seen = {};         // hash values already encountered
    var nCollisions = 0;

    var words = FileIO(file, 'r', 'utf-8');

    while (!words.eof()) {
        var word = words.getLine();
        var h = hash(word);

        if (seen[h]) {
            nCollisions++;
        } else {
            seen[h] = true;
        }
    }

    return nCollisions;
}
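
By way of illustration, it can be run like this; the hash below is
only a placeholder (a simple multiplicative hash reduced to an
unsigned 32-bit value), not the function from my earlier post:

// Placeholder hash, for illustration only.
function exampleHash(word) {
    var h = 5381;
    for (var i = 0; i < word.length; i++) {
        h = (h * 33 + word.charCodeAt(i)) >>> 0;
    }
    return h;
}

console.log(collisions('/usr/share/dict/british-english-insane', exampleHash));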

The `collisions` function itself is very straightforward, *except* for
the FileIO function. (I gave it a capital initial, since it returns an
Object. It is not a proper javascript constructor, though, since it
doesn't return `this` and indeed doesn't use prototypes at all. I never
do any more; I have yet to be convinced that the entire concept is not
a misguided attempt to force round pegs into square holes. Convince me
if you can.)
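
(For contrast, the prototype-based shape I keep avoiding would look
roughly like this; the names are made up for the example:)

// A "proper" constructor: state on `this`, methods shared through the
// prototype, and `new` required at every call site.
function Reader(fileName) {
    this.fileName = fileName;
    this.offset = 0;
}
Reader.prototype.rewind = function() {
    this.offset = 0;
};

var r = new Reader('words.txt'); // without `new`, `this` would be the global object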

That is where node.js comes in. FileIO is defined like this, in its
own FileIO.js file:

var fs = require('fs');

exports.FileIO = function(fileName, mode, encoding) {
    var fd = fs.openSync(fileName, mode);
    var offset = 0;

    // node's legacy string interface to fs.readSync: with an encoding,
    // it returns [data, bytesRead] for up to 1024 bytes from `offset`.
    function read() {
        return fs.readSync(fd, 1024, offset, encoding);
    }

    return {
        eof: function() {
            return read()[1] <= 0;
        },
        getLine: function() {
            var line = read()[0].split('\n')[0];
            offset += line.length + 1;
            return line;
        },
        rewind: function() {
            offset = 0;
        }
    };
};

It is far from perfect: no error checking, synchronous operation only,
bare minimum functionality, wasteful calls to `read`, etc. But it
gets the job done. Suggestions for improvement are welcome, as are
other comments.
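
One obvious improvement I can already see is to read the file once
and serve lines from memory, which removes the repeated reads at the
cost of holding the whole file there. A rough sketch, keeping the same
interface (and assuming the file fits comfortably in memory):

var fs = require('fs');

exports.FileIO = function(fileName, mode, encoding) {
    // `mode` is kept only so the interface stays the same;
    // readFileSync always opens the file for reading.
    var lines = fs.readFileSync(fileName, encoding).split('\n');
    if (lines[lines.length - 1] === '') {
        lines.pop(); // drop the empty entry after a trailing newline
    }
    var index = 0;

    return {
        eof: function() {
            return index >= lines.length;
        },
        getLine: function() {
            return lines[index++];
        },
        rewind: function() {
            index = 0;
        }
    };
};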

--
Johannes