Hash tables multiple keys per value...? [Lisp]

Prev: SBCL optimization
Next: what do lispers think of bigtable?

From: Duane Rettig on 18 Jan 2010 13:18

On Jan 18, 7:11 am, Robert Uhl <eadmun...(a)NOSPAMgmail.com> wrote:
> "Frode V. Fjeld" <fr...(a)netfonds.no> writes:
>
>
>
> > I.e. to hash over some combination of each key's sxhash (taking care not
> > to combine them so as to generate a bignum).
>
> Note that SXHASH doesn't care about anything more than the top level of
> a list:
>
> cl-user> (sxhash '(1 2 3 (4)))
> 224526978
> cl-user> (sxhash '(1 2 3 (4 5)))
> 224526978

Depends on your implementation. On Allegro CL:

CL-USER(1): (sxhash '(1 2 3 (4)))
4246036626
CL-USER(2): (sxhash '(1 2 3 (4 5)))
4247011474
CL-USER(3):

(other versions of Allegro CL will generate different hash codes,
depending on endianness and 32-vs-64 bit, but all will distinguish
this case)

The issue is one of hash-code generation speed vs test predicate
lookup speed. The hash code generation for a deeper list structure
does take a bit more time, but for heavily used hash tables the better
distribution of hash codes over the range of possible hash codes
results in fewer collisions and thus fewer invocations of the test
predicate. In an 'eq hash-table, this is not a savings. But in an
'equal hash-table, the test between '(1 2 3 (4)) and '(1 2 3 (4 5))
will have to descend into all levels and lengths anyway, so the hash-
code-generation savings is lost in an average of 50% of hashing
situations (depending on whether the lookup in the table got the
desired key or the other one). All in all, we've found that users
tend to need excellent hash-code distributions in their apps because
their hash-tables must scale, so we go to extra lengths to get the
hash-code as well-distributed as possible.

Duane

From: Duane Rettig on 18 Jan 2010 13:38

On Jan 18, 10:04 am, Robert Uhl <eadmun...(a)NOSPAMgmail.com> wrote:
> Madhu <enom...(a)meer.net> writes:
> > |
> > | Note that SXHASH doesn't care about anything more than the top level
> > | of a list:
> > |
> > | cl-user> (sxhash '(1 2 3 (4)))
> > | 224526978
> > | cl-user> (sxhash '(1 2 3 (4 5)))
> > | 224526978
>
> > I believe Frodef was suggesting you should do something like:
>
> > * (reduce 'logior '(1 2 3 (4)) :key #'sxhash)
> > 1031
>
> > * (reduce 'logior '(1 2 3 (4 5)) :key #'sxhash)
> > 526343

The problem with reducing on something like logior is that it is too
simple an operation, and results in many duplicate hash codes being
generated for given inputs. The key to hash tables [sic] is excellent
hash-code distribution.

> And yet:
>
> cl-user> (reduce 'logior '(1 2 3 (4 5 6 (6 7))) :key #'sxhash)
> 501984895
> cl-user> (reduce 'logior '(1 2 3 (4 5 6 (6 7 8))) :key #'sxhash)
> 501984895

This is precisely the problem; bits are lost by logior, by virtue of
being masked by other bits in the same place in the word. The
distribution becomes full of holes and clumps of values.

> The basic issue is that SXHASH really isn't the right tool for the
> job--the guarantees that the standard gives aren't good enough for
> generating a unique key, which is what's needed to key a hashtable.

This isn't really true, either. The precise results of sxhash is
implementation dependent, based on other rules so it can be the right
tool for the job. But neither is a unique hash necessary or possible;
(re: necessary): the intention of sxhash is to give a good
distribution to reduce collisions. If this is done well, sxhash has
done its job - (re: possible): the width of the hash-code will be
bounded, yet the possible combinations of key values are not as well
bounded. The best that can be done on a normal (not "perfect") hash
is to close all the holes and to use as many of them up as possible
before duplicating a hash-code.

> That makes sense, since SXHASH is meant to _hash_ a unique key, and
> collisions are expected.

This is true, but there's nothing wrong with collisions (duplicate
hash codes) since the hash-table implementation is expected to handle
collisions.

This is purely a performance issue, not a functional one.

Duane

From: Robert Uhl on 18 Jan 2010 14:18

Duane Rettig <duane(a)franz.com> writes:
>
>> That makes sense, since SXHASH is meant to _hash_ a unique key, and
>> collisions are expected.
>
> This is true, but there's nothing wrong with collisions (duplicate
> hash codes) since the hash-table implementation is expected to handle
> collisions.

IIRC the original suggested case was to use SXHASH to generate hash
keys, which is pretty broken, but will seem to work for fairly small
hash tables.

If I recall incorrectly: doh!

--
Robert A. Uhl
To refuse awards is another way of accepting them with more noise than is
normal. --Sir Peter Ustinov

From: Ray on 26 Jan 2010 15:23

piscesboy wrote:

> Is it possible to map more than one key per hash table value? For
> example, if I create a hash table and I want to use a name and a date
> to identify a single value in the table, that's two keys that I want
> to store per entry in the table that I want to use to uniquely
> identify an entry. Because one entry can share a single date, and in
> that case, the second identifier, the name can be used to uniquely
> identify the entry.

Um, sure. If you want to use two fields to make a hash key,
that's easy. It's still a single hash key.

If you want a system where you can look up an article by *either*
key, you just want two hash tables, both with pointers to the
main object.

What was the question again? How is this different from one
of these two (trivial) solutions?

Bear

First | Prev |
Pages: 1 2 3 4 5 6
Prev: SBCL optimization
Next: what do lispers think of bigtable?