From: guillaume on
My question is related to the discussion "about the hashtable and
equals" where I learnt a few interesting things.

In my company we deal with large volumes of data indexed by unique
integer ids (for background, we work with road traffic data, and the
road network is modelled as a set of edges representing road
sections). The edge ids run from 0 to the total number of edges
(i.e. around 20 million for the French network). However, we often
deal with only a subset of the road network (100,000, 1 million or
10 million ids), spread uniformly between 0 and 20 million.

We usually use a C# Dictionary with integer keys to store any
per-edge data (road section characteristics, archived data, ...), and
it scales very well up to several million ids. However, using a
generic hashing algorithm designed for 32-bit integers (potentially
ranging from -2.14 x 10^9 to +2.14 x 10^9) when our keys cover such a
small, dense range has always struck me as overkill.

One optimization I came up with is simply using a dense array of the
maximal size, where the key of each object is in fact its index into
that array. It works pretty well: for example, when initializing a
dictionary with 10 million keys chosen randomly between 0 and 20
million, we save more than 200 MB by not using a 10-million-entry
key/value Dictionary, even though our array actually holds 20 million
slots. In addition, even though Dictionary access time is
theoretically close to O(1), in my tests random key access was on
average twice as fast with the array. Memory-wise, this approach is
also very good when I want to access read-only data modelled as a C#
struct, since I can pre-allocate one big array of structs and avoid
all the per-object overhead of C# reference types (object headers, GC
tracking and so on...). I even pushed the idea further by
implementing a generic ArrayDictionary<int,T> class that implements
the C# IDictionary<int,T> interface, so that I can switch between the
two implementations at runtime.
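
The core of the idea is roughly this (a simplified sketch; the real
class also implements the rest of the IDictionary<int,T> interface):

// Simplified sketch of the dense-array idea: the key is the index.
// A parallel bool[] marks which slots are actually in use.
public class ArrayDictionary<T>
{
    private readonly T[] values;
    private readonly bool[] present;

    public ArrayDictionary(int maxId)      // maxId ~ 20 million here
    {
        values = new T[maxId];
        present = new bool[maxId];
    }

    public void Add(int id, T value)
    {
        values[id] = value;
        present[id] = true;
    }

    public bool TryGetValue(int id, out T value)
    {
        if (id >= 0 && id < values.Length && present[id])
        {
            value = values[id];
            return true;
        }
        value = default(T);
        return false;
    }
}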

So, all in all, my questions are:
- What hashing algorithm does the C# Dictionary use? Is it generic
over all types of data once a GetHashCode function is provided?
- Does my optimization sound reasonable, or should I have looked
further into the possibilities offered by the existing Dictionary
(using another hashing algorithm adapted to my needs)?

Sorry for the long message, and thank you if you read through it!

Guillaume
From: Rick Lones on
guillaume wrote:
> My question is related to the discussion "about the hashtable and
> equals" where I learnt a few interesting things.
<snip>

If your n keys are in fact each unique and span the range from 1 to n (or
similar) then I don't see how you can do better than a dense array, even
theoretically. A hash table helps you most when your keys are sparse and not
inherently ordered. As for the dictionary: I believe that the underlying
structure is a red-black tree, but that also will not perform as well, either
space- or time-wise, as a straight indexed lookup for a dense, well-ordered set
of keys.

The dictionary lookup would be O(log n), not O(1). Hash table access is O(1),
but with overhead to maintain the bucket structure and deal with collisions.
The array access should be O(1) with almost no overhead except for things
outside your control (e.g., paging) which affect the other methods as well.

> - What hashing algorithm does the C# Dictionary use? Is it generic
> over all types of data once a GetHashCode function is provided?

Where are you coming from with this question? Dictionaries don't hash their
keys; they use them as-is - they just have to be an IComparable type so they
can be properly placed within the tree. GetHashCode would have nothing to do
with dictionaries, AFAIK.

HTH,
-rick-
From: Jeroen Mostert on
On 2010-06-07 13:14, Rick Lones wrote:
<snip>
> As for the dictionary: I believe that the underlying structure is a
> red-black tree, but that also will not perform as well, either space- or
> time-wise, as a straight indexed lookup for a dense, well-ordered set of
> keys.
>
Careful. SortedDictionary<> uses a red-black tree; Dictionary<> implements a
hash table. The names of SortedDictionary and SortedList are poorly chosen,
but there you go.
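
You can see the difference in the constructors (a quick illustration;
both types live in System.Collections.Generic):

// Dictionary<,> is a hash table: keys go through
// IEqualityComparer<TKey> (GetHashCode + Equals).
var hashed = new Dictionary<int, string>(EqualityComparer<int>.Default);

// SortedDictionary<,> is a red-black tree: keys go through
// IComparer<TKey> (Compare); GetHashCode is never called.
var sorted = new SortedDictionary<int, string>(Comparer<int>.Default);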

--
J.
From: Rick Lones on
Jeroen Mostert wrote:
> On 2010-06-07 13:14, Rick Lones wrote:
> <snip>
>> As for the dictionary: I believe that the underlying structure is a
>> red-black tree, but that also will not perform as well, either space- or
>> time-wise, as a straight indexed lookup for a dense, well-ordered set of
>> keys.
>>
> Careful. SortedDictionary<> uses a red-black tree; Dictionary<>
> implements a hash table. The names of SortedDictionary and SortedList
> are poorly chosen, but there you go.
>

Yes, thank you. SortedDictionary was the one I meant. And I was assuming the
OP also meant that, maybe incorrectly, because it would be the best case for him.

-rick-
From: Arne Vajhøj on
On 07-06-2010 05:10, guillaume wrote:
> My question is related to the discussion "about the hashtable and
> equals" where I learnt a few interesting things.
<snip>
> So, all in all, my questions are:
> - What hashing algorithm does the C# Dictionary use? Is it generic
> over all types of data once a GetHashCode function is provided?

I think so.

But you can check the source code.

Either get it from MS or simply use Reflector.
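
For int keys there is not much magic to find: Int32.GetHashCode()
simply returns the value itself, and Dictionary<,> reduces that hash
to a bucket index. A quick check (small sketch):

using System;

class HashCheck
{
    static void Main()
    {
        // Int32.GetHashCode() returns the value itself,
        // so integer keys come pre-hashed for free.
        Console.WriteLine(42.GetHashCode());    // prints 42
        Console.WriteLine((-7).GetHashCode());  // prints -7

        // Dictionary<,> then maps the hash to a bucket,
        // roughly: bucket = (hash & 0x7FFFFFFF) % buckets.Length
    }
}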

> - Does my optimization sound reasonable, or should I have looked
> further into the possibilities offered by the existing Dictionary
> (using another hashing algorithm adapted to my needs)?

Nothing will beat direct indexing into an array.

Dictionaries are intended to be a solution for cases where that is
not possible.
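
A rough way to see the difference yourself (just a sketch; exact
numbers will depend on your machine and data):

using System;
using System.Collections.Generic;
using System.Diagnostics;

class LookupBench
{
    static void Main()
    {
        const int n = 10000000;                 // 10 million ids
        int[] array = new int[n];
        var dict = new Dictionary<int, int>(n);
        for (int i = 0; i < n; i++)
        {
            array[i] = i;
            dict[i] = i;
        }

        var rnd = new Random(0);
        int[] keys = new int[1000000];          // 1 million random lookups
        for (int i = 0; i < keys.Length; i++)
            keys[i] = rnd.Next(n);

        long sum = 0;
        var sw = Stopwatch.StartNew();
        foreach (int k in keys) sum += array[k];
        Console.WriteLine("array: " + sw.ElapsedMilliseconds + " ms");

        sw = Stopwatch.StartNew();
        foreach (int k in keys) sum += dict[k];
        Console.WriteLine("dict:  " + sw.ElapsedMilliseconds + " ms");

        Console.WriteLine(sum);                 // prevent dead-code elimination
    }
}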

Arne