Hash Functions
External
- https://www.coursera.org/learn/algorithms-graphs-data-structures/lecture/Ob0K7/hash-tables-implementation-details-part-i
- https://www.coursera.org/learn/algorithms-graphs-data-structures/lecture/1Y8t1/hash-tables-implementation-details-part-ii
Internal
Overview
A hash function h() maps the universe U of keys to the slots of the hash table T[0 ... n-1], where the size n of the hash table is typically much smaller than the size of U:
h:U → {0, 1, ..., n-1}
The hash function should be easy to compute, in constant time. It should "spread out" most of the data sets, at least the non-pathological data sets. Also, the has function should be easy to store.
A hash functions must be deterministic, in that a given input k must always produce the same output h(k). A well designed hash function must also minimize the number of collisions. This can be achieved if each key is equally likely to hash to any of the n buckets, independently of where any other key has hashed to. This behavior is called simple uniform hashing. Typically we have no way to check this condition, as we rarely know the probability distribution from which the keys are drawn. Moreover, the key might not be drawn independently. A good approach derives the hash value in a way that we expect to be independent of any patterns that might exist in the data.
Collisions are handled by the hash table implementation with chaining or open addressing.
In case the hash table is implemented using chaining, the running times of all operations will be given by the length of the chaining lists, and the average length of those lists depends on the quality of the hash function: better the hash function spreads the keys over buckets, more balanced and short the chaining lists are. Equal-length linked lists are better than unequal linked lists, so we want the hash function to spread data the data out as equally as possible.
A similar situation is encountered in case the hash table is implemented with open addressing. The running time is governed by the length of the probe sequence: how many time you have to look in the hash table until a suitable empty bucket is found. The probe sequence is going to depend on the hash function. For good hash functions, that spread data evenly, the probe sequences will be relatively short. For a constant hash function, each new addition is a collision with increasingly long probe sequence.
In hashing, the standard in spreading data out is the performance of a completely random function. However, completely random hashing cannot be used for a very good reason: once a random hash has been used to store the key, we also need to retrieve the key, and this is not possible if we don't store the key - position association. If we do that, the solution devolves to a linked list, the lookup will be in linear time - we need to scan all the existing keys to find the position, negating the benefits of a hash table.
Since every operation includes a hash function evaluation, if we want each operation to run in constant time, the hash function evaluation must also run in constant time.
⚠️ It is really easy to inadvertently design bad has functions, and bad has functions lead to poor hash table performance.
Properties of a Good Hash Function
- Should have good performance - constant running time.
- Should be easy to store.
Example of a Reasonably Good Hash Function
In the first part you take an object that is not intrinsically numeric and turn it into an integer, possibly a large integer. This stage is called "computing the hash code". The, the hash code so computed must be mapped over the bucket range {0, 1, ..., n-1}. This stage is called the "compression function".
Compute the Hash Code
When the keys are natural numbers themselves, there is nothing more to be done. However, when the keys are strings, for example, they can be mapped to natural numbers by interpreting them as an integer expressed in a suitable radix notation: each character is expressed as its corresponding 0-127 ASCII code and the position of the character in the string dictates the power 128 is raised to. The result is a natural number, which can grow quite large for long strings. In the context of an application, we can usually devise some method of translating a key into a natural number. The Java Object's hashCode() is aimed to provide this functionality - mapping an object key to an integer numbers.
Compression Function
Hashing by Division
The division method computes the hash value as the remainder when the key value is divided by a specified prime number. This gives good results if we choose a prime number that is unrelated to any patterns in the distribution of keys. We assume that a non-natural number key has been mapped to a natural number using the approach described in the Compute the Hash Code section.
h(k) = k mod m
Hashing by Multiplication
The multiplication method computes the hash value in two steps:
- Multiply the k value with a constant in the range 0 < A < 1 and extract the fractional part of k⋅A.
- Multiply this value by m and take the floor of the result.
h(k) = ⌊m (k⋅A mod 1)⌋
Knuth suggests that A = (√5 - 1)/2 will work well.
The advantage of the multiplication method is that is not sensitive to the value of m, so m can be chosen to be a power of 2.
The most common compression function is % n (Modulus n).
Known Hash Functions
Universal Hashing
Pathological Data Sets
Every hash function has a data set for which it performs really bad - its "pathological data set".
More details universal hashing, pathological data sets, etc. in:
- https://www.coursera.org/learn/algorithms-graphs-data-structures/lecture/1SQo3/pathological-data-sets-and-universal-hashing-motivation
- https://www.coursera.org/learn/algorithms-graphs-data-structures/lecture/NA9cO/universal-hashing-definition-and-example-advanced-optional
- https://www.coursera.org/learn/algorithms-graphs-data-structures/lecture/YhHRN/universal-hashing-analysis-of-chaining-advanced-optional
- https://www.coursera.org/learn/algorithms-graphs-data-structures/lecture/3bPLu/hash-table-performance-with-open-addressing-advanced-optional.