Hash Functions: Difference between revisions
Line 14: | Line 14: | ||
Since every operation includes a hash function evaluation, if we want each operation to run in constant time, the hash function evaluation must also run in constant time. | Since every operation includes a hash function evaluation, if we want each operation to run in constant time, the hash function evaluation must also run in constant time. | ||
=Properties of a Good Hash Function= | |||
# Should have good performance - constant running time | |||
# Should be easy to store |
Revision as of 20:36, 17 October 2021
External
- https://www.coursera.org/learn/algorithms-graphs-data-structures/lecture/Ob0K7/hash-tables-implementation-details-part-i
- https://www.coursera.org/learn/algorithms-graphs-data-structures/lecture/1Y8t1/hash-tables-implementation-details-part-ii
Internal
Overview
The hash function should be easy to compute, in constant time. It should "spread out" most of the data sets, at least the non-pathological data sets. Also, the has function should be easy to store.
In case the hash table is implemented using chaining, the running times of all operations will be given by the length of the chaining lists, and the average length of those lists depends on the quality of the hash function: better the hash function spreads the keys over buckets, more balanced and short the chaining lists are. Equal length linked lists is better than unequal linked lists, so we want the hash function to spread data the data out as equally as possible.
A similar situation is encountered in case the hash table is implemented with open addressing. The running time is governed by the length of the probe sequence: how many time you have to look in the hash table until a suitable empty bucket is found. The probe sequence is going to depend on the hash function. For good hash functions, that spread data evenly, the probe sequences will be relatively short. For a constant hash function, each new addition is a collision with increasingly long probe sequence.
In hashing, the standard in spreading data out is the performance of a completely random function. However, completely random hashing cannot be used for a very good reason: once a random hash has been used to store the key, we also need to retrieve the key, and this is not possible if we don't store the key - position association. If we do that, the solution devolves to a linked list, the lookup will be in linear time - we need to scan all the existing keys to find the position, negating the benefits of a hash table.
Since every operation includes a hash function evaluation, if we want each operation to run in constant time, the hash function evaluation must also run in constant time.
Properties of a Good Hash Function
- Should have good performance - constant running time
- Should be easy to store