Hash Functions

From NovaOrdis Knowledge Base
Revision as of 22:03, 17 October 2021 by Ovidiu (talk | contribs) (→‎Overview)
Jump to navigation Jump to search

External

Internal

Overview

A hash function h() maps the universe U of keys to the slots of the hash table T[0 ... n-1], where the size n of the hash table is typically much smaller than the size of U:

h:U → {0, 1, ..., n-1}

The hash function should be easy to compute, in constant time. It should "spread out" most of the data sets, at least the non-pathological data sets. Also, the has function should be easy to store.

A hash functions must be deterministic, in that a given input k must always produce the same output h(k). A well designed hash function must also minimize the number of collisions. This can be achieved if each key is equally likely to hash to any of the n buckets, independently of where any other key has hashed to. This behavior is called simple uniform hashing. Typically we have no way to check this condition, as we rarely know the probability distribution from which the keys are drawn. Moreover, the key might not be drawn independently. A good approach derives the hash value in a way that we expect to be independent of any patterns that might exist in the data.

Collisions are handled by the hash table implementation with chaining or open addressing.

In case the hash table is implemented using chaining, the running times of all operations will be given by the length of the chaining lists, and the average length of those lists depends on the quality of the hash function: better the hash function spreads the keys over buckets, more balanced and short the chaining lists are. Equal length linked lists is better than unequal linked lists, so we want the hash function to spread data the data out as equally as possible.

A similar situation is encountered in case the hash table is implemented with open addressing. The running time is governed by the length of the probe sequence: how many time you have to look in the hash table until a suitable empty bucket is found. The probe sequence is going to depend on the hash function. For good hash functions, that spread data evenly, the probe sequences will be relatively short. For a constant hash function, each new addition is a collision with increasingly long probe sequence.

In hashing, the standard in spreading data out is the performance of a completely random function. However, completely random hashing cannot be used for a very good reason: once a random hash has been used to store the key, we also need to retrieve the key, and this is not possible if we don't store the key - position association. If we do that, the solution devolves to a linked list, the lookup will be in linear time - we need to scan all the existing keys to find the position, negating the benefits of a hash table.

Since every operation includes a hash function evaluation, if we want each operation to run in constant time, the hash function evaluation must also run in constant time.


⚠️ It is really easy to inadvertently design bad has functions, and bad has functions lead to poor hash table performance.

Properties of a Good Hash Function

  1. Should have good performance - constant running time.
  2. Should be easy to store.

Example of a Reasonably Good Hash Function

In the first part you take an object that is not intrinsically numeric and turn it into an integer, possibly a large integer. This stage is called "computing the hash code". The, the hash code so computed must be mapped over the bucket range {0, 1, ..., n-1}. This stage is called the "compression function".

HashFunction.png

Compute the Hash Code

Compression Function

The most common compression function is % n (Modulus n).

Choosing n

Pick a natural number n, which is close in size to │S│, where S is the subset of the universe of objects U that contains the objects we want to keep in memory. Also, because n will be most likely used as an operand of the modulus compression function on the hash code, it is best if n and the hash code integers do not share factors. The easiest way to achieve this is to choose n to be prime. The prime should not be close to patterns in the data.

Known Hash Functions

Universal Hashing

https://www.coursera.org/learn/algorithms-graphs-data-structures/lecture/1SQo3/pathological-data-sets-and-universal-hashing-motivation
https://www.coursera.org/learn/algorithms-graphs-data-structures/lecture/NA9cO/universal-hashing-definition-and-example-advanced-optional
https://www.coursera.org/learn/algorithms-graphs-data-structures/lecture/YhHRN/universal-hashing-analysis-of-chaining-advanced-optional
https://www.coursera.org/learn/algorithms-graphs-data-structures/lecture/3bPLu/hash-table-performance-with-open-addressing-advanced-optional

Pathological Data Sets

Every hash function has a data set for which it performs really bad - its "pathological data set".

More details universal hashing, pathological data sets, etc. in: