Hash Functions: Difference between revisions

From NovaOrdis Knowledge Base
Jump to navigation Jump to search
 
(5 intermediate revisions by the same user not shown)
Line 10: Line 10:
  h:U → {0, 1, ..., n-1}
  h:U → {0, 1, ..., n-1}
</font>
</font>
The hash function should be easy to compute, in constant time. It should "spread out" most of the data sets, at least the non-pathological data sets. Also, the has function should be easy to store.
The hash function should be easy to compute, in constant time. It should "spread out" most of the data sets, at least the non-pathological data sets. Also, the hash function should be easy to store.


A hash functions must be '''deterministic''', in that a given input k must always produce the same output h(k). A well designed hash function must also minimize the number of [[Hash_Table#Collisions|collisions]]. This can be achieved if each key is equally likely to hash to any of the n buckets, independently of where any other key has hashed to. This behavior is called <span id='Simple_Uniform_Hashing'></span>'''simple uniform hashing'''. Typically we have no way to check this condition, as we rarely know the probability distribution from which the keys are drawn. Moreover, the key might not be drawn independently. A good approach derives the hash value in a way that we expect to be independent of any patterns that might exist in the data.  
A hash functions must be '''deterministic''', in that a given input k must always produce the same output h(k).  
 
The hash function must have the following properties:
 
# If a == b, then h(a) == h(b)
# If h(a) == h(b), then a might be equal to b
# If h(a) != h(b) then a !=b
 
A well designed hash function must also minimize the number of [[Hash_Table#Collisions|collisions]]. This can be achieved if each key is equally likely to hash to any of the n buckets, independently of where any other key has hashed to. This behavior is called <span id='Simple_Uniform_Hashing'></span>'''simple uniform hashing'''. Typically we have no way to check this condition, as we rarely know the probability distribution from which the keys are drawn. Moreover, the key might not be drawn independently. A good approach derives the hash value in a way that we expect to be independent of any patterns that might exist in the data.  


Collisions are handled by the hash table implementation with [[Hash_Table#Chaining|chaining]] or [[Hash_Table#Open_Addressing|open addressing]].
Collisions are handled by the hash table implementation with [[Hash_Table#Chaining|chaining]] or [[Hash_Table#Open_Addressing|open addressing]].
Line 64: Line 72:
{{External|https://www.coursera.org/learn/algorithms-graphs-data-structures/lecture/YhHRN/universal-hashing-analysis-of-chaining-advanced-optional}}
{{External|https://www.coursera.org/learn/algorithms-graphs-data-structures/lecture/YhHRN/universal-hashing-analysis-of-chaining-advanced-optional}}
{{External|https://www.coursera.org/learn/algorithms-graphs-data-structures/lecture/3bPLu/hash-table-performance-with-open-addressing-advanced-optional}}
{{External|https://www.coursera.org/learn/algorithms-graphs-data-structures/lecture/3bPLu/hash-table-performance-with-open-addressing-advanced-optional}}
If we use a fixed hash function such as hashing by division or hashing by multiplication, an adversary can choose keys that hash to the same slot, leading to Θ(n) search time. More generically, for any hash function and adversary can choose keys from the function's [[#Pathological_Data_Set|pathological data set]], to the same result. To address this problem, we can choose the hash function '''randomly''' in a way that is '''independent''' of the keys that are actually going to be stored. This is called '''universal hashing'''.
<font color=darkkhaki>TODO [[CLRS]] page 265.</font>
==Pathological Data Sets==
==Pathological Data Sets==
Every hash function has a data set for which it performs really bad - its "pathological data set".
Every hash function has a data set for which it performs really bad - its "pathological data set".
Line 72: Line 85:
* https://www.coursera.org/learn/algorithms-graphs-data-structures/lecture/YhHRN/universal-hashing-analysis-of-chaining-advanced-optional
* https://www.coursera.org/learn/algorithms-graphs-data-structures/lecture/YhHRN/universal-hashing-analysis-of-chaining-advanced-optional
* https://www.coursera.org/learn/algorithms-graphs-data-structures/lecture/3bPLu/hash-table-performance-with-open-addressing-advanced-optional.</font>
* https://www.coursera.org/learn/algorithms-graphs-data-structures/lecture/3bPLu/hash-table-performance-with-open-addressing-advanced-optional.</font>
=Perfect Hashing=
'''Perfect hashing''' can support searches in O(1) worst-case time, when the set of keys being stored is static - the set of keys never changes once it is stored.
=Fibonacci Hashing=
{{External|https://probablydance.com/2018/06/16/fibonacci-hashing-the-optimization-that-the-world-forgot-or-a-better-alternative-to-integer-modulo}}

Latest revision as of 16:06, 11 September 2022

External

Internal

Overview

A hash function h() maps the universe U of keys to the slots of the hash table T[0 ... n-1], where the size n of the hash table is typically much smaller than the size of U:

h:U → {0, 1, ..., n-1}

The hash function should be easy to compute, in constant time. It should "spread out" most of the data sets, at least the non-pathological data sets. Also, the hash function should be easy to store.

A hash functions must be deterministic, in that a given input k must always produce the same output h(k).

The hash function must have the following properties:

  1. If a == b, then h(a) == h(b)
  2. If h(a) == h(b), then a might be equal to b
  3. If h(a) != h(b) then a !=b

A well designed hash function must also minimize the number of collisions. This can be achieved if each key is equally likely to hash to any of the n buckets, independently of where any other key has hashed to. This behavior is called simple uniform hashing. Typically we have no way to check this condition, as we rarely know the probability distribution from which the keys are drawn. Moreover, the key might not be drawn independently. A good approach derives the hash value in a way that we expect to be independent of any patterns that might exist in the data.

Collisions are handled by the hash table implementation with chaining or open addressing.

In case the hash table is implemented using chaining, the running times of all operations will be given by the length of the chaining lists, and the average length of those lists depends on the quality of the hash function: better the hash function spreads the keys over buckets, more balanced and short the chaining lists are. Equal-length linked lists are better than unequal linked lists, so we want the hash function to spread data the data out as equally as possible.

A similar situation is encountered in case the hash table is implemented with open addressing. The running time is governed by the length of the probe sequence: how many time you have to look in the hash table until a suitable empty bucket is found. The probe sequence is going to depend on the hash function. For good hash functions, that spread data evenly, the probe sequences will be relatively short. For a constant hash function, each new addition is a collision with increasingly long probe sequence.

In hashing, the standard in spreading data out is the performance of a completely random function. However, completely random hashing cannot be used for a very good reason: once a random hash has been used to store the key, we also need to retrieve the key, and this is not possible if we don't store the key - position association. If we do that, the solution devolves to a linked list, the lookup will be in linear time - we need to scan all the existing keys to find the position, negating the benefits of a hash table.

Since every operation includes a hash function evaluation, if we want each operation to run in constant time, the hash function evaluation must also run in constant time.


⚠️ It is really easy to inadvertently design bad has functions, and bad has functions lead to poor hash table performance.

Properties of a Good Hash Function

  1. Should have good performance - constant running time.
  2. Should be easy to store.

Example of a Reasonably Good Hash Function

In the first part you take an object that is not intrinsically numeric and turn it into an integer, possibly a large integer. This stage is called "computing the hash code". The, the hash code so computed must be mapped over the bucket range {0, 1, ..., n-1}. This stage is called the "compression function".

HashFunction.png

Compute the Hash Code

When the keys are natural numbers themselves, there is nothing more to be done. However, when the keys are strings, for example, they can be mapped to natural numbers by interpreting them as an integer expressed in a suitable radix notation: each character is expressed as its corresponding 0-127 ASCII code and the position of the character in the string dictates the power 128 is raised to. The result is a natural number, which can grow quite large for long strings. In the context of an application, we can usually devise some method of translating a key into a natural number. The Java Object's hashCode() is aimed to provide this functionality - mapping an object key to an integer numbers.

Compression Function

Hashing by Division

The division method computes the hash value as the remainder when the key value is divided by a specified prime number. This gives good results if we choose a prime number that is unrelated to any patterns in the distribution of keys. We assume that a non-natural number key has been mapped to a natural number using the approach described in the Compute the Hash Code section.

h(k) = k mod n

Choosing n

Pick a natural number n, which is close in size to │S│, where S is the subset of the universe of objects U that contains the objects we want to keep in memory. Also, because n will be most likely used as an operand of the modulus compression function on the hash code, it is best if n and the hash code integers do not share factors. The easiest way to achieve this is to choose n to be prime. The prime should not be close to patterns in the data.

n should not be a power of 2, because in that case k mod 2p is just the p lowest-order bits of k. Unless we know that all low-order p-bit patterns are equally likely, we are better off designing a hash function that depends on all bits of the key. A prime not too close to an exact power of 2 is often a good choice for n.

Hashing by Multiplication

The multiplication method computes the hash value in two steps:

  1. Multiply the k value with a constant in the range 0 < A < 1 and extract the fractional part of k⋅A.
  2. Multiply this value by n and take the floor of the result.

h(k) = ⌊n (k⋅A mod 1)⌋

Knuth suggests that A = (√5 - 1)/2 will work well. The advantage of the multiplication method is that is not sensitive to the value of n, so n can be chosen to be a power of 2.

Known Hash Functions

Universal Hashing

https://www.coursera.org/learn/algorithms-graphs-data-structures/lecture/1SQo3/pathological-data-sets-and-universal-hashing-motivation
https://www.coursera.org/learn/algorithms-graphs-data-structures/lecture/NA9cO/universal-hashing-definition-and-example-advanced-optional
https://www.coursera.org/learn/algorithms-graphs-data-structures/lecture/YhHRN/universal-hashing-analysis-of-chaining-advanced-optional
https://www.coursera.org/learn/algorithms-graphs-data-structures/lecture/3bPLu/hash-table-performance-with-open-addressing-advanced-optional

If we use a fixed hash function such as hashing by division or hashing by multiplication, an adversary can choose keys that hash to the same slot, leading to Θ(n) search time. More generically, for any hash function and adversary can choose keys from the function's pathological data set, to the same result. To address this problem, we can choose the hash function randomly in a way that is independent of the keys that are actually going to be stored. This is called universal hashing.

TODO CLRS page 265.

Pathological Data Sets

Every hash function has a data set for which it performs really bad - its "pathological data set".

More details universal hashing, pathological data sets, etc. in:

Perfect Hashing

Perfect hashing can support searches in O(1) worst-case time, when the set of keys being stored is static - the set of keys never changes once it is stored.

Fibonacci Hashing

https://probablydance.com/2018/06/16/fibonacci-hashing-the-optimization-that-the-world-forgot-or-a-better-alternative-to-integer-modulo