Hash Table

=External=
 
* https://www.coursera.org/learn/algorithms-graphs-data-structures/lecture/b2Uee/hash-tables-operations-and-applications
* http://en.wikipedia.org/wiki/Hash_table
* https://www.coursera.org/learn/algorithms-graphs-data-structures/lecture/Ob0K7/hash-tables-implementation-details-part-i
* https://www.coursera.org/learn/algorithms-graphs-data-structures/lecture/1Y8t1/hash-tables-implementation-details-part-ii


=Internal=
* [[Distributed Hash Table|Distributed Hash Table]]
* [[Java HashMap]]
 
* [[Hash Functions]]


=Overview=
Hash tables are among the most heavily used data structures in programming. They do not support many operations ([[#INSERT|INSERT()]], [[#DELETE|DELETE()]] and [[#SEARCH|SEARCH()]]), but what they do, they do really well. Conceptually, a hash table is an array that provides immediate random access, with [[#Constant_Time|constant time]] insertion, deletion and lookup, based on '''an arbitrary key''' rather than an integral index, for a dynamic set of keys. Hash table implementations do store data in regular arrays, and the mapping between the arbitrary key and the integral position in the array is provided by a [[#Hash_Functions|hash function]]. Hash tables are sometimes referred to as [[Data_Structures#Dictionary|dictionaries]], which is fine as long as the dictionary is not assumed to support queries on totally ordered sets: hash tables do not maintain the ordering of the elements they contain.


A space-inefficient but extremely time-efficient variant of the hash table is the [[#Direct-Address_Table|direct-address table]].


=Canonical Use=
Hash tables are used in situations where we need to do '''repeated fast lookups of arbitrary keys'''.
* [[The Deduplication Problem]]
* [[The 2-SUM Problem]]
* Symbol table lookups in compilers
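For instance, the deduplication problem reduces to repeated membership tests against a hash-based set; a minimal Java sketch (the input values are made up for illustration):
<syntaxhighlight lang='java'>
import java.util.HashSet;
import java.util.Set;

// A minimal sketch of hash-based deduplication: each membership test and
// insertion is O(1) on average, so the whole pass is linear in the input size.
public class Deduplication {
    public static void main(String[] args) {
        String[] input = {"a", "b", "a", "c", "b"}; // made-up sample input
        Set<String> seen = new HashSet<>();
        for (String s : input) {
            if (seen.add(s)) {          // add() returns false if the key was already present
                System.out.println(s);  // print each distinct value once
            }
        }
    }
}
</syntaxhighlight>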


=Supported Operations=
All operations are O(1) on average. If the hash table is implemented using [[#Chaining|chaining]], the running time is technically O(list length): the performance is governed by the length of the chaining lists, and the length of the lists depends on the quality of the [[Hash_Functions|hash function]].
==<span id='INSERT'></span>INSERT(X)==
Adds a new key/value pair [[Data_Structures#INSERT.28X.29|INSERT(X)]] to the hash table in O(1) running time.


==<span id='DELETE'></span>DELETE(K)==
Deletes a key and the corresponding value [[Data_Structures#DELETE.28X.29|DELETE(K)]] from the hash table in O(1) running time.


==<span id='SEARCH'></span>SEARCH(K)==
Returns the value corresponding to the provided key K [[Data_Structures#SEARCH.28K.29|SEARCH(K)]], or NULL if the key does not exist, in O(1) running time.
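As an illustration, these operations map directly onto [[Java HashMap]]'s put(), remove() and get(); a minimal usage sketch (the keys and values are made up):
<syntaxhighlight lang='java'>
import java.util.HashMap;
import java.util.Map;

// INSERT, DELETE and SEARCH on a hash table, all O(1) on average.
public class HashTableOperations {
    public static void main(String[] args) {
        Map<String, Integer> table = new HashMap<>();
        table.put("alice", 30);               // INSERT(X): add a key/value pair
        table.put("bob", 25);
        Integer age = table.get("alice");     // SEARCH(K): returns the value, or null if the key is absent
        System.out.println(age);              // 30
        table.remove("bob");                  // DELETE(K): remove the key and its value
        System.out.println(table.get("bob")); // null
    }
}
</syntaxhighlight>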


=Hash Table Implementation Discussion=
The purpose of a hash table is to maintain a dynamic set of records, each record identified by a unique key, and to provide [[#Constant_Time|constant time]] insertion, deletion and search. This is implemented by applying a [[Hash_Functions#Overview|hash function]] to the key, which converts the arbitrary key into the numeric index of the [[#Bucket|bucket]] where the key and the associated value are stored. The conversion itself - the hash function running time - should also be constant time.


More formally, we aim to maintain, manage and look up a subset S of a universe of objects U. The cardinality of U is very large, but the cardinality of S is small - thousands, tens of thousands, in general something that fits in the memory of a computer. S is evolving, meaning that we need to add new elements to S and delete existing elements from S. Most importantly, we need to look up elements of S. The idea is that we pick a natural number n, close in size to │S│ - the size of S - and we maintain n buckets, where a <span id='Bucket'></span>'''bucket''' is the location in memory where a specific key/value pair is stored. More details on how to choose the number n are available in [[Hash_Functions#Choosing_n|Hash Functions - Choosing n]]. We can use an array A of size n, each element of the array being a bucket. We then choose a hash function:
<font size=-1>
h:U → {0,1,2 ..., n-1}
</font>
and we use the hash function result on the object (key) x to select the bucket to store the key and the value in: store x in A[h(x)].  
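A minimal sketch of this bucket selection, assuming Java's hashCode() is used as the numeric hash of the key and n is the number of buckets (the class and method names are for illustration only):
<syntaxhighlight lang='java'>
// A minimal sketch of bucket selection: the key's hash value, reduced modulo the
// number of buckets n, gives the index of the bucket where the key/value pair is stored.
public class BucketSelection {

    static int bucketIndex(Object key, int n) {
        // floorMod keeps the index in {0, ..., n-1} even for negative hash codes
        return Math.floorMod(key.hashCode(), n);
    }

    public static void main(String[] args) {
        int n = 16; // number of buckets, chosen for illustration
        System.out.println(bucketIndex("some key", n)); // store the pair in A[h(x)]
    }
}
</syntaxhighlight>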


This approach raises the important concept of [[#Collisions|collision]], which is fundamental to hash table implementation: a collision happens when two different keys produce the same hash function result, so they must be stored in the same bucket.


The overall constant time guarantee of the hash table only holds if the hash function is properly implemented and the data set is non-pathological. What "properly implemented" and "pathological data set" mean is addressed at length in the [[Hash Functions#Overview|Hash Functions]] article. Also see the [[#Constant_Time_Discussion|Constant Time Discussion]] below.


==Collisions==
A collision happens when applying the hash function to two different keys produces the same result. The consequence of a collision is that we need to store two different key/value pairs in the same bucket. More formally, a collision is two distinct x, y ∈ U such that h(x) = h(y). There are two techniques for dealing with collisions: [[#Chaining|chaining]] and [[#Open_Addressing|open addressing]]. Sometimes chaining performs better, sometimes open addressing performs better. If space is at a premium, open addressing is preferable, because chaining has a space overhead. Deletion is trickier when open addressing is used. Open addressing may perform better in the presence of memory hierarchies.
===Chaining===
Each bucket contains not just one element, but a linked list of all the elements that hashed to that bucket; new elements are typically inserted at the head of the list, and the bucket holds a pointer to the head of the list, or NULL if the bucket is empty. The list can in principle hold an unbounded number of elements, so INSERT and SEARCH must scan it, while DELETE is O(1) if the list is doubly linked and the operation is given a pointer to the element rather than its key. In the worst case, when all keys hash to the same bucket, SEARCH degenerates into a Θ(n) scan of a linked list, and the additional space used by the lists is proportional to the number of stored elements. The running time for chaining is determined by the quality of the hash function; a detailed discussion is available in the [[Hash Functions#Overview|Hash Functions]] article.


::[[File:HashTable_Chaining.png|400px]]
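A minimal sketch of a chaining-based hash table, with a fixed number of buckets and no resizing (the class and method names are for illustration only):
<syntaxhighlight lang='java'>
// A minimal chaining sketch: each bucket holds the head of a singly linked list
// of entries whose keys hashed to that bucket. Fixed bucket count, no resizing.
public class ChainedHashTable<K, V> {

    private static class Node<K, V> {
        final K key; V value; Node<K, V> next;
        Node(K key, V value, Node<K, V> next) { this.key = key; this.value = value; this.next = next; }
    }

    private final Node<K, V>[] buckets;

    @SuppressWarnings("unchecked")
    public ChainedHashTable(int n) { buckets = new Node[n]; }

    private int bucket(K key) { return Math.floorMod(key.hashCode(), buckets.length); }

    public void insert(K key, V value) {
        int b = bucket(key);
        for (Node<K, V> node = buckets[b]; node != null; node = node.next) {
            if (node.key.equals(key)) { node.value = value; return; } // key already present, overwrite
        }
        buckets[b] = new Node<>(key, value, buckets[b]); // insert at the head of the list
    }

    public V search(K key) {
        for (Node<K, V> node = buckets[bucket(key)]; node != null; node = node.next) {
            if (node.key.equals(key)) { return node.value; }
        }
        return null; // key not present
    }
}
</syntaxhighlight>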


===Open Addressing===
Each bucket holds at most one element, and all elements are stored in the bucket array itself. In case of a collision, the hash table implementation probes the bucket array for an open position, multiple times if necessary, until it finds one. There are multiple strategies for finding an open position. One is linear probing: try the next position, and if it is occupied, the next one, and so on. Another is double hashing: the first hash value provides the initial position, and the second hash value provides an offset from the initial position, in case it is occupied; the offset is applied repeatedly until an open position is found. SEARCH probes the same sequence of positions until it either finds the element or reaches an empty bucket. The running time for open addressing is determined by the quality of the hash function; a detailed discussion is available in the [[Hash Functions#Overview|Hash Functions]] article.


<font color=darkkhaki>[[CLRS]] page 269.</font>
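A minimal sketch of open addressing with linear probing, simplified to a fixed capacity with no deletion or resizing (the class and method names are for illustration only):
<syntaxhighlight lang='java'>
// A minimal linear probing sketch: on a collision, try the next bucket, wrapping
// around, until an empty bucket (for INSERT) or the key (for SEARCH) is found.
public class LinearProbingHashTable<K, V> {

    private final Object[] keys;
    private final Object[] values;

    public LinearProbingHashTable(int capacity) {
        keys = new Object[capacity];
        values = new Object[capacity];
    }

    public boolean insert(K key, V value) {
        int i = Math.floorMod(key.hashCode(), keys.length);
        for (int probes = 0; probes < keys.length; probes++) {
            if (keys[i] == null || keys[i].equals(key)) { // open position, or same key: overwrite
                keys[i] = key;
                values[i] = value;
                return true;
            }
            i = (i + 1) % keys.length; // probe the next position
        }
        return false; // table is full
    }

    @SuppressWarnings("unchecked")
    public V search(K key) {
        int i = Math.floorMod(key.hashCode(), keys.length);
        for (int probes = 0; probes < keys.length; probes++) {
            if (keys[i] == null) { return null; }            // hit an empty bucket: key not present
            if (keys[i].equals(key)) { return (V) values[i]; }
            i = (i + 1) % keys.length;
        }
        return null;
    }
}
</syntaxhighlight>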


==Load Factor==
There is a very important parameter that plays a big role in the performance of a hash table: the '''load factor''' (or simply, the "load") of the hash table. It reflects how populated a typical bucket of the hash table is:
<font size=-1>
      number of objects in the hash table
  α = ──────────────────────────────────────
                      n
</font>
where n is the number of buckets. As more objects are inserted in the hash table, the load grows. Keeping the number of objects fixed while increasing the number of buckets decreases the load. Hash tables implemented using [[#Chaining|chaining]] can have a load factor bigger than 1. Hash tables implemented using [[#Open_Addressing|open addressing]] cannot, because with open addressing there cannot be more objects than buckets in the hash table. In fact, for open addressing hash tables, the load factor should be kept much less than 1 (α ≪ 1). A discussion of how the load factor influences the running time is available below in the [[#Constant_Time_Discussion|Constant Time Discussion]] section.


α can be controlled at runtime by increasing (or decreasing) the number of buckets. If the number of buckets is changed, the data must be redistributed.
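A minimal sketch of that redistribution for a chaining-based table, assuming buckets are represented as lists of key/value entries (the names are for illustration only): doubling the number of buckets halves α, but every entry must be re-hashed, because its bucket index depends on the new bucket count.
<syntaxhighlight lang='java'>
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class Rehashing {

    // Doubles the number of buckets and redistributes every entry.
    static <K, V> List<Map.Entry<K, V>>[] resize(List<Map.Entry<K, V>>[] oldBuckets) {
        @SuppressWarnings("unchecked")
        List<Map.Entry<K, V>>[] newBuckets = new List[oldBuckets.length * 2];
        for (int i = 0; i < newBuckets.length; i++) {
            newBuckets[i] = new ArrayList<>();
        }
        for (List<Map.Entry<K, V>> bucket : oldBuckets) {
            for (Map.Entry<K, V> entry : bucket) {
                // the bucket index depends on the new bucket count
                int b = Math.floorMod(entry.getKey().hashCode(), newBuckets.length);
                newBuckets[b].add(entry);
            }
        }
        return newBuckets;
    }
}
</syntaxhighlight>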


==Hash Functions==
{{Internal|Hash Functions#Overview|Hash Functions}}


A hash function must be '''deterministic''': a given input k must always produce the same output h(k). A well designed hash function must also minimize the number of [[#Collisions|collisions]]. This can be achieved if each key is equally likely to hash to any of the buckets, independently of where any other key has hashed to. This behavior is called <span id='Simple_Uniform_Hashing'></span>'''simple uniform hashing'''. Typically we have no way to check this condition, as we rarely know the probability distribution from which the keys are drawn. Moreover, the keys might not be drawn independently. A good approach derives the hash value in a way that we expect to be independent of any patterns that might exist in the data.

===<span id='Keys_as_Natural_Numbers'></span>Keys as Natural Numbers===
All hash functions described below ([[#Hashing_by_Division|hashing by division]], [[#Hashing_by_Multiplication|hashing by multiplication]], etc.) assume that the keys are mapped to natural numbers by some method. When the keys are natural numbers themselves, there is nothing more to be done. When the keys are strings, for example, they can be mapped to natural numbers by interpreting them as an integer expressed in a suitable radix notation: each character is expressed as its corresponding 0-127 [[Character_Encoding#US-ASCII|ASCII code]], and the position of the character in the string dictates the power 128 is raised to. The result is a natural number, which can grow quite large for long strings. In the context of an application, we can usually devise some method of translating a key into a natural number. Java's Object.hashCode() is aimed at providing this functionality: mapping an object key to an integer.

===Hashing by Division===
The '''division method''' computes the hash value as the remainder when the key value is divided by a specified ''prime'' number m. This gives good results if we choose a prime number that is unrelated to any patterns in the distribution of keys. We assume that a non-numeric key has been mapped to a natural number using the approach described in the [[#Keys_as_Natural_Numbers|Keys as Natural Numbers]] section.
<font size=-1>
 h(k) = k mod m
</font>
'''Recommended values for m'''. We should avoid certain values of m. For example, m should not be a power of 2, because in that case k mod 2<sup>p</sup> is just the p lowest-order bits of k. Unless we know that all low-order p-bit patterns are equally likely, we are better off designing a hash function that depends on all bits of the key. A prime not too close to an exact power of 2 is often a good choice for m.
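A minimal sketch of the division method for string keys, combining the radix-128 mapping described in [[#Keys_as_Natural_Numbers|Keys as Natural Numbers]] with h(k) = k mod m; the modulus is applied while accumulating so the intermediate value does not overflow (the value of m is chosen for illustration only):
<syntaxhighlight lang='java'>
public class DivisionHash {

    private static final int M = 769; // a prime not too close to a power of 2, chosen for illustration

    // Maps a string key to a bucket index in {0, 1, ..., M-1}.
    static int hash(String key) {
        long k = 0;
        for (int i = 0; i < key.length(); i++) {
            // treat the string as a number in radix 128, reducing mod M at each step
            k = (k * 128 + key.charAt(i)) % M;
        }
        return (int) k; // h(k) = k mod m
    }

    public static void main(String[] args) {
        System.out.println(hash("hash"));
        System.out.println(hash("table"));
    }
}
</syntaxhighlight>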


===Hashing by Multiplication===
The '''multiplication method''' computes the hash value in two steps:
# Multiply the key value k by a constant A in the range 0 < A < 1 and extract the fractional part of k⋅A.
# Multiply this fractional part by m and take the floor of the result.
<font size=-1>
 h(k) = ⌊m (k⋅A mod 1)⌋
</font>
Knuth suggests that A = (√5 - 1)/2 will work well.

The advantage of the multiplication method is that it is not sensitive to the value of m, so m can be chosen to be a power of 2.
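A minimal sketch of the multiplication method for integer keys, using Knuth's suggested constant and a power-of-2 bucket count (both chosen for illustration):
<syntaxhighlight lang='java'>
public class MultiplicationHash {

    private static final double A = (Math.sqrt(5) - 1) / 2; // Knuth's suggestion, ≈ 0.618
    private static final int M = 1 << 10;                   // m = 1024 buckets, a power of 2

    // h(k) = ⌊m (k⋅A mod 1)⌋
    static int hash(int k) {
        double fractional = (k * A) % 1.0;   // fractional part of k⋅A
        return (int) Math.floor(M * fractional);
    }

    public static void main(String[] args) {
        for (int k = 1; k <= 5; k++) {
            System.out.println(k + " -> bucket " + hash(k));
        }
    }
}
</syntaxhighlight>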
 
===Universal Hashing===
If we use a fixed hash function such as hashing by division or hashing by multiplication, an adversary can choose keys that all hash to the same bucket, leading to Θ(n) search time. To address this problem, we can choose the hash function ''randomly'', in a way that is ''independent'' of the keys that are actually going to be stored. This is called '''universal hashing'''.

<font color=darkgray>TODO [[CLRS]] page 265.</font>

===Fibonacci Hashing===
{{External|https://probablydance.com/2018/06/16/fibonacci-hashing-the-optimization-that-the-world-forgot-or-a-better-alternative-to-integer-modulo}}

==<span id='Constant_Time'></span>Constant Time Discussion==
The constant time guarantee applies to properly implemented hash tables. Hash tables are easy to implement badly, and in those cases the constant time guarantee does not hold.

The O(1) running time guarantee is '''on average''', assuming [[Hash_Functions#Simple_Uniform_Hashing|simple uniform hashing]].

The guarantee does not hold for every possible data set. For a specific [[Hash_Functions#Overview|hash function]], there is always a [[Hash_Functions#Pathological_Data_Sets|pathological data set]] for which all elements of the set hash to the same value, and in that case the constant time guarantee does not hold. More details about pathological data sets and techniques to work around this issue are available in the [[Hash_Functions#Overview|Hash Functions]] article.

Another condition for constant running time is that the hash table [[#Load_Factor|load factor]] α must be O(1). If the load factor is a function of n, then intuitively, for a chaining-based implementation, the length of the chaining lists maintained for each bucket will not be constant and will impact the running time. Note that in this case n denotes the number of buckets, '''not the number of elements in the hash table'''.

<font color=darkkhaki>[[CLRS]] Page 258.</font>
 
=Direct-Address Table=
Direct-address tables make sense when we can afford to allocate an array that has one element for every possible key. This happens when the total number of possible keys is small. To represent a dynamic set, we use an array denoted T[0 .. m-1], in which each position, or '''slot''', corresponds to a key in the key universe. The array elements store pointers to the object instances corresponding to the keys. If the set contains no element with key k, then T[k] = NULL. INSERT, DELETE and SEARCH take O(1) time. The disadvantage of direct-address tables is that if the universe of keys is large, allocating an array element for each key is wasteful, and in many cases impossible, given a computer's memory constraints.
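A minimal sketch of a direct-address table, assuming keys are small non-negative integers drawn from the universe {0, 1, ..., m-1} (the class and method names are for illustration only):
<syntaxhighlight lang='java'>
// A minimal direct-address table sketch: one slot per possible key, holding the
// associated value, or null if the key is absent.
public class DirectAddressTable<V> {

    private final Object[] slots;

    public DirectAddressTable(int universeSize) {
        slots = new Object[universeSize]; // one slot for every possible key
    }

    public void insert(int key, V value) { slots[key] = value; }   // O(1)

    public void delete(int key) { slots[key] = null; }             // O(1)

    @SuppressWarnings("unchecked")
    public V search(int key) { return (V) slots[key]; }            // O(1)
}
</syntaxhighlight>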
 
=Distributed Hash Tables=
In the case of a distributed hash table, each [[#Bucket|bucket]] may be implemented to reside in a different address space, accessible over the network. This raises two major problems: a storage node may fail and become unavailable, so a solution for high availability must be implemented; and hash functions such as [[#Hashing_by_Division|hashing by division]] or [[#Hashing_by_Multiplication|hashing by multiplication]] assume a constant, unchanging number of buckets, and if this number changes, all hash function values become invalid, requiring a mass redistribution of the data, which is an event we want to avoid. The problem is addressed by [[Consistent Hashing#Overview|consistent hashing]].
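A small illustration of why this matters: under bucket assignment by k mod n, growing the number of buckets from 8 to 9 moves the vast majority of keys to a different bucket, which in a distributed hash table translates into mass data movement between nodes (the numbers are made up for illustration):
<syntaxhighlight lang='java'>
// Counts how many keys change buckets when the bucket count grows from 8 to 9
// under simple modulo assignment.
public class RehashingCost {
    public static void main(String[] args) {
        int before = 8, after = 9; // hypothetical node counts before and after adding one node
        int moved = 0, total = 10_000;
        for (int key = 0; key < total; key++) {
            if (key % before != key % after) {
                moved++;
            }
        }
        // Most keys change buckets; consistent hashing reduces the expected
        // movement to roughly total/after keys.
        System.out.println(moved + " of " + total + " keys moved");
    }
}
</syntaxhighlight>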
 
=Perfect Hashing=
 
'''Perfect hashing''' can support searches in O(1) worst-case time, when the set of keys being stored is static - the set of keys never changes once it is stored.
 
=Miscellanea=
* [[SHA-1]]

=Organizatorium=
* https://github.com/jwasham/coding-interview-university#hash-table
