Hash Table: Difference between revisions

From NovaOrdis Knowledge Base
Jump to navigation Jump to search
Line 19: Line 19:
==SEARCH()==
==SEARCH()==
[[Data_Structures#SEARCH.28K.29|SEARCH(K)]]
[[Data_Structures#SEARCH.28K.29|SEARCH(K)]]
=Canonical Use=


=TO DEPLETE=
=TO DEPLETE=

Revision as of 18:42, 15 October 2021

External

Internal


Overview

Supported Operations

INSERT()

INSERT(X)

DELETE()

DELETE(X)

SEARCH()

SEARCH(K)

Canonical Use

TO DEPLETE

Overview_TODEPLETE

A hash table is a dictionary, a dynamic set that supports INSERT, DELETE and SEARCH very efficiently. We will see that on average, these operations are O(1).

The simplest, and the most time-efficient implementation of a hash table is a direct address table. However, this implementation is in most cases inefficient from the point of view of allocated space. A much more practical implementation uses an array whose length is proportional with the number of keys actually stored. The advantage of such data structure is that on average, SEARCH, INSERT and DELETE operations are O(1).

In case of a distributed hash table, each slot may be implemented to reside in a different address space, accessible over the network. This case raises two major problems: a storage node may fail and become unavailable, so a solution for high availability must be implemented, and secondly, hash functions such as hashing by division or hashing by multiplication, assume a constant, unchanging number of slots, and if this number changes, all hash function values become invalid, requiring mass re-distribution, which is an event we must avoid. The problem is addressed by consistent hashing.

Direct-Address Table

Direct-address tables make sense when we can afford to allocate an array that has one element for every possible key. This happens when the total number of possible keys is small. To represent a dynamic set, we use an array denoted by T[0 .. m-1], in which each position, or slot, corresponds to a key in the key universe. The array elements store pointers to the object instances corresponding to the keys. If the set contains no element with a key k, then T[k] = NULL. INSERT, DELETE and SEARCH take O(1) time. The disadvantage of using direct-address tables is that if the universe of key is large, allocating an array element for each key is wasteful, and in many cases, impossible, given a computer's memory constraints.

Hash Table

When the number of keys actually stored is small relative to the total number of possible keys, a hash table is much more effective alternative to a direct-address table. A hash table uses an array whose length is proportional to the number of keys actually stored.

The index of the array element used to store a key, or a pointer to the object containing the key and other data, is computed from the key using a hash function h(k). The array element used to store a key is called slot, or bucket. We say that an element with key k hashes to slot h(k). We also say that h(k) is the hash value of key k.

Collisions are possible. A collision is the situation in which more than one key maps on the same slot. There are different techniques to handle those. Chaining is the most common. Open addressing is another way.

Hash Functions

The hash function h() maps the universe U of keys to the slots of the hash table T[0 ... m-1], where the size m of the hash table is typically much smaller than the size of U:

h : U → {0, 1, ..., m-1}

A hash functions must be deterministic, in that a given input k must always produce the same output h(k). A well designed hash function must also minimize the number of collisions. This can be achieved if each key is equally likely to hash to any of the m slots, independently of where any other key has hashed to. This behavior is called simple uniform hashing. Typically we have no way to check this condition, as we rarely know the probability distribution from which the keys are drawn. Moreover, the key might not be drawn independently. A good approach derives the hash value in a way that we expect to be independent of any patterns that might exist in the data.

Keys as Natural Numbers. All hash functions described below (hashing by division, hashing by multiplication, etc.), assume that the keys are mapped to natural numbers, by some method. When the keys are natural numbers themselves, there is nothing more to be done. However, when the keys are strings, for example, they can be mapped to natural numbers by interpreting them as an integer expressed in a suitable radix notation: each character is expressed as its corresponding 0-127 ASCII code and the position of the character in the string dictates the power 128 is raised to. The result is a natural number, which can grow quite large for long strings. In the context of an application, we can usually devise some method of translating a key into a natural number. The Java Object's hashCode() is aimed to provide this functionality - mapping an object key to an integer numbers.

Hashing by Division

The division method computes the hash value as the remainder when the key value is divided by a specified prime number. This gives good results if we choose a prime number that is unrelated to any patterns in the distribution of keys. We assume that a non-natural number key has been mapped to an natural number using the approach described in the Keys as Natural Numbers section.

h(k) = k mod m

Recommended values for m. We should avoid certain values of m. For example, m should not be a power of 2, because in that case k mod 2p is just the p lowest-order bits of k. Unless we know that all low-order p-bit patterns are equally likely, we are better off designing a hash function that depends on all bits of the key. A prime not too close to an exact power of 2 is often a good choice for m.

Hashing by Multiplication

The multiplication method computes the hash value in two steps:

  1. Multiply the k value with a constant in the range 0 < A < 1 and extract the fractional part of k⋅A.
  2. Multiply this value by m and take the floor of the result.
h(k) = ⌊m (k⋅A mod 1)⌋

Knuth suggests that A = (√5 - 1)/2 will work well.

The advantage of the multiplication method is that is not sensitive to the value of m, so m can be chosen to be a power of 2.

Universal Hashing

If we use a fixed hash function such as hashing by division or hashing by multiplication, an adversary can choose keys that hash to the same slot, leading to Θ(n) search time. To address this problem, we can choose the hash function randomly in a way that is independent of the keys that are actually going to be stored. This is called universal hashing.

TODO CLRS page 265.

Fibonacci Hashing

https://probablydance.com/2018/06/16/fibonacci-hashing-the-optimization-that-the-world-forgot-or-a-better-alternative-to-integer-modulo

Collision Handling Techniques

Chaining

Chaining is implemented by placing all elements that hash to the same slot in a linked list, usually at the head. The slot contains a pointer to the head of the list, or NULL.

HashTable Chaining.png

Note that for INSERT and SEARCH we need to scan the linked list. In case of INSERT, we want to know if the element with the given key isn't already in the list. For DELETE, if the list is doubly-linked, the operation is O(1) - note that the operation takes as input the element and not its key.

Time and Space Complexity for Hash Table with Chaining

In the worst case, searching an element in a hash table can take as long as the search in a linked list Θ(n). However, on average, and assuming simple uniform hashing, SEARCH, INSERT and DELETE operations take O(1). The space requirements is Θ(|K|), where K is the set of keys actually stored.

TODO CLRS Page 258.

Open Addressing

Open addressing is a technique employed when we need to deal with collisions in a hash table. In open addressing, all elements occupy the hash table itself. When searching for an element, we systematically examine the table slots until we either find the desired element of the dynamic set or NULL.

TODO CLRS page 269.

Perfect Hashing

Perfect hashing can support searches in O(1) worst-case time, when the set of keys being stored is static - the set of keys never changes once it is stored.

Miscellanea