Hash Functions: Difference between revisions

From NovaOrdis Knowledge Base
Jump to navigation Jump to search
Line 28: Line 28:


==Choosing n==
==Choosing n==
Pick a natural number n, which is close in size to │S│, where [[Hash_Table#Hash_Table_Implementation_Discussion|S is the subset of the universe of objects U that contains the objects we want to keep in memory]]. Also, because n will be most likely used as an operand of the modulus [[#Compression_Function|compression function]] on the hash code, it is best if n and the hash code integers do not share factors. The easiest way to achieve this is to choose n to be prime.
Pick a natural number n, which is close in size to │S│, where [[Hash_Table#Hash_Table_Implementation_Discussion|S is the subset of the universe of objects U that contains the objects we want to keep in memory]]. Also, because n will be most likely used as an operand of the modulus [[#Compression_Function|compression function]] on the [[#Compute_the_Hash_Code|hash code]], it is best if n and the hash code integers do not share factors. The easiest way to achieve this is to choose n to be prime.

Revision as of 20:59, 17 October 2021

External

Internal

Overview

The hash function should be easy to compute, in constant time. It should "spread out" most of the data sets, at least the non-pathological data sets. Also, the has function should be easy to store.

In case the hash table is implemented using chaining, the running times of all operations will be given by the length of the chaining lists, and the average length of those lists depends on the quality of the hash function: better the hash function spreads the keys over buckets, more balanced and short the chaining lists are. Equal length linked lists is better than unequal linked lists, so we want the hash function to spread data the data out as equally as possible.

A similar situation is encountered in case the hash table is implemented with open addressing. The running time is governed by the length of the probe sequence: how many time you have to look in the hash table until a suitable empty bucket is found. The probe sequence is going to depend on the hash function. For good hash functions, that spread data evenly, the probe sequences will be relatively short. For a constant hash function, each new addition is a collision with increasingly long probe sequence.

In hashing, the standard in spreading data out is the performance of a completely random function. However, completely random hashing cannot be used for a very good reason: once a random hash has been used to store the key, we also need to retrieve the key, and this is not possible if we don't store the key - position association. If we do that, the solution devolves to a linked list, the lookup will be in linear time - we need to scan all the existing keys to find the position, negating the benefits of a hash table.

Since every operation includes a hash function evaluation, if we want each operation to run in constant time, the hash function evaluation must also run in constant time.


⚠️ It is really easy to inadvertently design bad has functions, and bad has functions lead to poor hash table performance.

Properties of a Good Hash Function

  1. Should have good performance - constant running time.
  2. Should be easy to store.

Example of a Reasonably Good Hash Function

In the first part you take an object that is not intrinsically numeric and turn it into an integer, possibly a large integer. This stage is called "computing the hash code". The, the hash code so computed must be mapped over the bucket range {0, 1, ..., n-1}. This stage is called the "compression function".

HashFunction.png

Compute the Hash Code

Compression Function

"Modulus" compression function.

Choosing n

Pick a natural number n, which is close in size to │S│, where S is the subset of the universe of objects U that contains the objects we want to keep in memory. Also, because n will be most likely used as an operand of the modulus compression function on the hash code, it is best if n and the hash code integers do not share factors. The easiest way to achieve this is to choose n to be prime.