External

Internal

Overview

A balanced binary search tree exposes all its nodes via at most log n node traversals. However, if some of the search terms are more accessed than others, and we know a priori the relative access frequency, a balanced search tree might not be optimal. Instead, a tree that keeps the most frequently accessed nodes higher up in the tree offers better performance overall. An optimal search tree minimize the average search time with respect to a given set of probabilities over the keys. The cost of searching in the tree is given by the number of node assessments: if the key we look for is in the root, we only need to look at the root, so the number of node we need to assess is 1, if the key is immediately under the root, the cost is 2, and so on.

Optimal Substructure Lemma

If T is an optimal binary search tree for the keys {1, 2, ... n} with root r, and the keys in sorted order, then its subtrees T₁ and T₂ are optimal binary search trees for the keys {1, 2, ... r-1} and {r+1, ..., n} respectively.

Dynamic Programming Algorithm

The essence of a dynamic programming algorithm is to express the solution of larger problems as a function of solution of smaller problems. In the optimal binary search tree case, the smaller subproblems can obtained by either throwing away a prefix (the subtree T₁) or a suffix (the subtree T₂) of the original problem.

For an interval of keys 1 ≤ i ≤ j ≤ n, let C_ij be the weighted search cost of an optimal binary search tree for items {i, i+1, ..., j-1, j}. The corresponding key probabilities are p_i, p_i+1, ..., p_j.

The recurrence formula is, for every 1 ≤ i ≤ j ≤ n:

      j   j   
C_ij= min{ ∑ p_k + C_i(r-1) + C_(r+1)j }
     r=i k=i

The algorithm:

Let A be a bidimensional array # A[i,j] represents optimal BST value for items {i,...,j}
for s = 0 to (n-1) # s represents (j-i), so i+s plays the role of j
  for i = 1 to n
    if (i + s > n) continue
                i+s  i+s
    A[i, i+s] = min { ∑ p_k + A[i,r-1] + A[r+i,i+s] } # interpret as 0 if the first index is bigger than the second index
                r=i  k=i
return A[1,n]

Running Time

There are Θ(n²) subproblems. However, the subproblems are NOT Θ(1), some of them are Θ(n) (Θ(j-i) to compute A[i,j] so the overall running time is Θ(n³).

Playground Implementation

https://github.com/ovidiuf/playground/blob/master/learning/stanford-algorithms-specialization/15-dynamic-programming-optimal-binary-search-tree/src/main/java/playground/stanford/obst/Main.java

Optimal Binary Search Trees

Contents