Kruskal's Algorithm

From NovaOrdis Knowledge Base

External

Internal

Overview

Kruskal's algorithm is a greedy algorithm that computes a minimum cost spanning tree of an undirected connected graph. Its approach differs from Prim's algorithm: instead of growing a single connected explored territory one vertex at a time, it grows many small trees in parallel, unconnected at first, which eventually merge into the minimum cost spanning tree. It does this by sorting the edges in ascending order of their cost and picking edges in that order, one by one, taking care not to create cycles. The greediness of the algorithm comes from the fact that at each step it chooses the edge that minimizes a greedy score, which is simply the cost of the edge.

KruskalAlgorithm.png

Non-Optimized Implementation

sort edges in the order of increasing cost
initialize T = ∅ # the tree we are growing
for edge e = (u,v) in the order of increasing cost: # at most m iterations
  if T ⋃ e has no cycles # there is no path in T from u to v
    add e to T

We could count the edges added to the spanning tree, and once we have n-1 edges, which means the spanning tree is complete, we can abort the loop.

One critical step of the algorithm is checking whether a new edge e=(u,v) introduces a cycle in T, and this can be checked in several ways. The non-optimized way is to run a graph search, BFS or DFS, starting from u and attempting to find a path from u to v: if such a path exists, adding e would close a cycle. Note that we don't need to run a full BFS or DFS over the whole graph; we only need to test connectivity between the nodes included in the partial spanning trees discovered so far, which reduces the running time of the search.
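The non-optimized variant above can be sketched in Python. This is an illustrative implementation, not from the original notes: the cycle check is a DFS restricted to the edges already in T, and vertices are assumed to be labeled 0..n-1.

```python
from collections import defaultdict

def kruskal_naive(n, edges):
    """Minimum cost spanning tree of an undirected connected graph.

    n: number of vertices, labeled 0..n-1
    edges: list of (cost, u, v) tuples
    Returns the list of MST edges.
    """
    adj = defaultdict(list)   # adjacency list of the partial tree T
    tree = []

    def connected(u, v):
        # DFS restricted to edges already in T; T has at most
        # n - 1 edges, so this is O(n) rather than O(n + m)
        stack, seen = [u], {u}
        while stack:
            x = stack.pop()
            if x == v:
                return True
            for y in adj[x]:
                if y not in seen:
                    seen.add(y)
                    stack.append(y)
        return False

    for cost, u, v in sorted(edges):       # sorting: O(m log n)
        if not connected(u, v):            # adding (u, v) creates no cycle
            tree.append((cost, u, v))
            adj[u].append(v)
            adj[v].append(u)
            if len(tree) == n - 1:         # spanning tree complete, abort early
                break
    return tree
```

Sorting tuples works here because Python compares them by the first element, the cost; the early abort after n-1 edges is the optimization mentioned above.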

A better way would be to use a union-find.

Non-Optimized Implementation Running Time

Sorting the edges will take O(m log m) time. However, since we assume that the number of edges is at most quadratic in the number of vertices, m = O(n²), so O(log m) = O(log n²) = O(2 log n) = O(log n). In consequence, the running time of the edge sorting step is O(m log n).

The main loop has at most m iterations. In each iteration, we need to ensure that the current edge e=(u,v) does not form a cycle in T, which means we need to check that there is no path in T from u to v.

The non-optimized solution checks whether there is a path from u to v using either BFS or DFS starting from u. In their general form, both methods have a running time of O(n + m), linear in the number of vertices and edges. However, in this case we only need to search through a limited number of edges, those already in T. T has at most n vertices, so it has at most n-1 = O(n) edges, and the O(n + m) running time of the general case becomes O(n). The search is repeated m times, so the "growing" phase of the spanning tree is O(m n).

The total running time of the non-optimized algorithm is O(m log n) + O(m n) = O(m n). This is the same running time as for the non-optimized implementation of Prim's algorithm.

Correctness Proof

https://www.coursera.org/learn/algorithms-greedy/lecture/U3ukN/correctness-of-kruskals-algorithm

Optimized Implementation

https://www.coursera.org/learn/algorithms-greedy/lecture/e0TJP/implementing-kruskals-algorithm-via-union-find-i
https://www.coursera.org/learn/algorithms-greedy/lecture/TvDMg/implementing-kruskals-algorithm-via-union-find-ii

Kruskal's algorithm can be optimized by speeding up the cycle check, which is repeated in each of the m iterations of the edge-by-edge loop. This is achieved by using a union-find-like data structure, in which each vertex keeps a pointer to its connected component's "leader".

Checking for connectivity in this union-find-like data structure is constant time: two vertices belong to the same component if and only if they have the same leader. A union operation relabels the vertices of the smaller component, and since each vertex's leader changes at most O(log n) times over the whole run, all union operations combined take O(n log n) time. The running time of the loop is therefore O(m log n), no worse than the edge sorting phase, which gives an overall running time of O(m log n).
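A sketch of the optimized variant, under the assumptions above: an eager union-find where each vertex stores a direct pointer to its component's leader, and unions relabel the smaller component. Names and structure are illustrative, not from the course.

```python
def kruskal_union_find(n, edges):
    """Kruskal's algorithm with an eager leader-pointer union-find.

    n: number of vertices, labeled 0..n-1
    edges: list of (cost, u, v) tuples
    Returns the list of MST edges.
    """
    leader = list(range(n))             # leader[v] = leader of v's component
    members = [[v] for v in range(n)]   # vertices grouped under each leader
    tree = []

    for cost, u, v in sorted(edges):    # sorting: O(m log n)
        lu, lv = leader[u], leader[v]
        if lu != lv:                    # O(1) cycle check via leader pointers
            tree.append((cost, u, v))
            # union: relabel the smaller component; each vertex is
            # relabeled at most O(log n) times, so all unions combined
            # cost O(n log n)
            if len(members[lu]) < len(members[lv]):
                lu, lv = lv, lu
            for w in members[lv]:
                leader[w] = lu
            members[lu].extend(members[lv])
            members[lv] = []
    return tree
```

Always merging the smaller component into the larger one is what bounds the number of relabelings per vertex: a vertex's component at least doubles in size each time its leader changes.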

Incomplete, continue.