Find Strongly Connected Components in a Directed Graph: Difference between revisions

From NovaOrdis Knowledge Base
Jump to navigation Jump to search
 
(26 intermediate revisions by the same user not shown)
Line 9: Line 9:
Finding strongly connected components in a directed graph is a form of clustering heuristics: strongly connected components represent clusters where the objects represented by the vertices are clustered in some way.
Finding strongly connected components in a directed graph is a form of clustering heuristics: strongly connected components represent clusters where the objects represented by the vertices are clustered in some way.


[[Graph_Concepts#Strongly_Connected_Component|Strongly connected components]] of a directed graph can be computed with Kosaraju's Two-Pass algorithm, which consists of two passes of [[Graph_Search#DFS_and_Directed_Connectivity_-_Compute_Strong_Components|depth-first search]]. The key idea of the algorithm is that for directed graphs, starting a depth-first search from the "right" nodes discovers strongly connected components, while starting from the "wrong" nodes may discover the entire graph, which is not useful. The first pass of the algorithm computes the right order in which to use the nodes as start nodes in the second depth-first search pass.
[[Graph_Concepts#Strongly_Connected_Component|Strongly connected components]] of a directed graph can be computed with Kosaraju's Two-Pass algorithm, which consists of two passes of [[Graph_Search#DFS_and_Directed_Connectivity_-_Compute_Strong_Components|depth-first search]]. In consequence, the running time is O(n + m). The key idea of the algorithm is that for directed graphs, starting a depth-first search from the "right" nodes discovers strongly connected components, while starting from the "wrong" nodes may discover the entire graph, which is not useful. The first pass of the algorithm computes the right order in which to use the nodes as start nodes in the second depth-first search pass.


=Kosaraju's Two-Pass Algorithm=
=Kosaraju's Two-Pass Algorithm=
The algorithm has three steps:
The algorithm has three steps:


1. Reverse all of the arcs of the given graph. Let G<sup>rev</sup> = G with all arcs reversed.  
1. '''Reverse the Arcs'''. Reverse all of the arcs of the given graph. Let G<sup>rev</sup> = G with all arcs reversed.  


::Note that reversing the arcs helps with reasoning about the problem. A practical implementation can simply run DFS on the original graph G but going backwards, in the opposite direction of the arcs, instead of creating a copy of the graph where the arcs are effectively reversed. The [[Graph_Search#DFS_Recursive_Algorithm_Pseudocode|recursive depth-first search implementation]] can be used as is, with just a minimal modification: when looking for edges to explore, use edges '''incident to''' the current node instead of edges '''incident from''' the current node. A pseudocode implementation will be provided below.
::Note that reversing the arcs helps with reasoning about the problem. A practical implementation can simply run DFS on the original graph G but going backwards, in the opposite direction of the arcs, instead of creating a copy of the graph where the arcs are effectively reversed. The [[Graph_Search#DFS_Recursive_Algorithm_Pseudocode|recursive depth-first search implementation]] can be used as is, with just a minimal modification: when looking for edges to explore, use edges '''incident to''' the current node instead of edges '''incident from''' the current node. A pseudocode implementation will be provided below.


2. Do the first [[Graph_Search#DFS_Recursive_Algorithm_Pseudocode|depth-first search]] pass on the reversed graph G<sup>rev</sup>, from node n to node 1. This pass computes an ordering of nodes that allows the second pass to discover the strongly connected components. The pass labels each node with its "finishing time". Let f(v) = "finishing time" of each v ∈ V. Once f(v) for each node is computed at the end of the first pass, the nodes are relabeled with their finishing time.
2. '''First Pass DFS'''. Do the first depth-first search pass on the reversed graph G<sup>rev</sup>, from node n to node 1. This pass computes an ordering of nodes that allows the second pass to discover the strongly connected components. The pass labels each node with its "finishing time". Let f(v) = "finishing time" of each v ∈ V. Once f(v) for each node is computed at the end of the first pass, the nodes are relabeled with their finishing time.
<font size=-1>
<font size=-1>
  First_Pass(graph G)
  First_Pass(graph G)
Line 29: Line 29:
</font>
</font>


The <code>Reverse_DFS()</code> is similar to the standard recursive DFS implementation, with the exception that it uses the edges incident to the current nodes instead of the edges incident from the current node, and also computes the finishing time for each node:
The <code>Reverse_DFS()</code> is similar to the [[Graph_Search#DFS_Recursive_Algorithm_Pseudocode|standard recursive DFS implementation]], with the exception that it uses the edges incident to the current nodes instead of the edges incident from the current node, and also computes the finishing time for each node:


<font size=-1>
<font size=-1>
Line 37: Line 37:
     if v unexplored:
     if v unexplored:
       Reverse_DFS(G, v)
       Reverse_DFS(G, v)
  <font color=teal># ⚠️ We are currently assuming nodes labeled from 1 to n. For zero-based node labels, set the finishing time first, then increment t</font>
   t++ <font color=teal># Increment the finishing time, in the first pass</font>
   t++ <font color=teal># Increment the finishing time, in the first pass</font>
   set f(s) = t <font color=teal># Set the finishing time on the current node</font>
   set f(s) = t <font color=teal># Set the finishing time on the current node</font>
</font>
</font>


3. '''Second Pass DFS'''. Do the second depth-first search pass again, on the original graph G, with the arcs "restored" in the original direction. The graph vertices must be processed in the decreasing order of the "finishing time" computed during the first pass, so the entire graph must be relabeled and re-ordered using the "finishing time" as vertex labels. This re-labeling should be performed in linear time. The pseudocode of a possible O(n + m) relabelling algorithm that uses additional memory is presented below. Note that by processing the graph we reset the "explored" states on the nodes. The re-labeled graph nodes are all "unexplored".


<font size=-1>
Relabel_With_Finishing_Time(graph G)
  <font color=teal># Allocate a new adj2 adjacency list representation for the entire graph G
  # adj is the current representation carrying the old labeling</font>
  adj2 = new Vertex[adj.size]
  for each o in adj <font color=teal># For each old-labeled vertex</font>
    adj2[o.finishingTime] = new Vertex()
    <font color=teal># Rewrite the adjacency list of the newly create vertex using the finishing time of the old ajacencies</font>
    for i in o.adjacencyList
        adj2[o.finishingTime].addHead(adj[i].finishingTime)
  <font color=teal># Swap adj with adj2</font>
  adj.clear()
  for v in adj2
    adj.add(v)
</font>


Once the graph is re-labeled, the second pass runs depth-first search recursively on unexplored nodes, and discovers the strongly connected components one by one. During this second pass the algorithm labels each node with its "leader" ID, which is the node we start the recursive depth-first search on. The idea is all nodes in the same strongly connected component will be labeled with the same leader node.


3. Do the second [[Graph_Search#DFS_Recursive_Algorithm_Pseudocode|depth-first search]] pass again, on the original graph G. The "right" order is to go through the vertices in the decreasing order of the "finishing time" computed during the first pass. In this pass we use the order computed by the first pass - remember that the nodes have been relabeled according to their computed finishing time - and we discover the strongly connected components one by one. During this second pass we will label each node with its "leader". The idea is all nodes in the same strongly connected component will be labeled with the same leader node.
<font color=darkgray>TODO: rewrite this in a clearer pseudocode with phases clearly separated.</font>
<font size=-1>
<font size=-1>
  DFS_Loop(graph G)
  Second_Pass(graph G)
   global variable t = 0 <font color=teal># Counts the total number of nodes processed so far.
   global variable leader = NULL <font color=teal># Keeps track of the most recent vertex from which the DFS was initiated (the leader).</font>
                        # Used in the first DFS pass to compute the "finishing time".</font>
   <font color=teal># Assume nodes re-labelled 1 to n using finishing times</font>
  global variable s = NULL <font color=teal># Keeps track of the most recent vertex from which the DFS
                            # was initiated (the leader). Used in the second pass.</font>
   Assume nodes labelled 1 to n.
   for i = n down to 1
   for i = n down to 1
    if i not yet explored
    if i not yet explored
      s = i
      leader = i
      DFS(G, i)
      DFS(G, i)
</font>
</font>
<code>DFS()</code> is almost identical to the [[Graph_Search#DFS_Recursive_Algorithm_Pseudocode|standard recursive DFS implementation]], with the exception that it annotates each node it processes with its leader, maintained by <code>Second_Pass(graph G)</code>:


<font size=-1>
<font size=-1>
  DFS(graph G, node i)
  DFS(graph G, start vertex s)
   mark i as explored <font color=teal># Once the node is marked explored, it is explored for the rest of DFS_Loop</font>
   mark s as explored <font color=teal># Once the node is marked explored, it is explored for the rest of Second_Pass() loop</font>
   set leader(i) = s <font color=teal># Mark the node with its "leader" in the second pass</font>
   set leader(s) = leader <font color=teal># Mark the node with its "leader" in the second pass</font>
   for ech arc (i, j) ∈ G
   for every edge '''incident from''' s (s → v):
     if j not yet explored
     if v unexplored:
       DFS(G, j)
       DFS(G, v)
  t ++        <font color=teal># Increment the finishing time, in the first pass</font>
  set f(i) = t <font color=teal># Finishing time of i, in the first pass</font>
</font>
</font>
At the end of the second loop, each node is annotate with its leader.
The state of the graph allows to linearly scan the list of node, identify distinct leaders and compute the SCCs.
==Playground==
{{External|https://github.com/ovidiuf/playground/tree/master/learning/stanford-algorithms-specialization/05-strongly-conected-components/src/main/java/playground/stanford/scc}}


=Proof=
=Proof=
{{External|https://www.coursera.org/learn/algorithms-graphs-data-structures/lecture/QFOFt/computing-strong-components-the-analysis}}
{{External|https://www.coursera.org/learn/algorithms-graphs-data-structures/lecture/QFOFt/computing-strong-components-the-analysis}}

Latest revision as of 01:46, 16 October 2021

External

Internal

Overview

Finding strongly connected components in a directed graph is a form of clustering heuristics: strongly connected components represent clusters where the objects represented by the vertices are clustered in some way.

Strongly connected components of a directed graph can be computed with Kosaraju's Two-Pass algorithm, which consists of two passes of depth-first search. In consequence, the running time is O(n + m). The key idea of the algorithm is that for directed graphs, starting a depth-first search from the "right" nodes discovers strongly connected components, while starting from the "wrong" nodes may discover the entire graph, which is not useful. The first pass of the algorithm computes the right order in which to use the nodes as start nodes in the second depth-first search pass.

Kosaraju's Two-Pass Algorithm

The algorithm has three steps:

1. Reverse the Arcs. Reverse all of the arcs of the given graph. Let Grev = G with all arcs reversed.

Note that reversing the arcs helps with reasoning about the problem. A practical implementation can simply run DFS on the original graph G but going backwards, in the opposite direction of the arcs, instead of creating a copy of the graph where the arcs are effectively reversed. The recursive depth-first search implementation can be used as is, with just a minimal modification: when looking for edges to explore, use edges incident to the current node instead of edges incident from the current node. A pseudocode implementation will be provided below.

2. First Pass DFS. Do the first depth-first search pass on the reversed graph Grev, from node n to node 1. This pass computes an ordering of nodes that allows the second pass to discover the strongly connected components. The pass labels each node with its "finishing time". Let f(v) = "finishing time" of each v ∈ V. Once f(v) for each node is computed at the end of the first pass, the nodes are relabeled with their finishing time.

First_Pass(graph G)
  global variable t = 0 # Counts the total number of nodes processed so far in the reverse DFS routine. 
                        # Used by the reverse DFS routine to compute the "finishing time".
  # Assume nodes labelled 1 to n
  for i = n down to 1
   if i not yet explored
     Reverse_DFS(G, i)

The Reverse_DFS() is similar to the standard recursive DFS implementation, with the exception that it uses the edges incident to the current nodes instead of the edges incident from the current node, and also computes the finishing time for each node:

Reverse_DFS(graph G, start vertex s)
  mark s as explored
  for every edge incident to s (s ← v):
    if v unexplored:
      Reverse_DFS(G, v)
  # ⚠️ We are currently assuming nodes labeled from 1 to n. For zero-based node labels, set the finishing time first, then increment t
  t++ # Increment the finishing time, in the first pass
  set f(s) = t # Set the finishing time on the current node

3. Second Pass DFS. Do the second depth-first search pass again, on the original graph G, with the arcs "restored" in the original direction. The graph vertices must be processed in the decreasing order of the "finishing time" computed during the first pass, so the entire graph must be relabeled and re-ordered using the "finishing time" as vertex labels. This re-labeling should be performed in linear time. The pseudocode of a possible O(n + m) relabelling algorithm that uses additional memory is presented below. Note that by processing the graph we reset the "explored" states on the nodes. The re-labeled graph nodes are all "unexplored".

Relabel_With_Finishing_Time(graph G)
  # Allocate a new adj2 adjacency list representation for the entire graph G
  # adj is the current representation carrying the old labeling
  adj2 = new Vertex[adj.size]
  for each o in adj # For each old-labeled vertex
    adj2[o.finishingTime] = new Vertex() 
    # Rewrite the adjacency list of the newly create vertex using the finishing time of the old ajacencies
    for i in o.adjacencyList 
       adj2[o.finishingTime].addHead(adj[i].finishingTime)
  # Swap adj with adj2
  adj.clear()
  for v in adj2
    adj.add(v)

Once the graph is re-labeled, the second pass runs depth-first search recursively on unexplored nodes, and discovers the strongly connected components one by one. During this second pass the algorithm labels each node with its "leader" ID, which is the node we start the recursive depth-first search on. The idea is all nodes in the same strongly connected component will be labeled with the same leader node.

Second_Pass(graph G)
  global variable leader = NULL # Keeps track of the most recent vertex from which the DFS was initiated (the leader).
  # Assume nodes re-labelled 1 to n using finishing times
  for i = n down to 1
   if i not yet explored
     leader = i
     DFS(G, i)

DFS() is almost identical to the standard recursive DFS implementation, with the exception that it annotates each node it processes with its leader, maintained by Second_Pass(graph G):

DFS(graph G, start vertex s)
  mark s as explored # Once the node is marked explored, it is explored for the rest of Second_Pass() loop
  set leader(s) = leader # Mark the node with its "leader" in the second pass
  for every edge incident from s (s → v):
    if v unexplored:
      DFS(G, v)

At the end of the second loop, each node is annotate with its leader.

The state of the graph allows to linearly scan the list of node, identify distinct leaders and compute the SCCs.

Playground

https://github.com/ovidiuf/playground/tree/master/learning/stanford-algorithms-specialization/05-strongly-conected-components/src/main/java/playground/stanford/scc

Proof

https://www.coursera.org/learn/algorithms-graphs-data-structures/lecture/QFOFt/computing-strong-components-the-analysis