Maximize K-Clustering Spacing: Difference between revisions
(Created page with "=Internal= * Clustering Concepts =Overview=") |
|||
(15 intermediate revisions by the same user not shown) | |||
Line 1: | Line 1: | ||
=External= | |||
* https://www.coursera.org/learn/algorithms-greedy/lecture/QWubN/application-to-clustering | |||
=Internal= | =Internal= | ||
* [[Clustering_Concepts#Clustering_Algorithms|Clustering Concepts]] | * [[Clustering_Concepts#Clustering_Algorithms|Clustering Concepts]] | ||
=Overview= | =Overview= | ||
This is a greedy algorithm that aims to maximize [[Clustering_Concepts#Spacing_of_a_K-Clustering|spacing]] between any two clusters. By spacing between two clusters we mean the minimum distance between any two separated points, which belong to two distinct clusters: | |||
<font size=-1> | |||
min<sub>separated p, q</sub>d(p,q) | |||
</font> | |||
"Good" clustering means that all of the separated points should be as far apart as possible. | |||
=Algorithm= | |||
The conceptual pseudocode to achieve k-clustering with maximum possible spacing is: | |||
<font size=-1> | |||
initially each point is a separate cluster | |||
repeat until only k clusters: | |||
let p,q = closest pair of separated points <font color=teal># determines the current spacing</font> | |||
merge the clusters containing p and q into a single cluster | |||
</font> | |||
This algorithm is a [[Clustering_Concepts#Single-Link_Clustering|single-link clustering]] algorithm, in that it refines the clustering by fusing clusters one point at a time. It is very similar in structure to the [[Kruskal's Algorithm|Kruskal's algorithm]]. A practical implementation that sorts the distances in a pre-processing steps, in a similar manner to how the Kruskal's algorithm does it, and then manages clusters with an [[Union-Find#Overview|union-find]] is: | |||
<font size=-1> | |||
initialize union-find U <font color=teal># each point belongs initially to its own separate one-point cluster</font> | |||
sort distances <font color=teal># distances between point are sorted from smallest to largest</font> | |||
for d=(p,q) in sorted_distances: <font color=teal># process distances from smallest to largest</font> | |||
if U.find(p) != U.find(q): <font color=teal># points belong to different clusters</font> | |||
if U.clusterCount() == k: | |||
<font color=teal># we reached the target cluster count, return the spacing between clusters</font> | |||
return d | |||
else: | |||
U.union(cluster(p), cluster(q)) <font color=teal># merge the clusters</font> | |||
</font> | |||
==Implementation== | |||
{{External|https://github.com/ovidiuf/playground/blob/master/learning/stanford-algorithms-specialization/11-clustering-spacing/src/main/java/playground/stanford/clustering/MaximizeKClusteringSpacing.java}} | |||
=Correctness Proof= | |||
{{External|https://www.coursera.org/learn/algorithms-greedy/lecture/7lWTf/correctness-of-clustering-algorithm}} |
Latest revision as of 04:30, 24 October 2021
External
Internal
Overview
This is a greedy algorithm that aims to maximize spacing between any two clusters. By spacing between two clusters we mean the minimum distance between any two separated points, which belong to two distinct clusters:
minseparated p, qd(p,q)
"Good" clustering means that all of the separated points should be as far apart as possible.
Algorithm
The conceptual pseudocode to achieve k-clustering with maximum possible spacing is:
initially each point is a separate cluster repeat until only k clusters: let p,q = closest pair of separated points # determines the current spacing merge the clusters containing p and q into a single cluster
This algorithm is a single-link clustering algorithm, in that it refines the clustering by fusing clusters one point at a time. It is very similar in structure to the Kruskal's algorithm. A practical implementation that sorts the distances in a pre-processing steps, in a similar manner to how the Kruskal's algorithm does it, and then manages clusters with an union-find is:
initialize union-find U # each point belongs initially to its own separate one-point cluster sort distances # distances between point are sorted from smallest to largest for d=(p,q) in sorted_distances: # process distances from smallest to largest if U.find(p) != U.find(q): # points belong to different clusters if U.clusterCount() == k: # we reached the target cluster count, return the spacing between clusters return d else: U.union(cluster(p), cluster(q)) # merge the clusters