Clustering Concepts
Overview
We talk about "clustering" when we have a set of n "points", which we may think of as points in space in a geometric sense. It is actually quite rare that the underlying problem we care about is intrinsically geometric. Usually the points represent something else we care about (web pages, genome sequence fragments, etc.) and we want to cluster them into coherent groups. In machine learning, the same problem is referred to as "unsupervised learning": the data is unlabeled, and we are looking for patterns in it.
Distance Function
Clustering problems require a similarity measure, which is a function that, for any two objects, returns a numerical result expressing how similar (or dissimilar) those objects are. We refer to this function as the distance function. The distance function d(p,q) returns the distance between points p and q.
The distance function is symmetric: d(p,q) = d(q,p).
Examples of distance functions (the first two are sketched in code after this list):
- Euclidean distance between two points in space.
- Hamming distance.
- the penalty of the best alignment between two genome fragments.
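To make the first two examples concrete, here is a minimal Python sketch of a Euclidean distance and a Hamming distance; the input representations (coordinate tuples, equal-length strings) are illustrative assumptions, not something this article prescribes.

```python
import math

def euclidean_distance(p, q):
    # Euclidean distance between two points given as coordinate tuples.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def hamming_distance(s, t):
    # Hamming distance: the number of positions at which two
    # equal-length sequences differ.
    if len(s) != len(t):
        raise ValueError("Hamming distance requires equal-length inputs")
    return sum(a != b for a, b in zip(s, t))

# Both functions are symmetric: d(p,q) == d(q,p).
print(euclidean_distance((0, 0), (3, 4)))      # 5.0
print(hamming_distance("GATTACA", "GACTATA"))  # 2
```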
Coherent Groups
A coherent group in the universe of points is a set of points that have small distances from each other, so they are "similar". Such points are considered to be "nearby" each other, and thus form a cluster.
Objective Function
A way to evaluate how good a clustering is consists of defining an objective function and seeking out the clustering that optimizes it. This approach is called optimization. There is more than one objective function; known objective functions include the spacing objective function, the K-Means objective function, etc.:
Spacing of a K-Clustering
The objective function in this case is called the spacing of a k-clustering and is the minimum distance between two separated points, i.e. points that belong to two distinct clusters.
\text{spacing} = \min_{\text{separated } p,\,q} d(p,q)
"Good" clustering means that all of the separated points should be as far apart as possible, the spacing to be big, bigger the better. This is an example of a greedy algorithm that identifies k cluster with the maximum spacing possible:
K-Means
External: https://www.saedsayad.com/clustering_kmeans.htm

K-Means clustering aims to partition n objects into k clusters in which each object belongs to the cluster with the nearest mean. This method produces exactly k different clusters of greatest possible distinction. The best number of clusters k, leading to the greatest separation (distance), is not known a priori and must be computed from the data. The objective of K-Means clustering is to minimize the total intra-cluster variance, i.e. the squared error objective function J:
J = \sum_{j=1}^{k} \sum_{i=1}^{n} \left\| x_i^{(j)} - c_j \right\|^2
where:
- k is the number of clusters
- n is the number of cases
- x_i^{(j)} is case i of cluster j
- c_j is the centroid for cluster j
- ‖...‖ is the distance function
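A minimal sketch of how J can be driven down in practice, using a Lloyd-style alternation between nearest-mean assignment and centroid recomputation (the function names, the random initialization, and the fixed iteration count are illustrative assumptions, not part of this article):

```python
import random

def squared_dist(p, q):
    # Squared Euclidean distance between two coordinate tuples.
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iterations=100, seed=0):
    # Lloyd-style K-Means: alternate between (1) assigning each point to
    # the cluster with the nearest mean and (2) recomputing each mean.
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    assignments = [0] * len(points)
    for _ in range(iterations):
        assignments = [min(range(k), key=lambda j: squared_dist(x, centroids[j]))
                       for x in points]
        for j in range(k):
            members = [x for x, a in zip(points, assignments) if a == j]
            if members:  # keep the old centroid if the cluster went empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, assignments

def objective(points, assignments, centroids):
    # The squared error objective J = sum_j sum_i ||x_i^(j) - c_j||^2.
    return sum(squared_dist(x, centroids[j]) for x, j in zip(points, assignments))

points = [(1, 1), (1, 2), (8, 8), (9, 8)]
centroids, assignments = kmeans(points, k=2)
print(assignments)                                # two clear groups, e.g. [0, 0, 1, 1]
print(objective(points, assignments, centroids))  # the minimized J for this grouping
```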
The Clustering Problem
The clustering problem consists of grouping a set of n objects into k clusters in such a way as to optimize an objective function, given a specific distance function between any pair of objects.
Single-Link Clustering
Single-link clustering is a class of algorithms that identifies clusters by fusing components one at a time. Maximize K-Clustering Spacing is an example of single-link clustering.
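A minimal sketch of the single-link idea (a naive implementation, with names of my own choosing): start with every point in its own cluster and repeatedly fuse the two clusters whose closest pair of points is nearest, until k clusters remain.

```python
def single_link_clustering(points, k, d):
    # Start with each point in its own cluster.
    clusters = [[p] for p in points]
    while len(clusters) > k:
        # Find the pair of clusters with the smallest closest-pair distance.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                dist = min(d(p, q) for p in clusters[i] for q in clusters[j])
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        # Fuse the two closest clusters into one.
        _, i, j = best
        clusters[i] += clusters.pop(j)
    return clusters

print(single_link_clustering([1, 2, 7, 8, 15], 3, lambda p, q: abs(p - q)))
# [[1, 2], [7, 8], [15]]
```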
Clustering Algorithms
There are many clustering algorithms, some of which are greedy. Examples of clustering algorithms:
- Maximize K-Clustering Spacing