Data clustering

Data clustering is a common technique for statistical data analysis, which is used in many fields, including machine learning, data mining, pattern recognition, image analysis and bioinformatics. Clustering is the classification of similar objects into different groups, or more precisely, the partitioning of a data set into subsets (clusters), so that the data in each subset (ideally) share some common trait - often proximity according to some defined distance measure.

Besides the term data clustering, there are a number of terms with similar meanings, including cluster analysis, automatic classification, numerical taxonomy, botryology and typological analysis.

Data clustering algorithms can be hierarchical or partitional. With hierarchical algorithms, successive clusters are found using previously established clusters, whereas partitional algorithms determine all clusters in one go. Hierarchical algorithms can be agglomerative (bottom-up) or divisive (top-down). Agglomerative algorithms begin with each element as a separate cluster and merge them in successively larger clusters. The divisive algorithms begin with the whole set and proceed to divide it into successively smaller clusters.

Contents

1 Hierarchical clustering

1.1 Introduction
1.2 Agglomerative hierarchical clustering

2 Partitional clustering

2.1 k-means and derivatives

2.1.1 k-means clustering
2.1.2 Fuzzy c-means clustering

3 Applications

4 External links

5 See also

Hierarchical clustering

Introduction

Hierarchical clustering builds a hierarchy of clusters. The traditional representation of this hierarchy is a tree, with individual elements at one end and a single cluster with every element at the other. Agglomerative algorithms begin at the top of the tree, whereas divisive algorithms begin at the bottom. (In the figure, the arrows indicate an agglomerative clustering.)

Missing image
Hierarchical_clustering_diagram.png
Traditional representation

Cutting the tree at a given height will give a clustering at a selected precision. In the above example, cutting after the second row will yield clusters {a} {b c} {d e} {f}. Cutting after the third row will yield clusters {a} {b c} {d e f}, which is a coarser clustering, but with fewer clusters.

Agglomerative hierarchical clustering

This method builds the hierarchy from the individual elements by progressively merging clusters. Again, we have six elements {a} {b} {c} {d} {e} and {f}. the first step is to determine which elements to merge in a cluster. Usually, we want to take the two closest elements, therefore we must define a distance <math>d(\mathrm{element}_1,\mathrm{element}_2)<math> between elements.

Suppose we have merged the two closest elements b and c, we now have the following clusters {a}, {b, c}, {d}, {e} and {f}, and want to merge them further. But to do that, we need to take the distance between {a} and {b c}, and therefore define the distance between two clusters. Usually the distance between two clusters <math>\mathcal{A}<math> and <math>\mathcal{B}<math> is one of the following:

the maximum distance between elements of each cluster (also called complete linkage clustering) <math> \max_{x \in \mathcal{A},\, y \in \mathcal{B}} d(x,y) <math>
the minimum distance between elements of each cluster (also called single linkage clustering) <math> \min_{x \in \mathcal{A},\, y \in \mathcal{B}} d(x,y) <math>
the mean distance between elements of each cluster (also called average linkage clustering) <math> {1 \over {\mathrm{card}(\mathcal{A})\mathrm{card}(\mathcal{B})}}\sum_{x \in \mathcal{A}}\sum_{ y \in \mathcal{B}} d(x,y) <math>
the sum of all intra cluster variance
the increase in variance for the cluster being merged (Ward's criterion)

Each agglomeration occurs at a greater distance between clusters than the previous agglomeration, and one can decide to stop clustering either when the clusters are too far apart to be merged (distance criterion) or when there is a sufficiently small number of clusters (number criterion).

Partitional clustering

k-means and derivatives

k-means clustering

The k-means algorithm assigns each point to the cluster whose center (or centroid) is nearest. The centroid is the point generated by computing the arithmetic mean for each dimension separately for all the points in the cluster.

Example: The data set has three dimensions and the cluster has two points: X = (x₁, y₁, z₁) and Y = (x₂, y₂, z₂). Then the centroid Z becomes Z = (x₃, y₃, z₃), where x₃ = (x₁ + x₂)/2 and y₃ = (y₁ + y₂)/2 and z₃ = (z₁ + z₂)/2

This is the basic structure of the algorithm (J. MacQueen, 1967):

Randomly generate k clusters and determine the cluster centers or directly generate k seed points as cluster centers
Assign each point to the nearest cluster center.
Recompute the new cluster centers.
Repeat until some convergence criterion is met (usually that the assignment hasn't changed).

The main advantages of this algorithm are its simplicity and speed, which allows it to run on large datasets. Yet it does not systematically yield the same result with each run of the algorithm. Rather, the resulting clusters depend on the initial assignments. The k-means algorithm maximizes inter-cluster (or minimizes intra-cluster) variance, but does not ensure that the solution given is not a local minimum of variance.

Fuzzy c-means clustering

One of the problems of the k-means algorithm is that it gives a hard partitioning of the data, that is to say that each point is attributed to one and only one cluster. But points on the edge of the cluster, or near another cluster may not be as much in the cluster as point in the center of cluster.

Therefore, in fuzzy clustering, each point does not pertain to a given cluster, but has a degree of belonging to a certain cluster, as in fuzzy logic. For each point x we have a coefficient giving the degree of being in the kth cluster <math>u_k(x)<math>. Usually, the sum of those coefficients has to be one, so that <math>u_k(x)<math> denotes a probability of belonging to a certain cluster:

<math> \forall x \sum_{k=1}^{\mathrm{num.}\ \mathrm{clusters}} u_k(x) \ =1.<math>

With fuzzy c-means, the centroid of a cluster is computed as being the mean of all points, weighted by their degree of belonging to the cluster, that is

<math>\mathrm{center}_k = {{\sum_x u_k(x) x} \over {\sum_x u_k(x)}}.<math>

The degree of being in a certain cluster is the inverse of the distance to the cluster

<math>u_k(x) = {1 \over d(\mathrm{center}_k,x)},<math>

then the coefficients are normalized and fuzzyfied with a real parameter <math>m>1<math> so that their sum is 1. So <math>u_k(x) = \frac{1}{\sum_j \left(\frac{d(\mathrm{center}_k,x)}{d(\mathrm{center}_j,x)}\right)^{\frac{1}{m-1}}}<math>

The fuzzy c-means algorithm is greatly similar to the k-means algorithm; when <math>m\leftarrow 1<math>, algorithms are equivalent:

Choose a number of clusters
Assign randomly to each point coefficients for being in the clusters
Repeat until the algorithm has converged (that is, the coefficients' change between two iterations is no more than <math>\epsilon<math>, the given sensitivity threshold) :
- Compute the centroid for each cluster, using the formula above
- For each point, compute its coefficients of being in the clusters, using the formula above

The fuzzy c-means algorithm minimizes intra-cluster variance as well, but has the same problems as k-means, the minimum is local minimum, and the results depend on the initial choice of weights.

Applications

In biology has two main applications in the fields of computational biology and bioinformatics.

In transcriptomics, clustering is used to build groups of genes with related expression patterns. Often such groups contain functionally related proteins, such as enzymes for a specific pathway, or genes that are co-regulated. High throughput experiments using expressed sequence tags (ESTs) or DNA microarrays can be a powerful tool for genome annotation, a general aspect of genomics.
In sequence analysis, clustering is used to group homologous sequences into gene families. This is a very important concept in bioinformatics, and evolutionary biology in general. See evolution by gene duplication.

External links

Jain, Murty and Flynn: Data Clustering: A Review, ACM Comp. Surv, 1999. Available here (http://citeseer.ist.psu.edu/jain99data.html)
for another presentation of hierarchical, k-means and fuzzy c-means see this introduction to clustering (http://www.elet.polimi.it/upload/matteucc/Clustering/tutorial_html/index.html). Also has an explanation on mixture of Gaussians.
http://members.tripod.com/asim_saeed/paper.htm

Data clustering

Hierarchical clustering

Introduction

Agglomerative hierarchical clustering

Partitional clustering

k-means and derivatives

k-means clustering

Fuzzy c-means clustering

Applications

External links

See also

Navigation

Search

Toolbox

Personal tools