5.3. Clustering Algorithms
A wide variety of clustering algorithms has been devised to cover different applications. Currently there is no single clustering algorithm that performs well on all datasets. Which clustering algorithm should be chosen depends on the definition of clusters and the size of the dataset in a specific application. For instance, in image segmentation, density-based methods such as the DBSCAN and GDBSCAN algorithms [6, 7] are popular. Some other clustering algorithms, such as CLARA and CLARANS, are devised to cluster large datasets. Wei et al. [31] give a good review of these large-dataset clustering algorithms. For other clustering algorithms, Han and Kamber [13] give good general references.
In this section, for the sake of space, we introduce only four basic but widely used clustering algorithms: the K-means, E-M, hierarchical clustering, and self-organizing map (SOM) algorithms. The K-means algorithm is one of the most popular clustering algorithms. The E-M algorithm is a method widely used in multivariate statistics for missing-value estimation. If we take the label of each object as the missing value in clustering analysis, the E-M algorithm applies. We introduce the E-M algorithm here because it is analogous to the K-means algorithm and provides the statistical background for K-means. We introduce the hierarchical clustering algorithm as one example of heuristic algorithms. The K-means, E-M, and hierarchical clustering algorithms, in the most common case, load all the data into computer memory at the same time for clustering analysis. In contrast, in the SOM algorithm the data enter computer memory one object at a time. This gives it an advantage in handling large datasets or data streams, where either the dataset is too large to be loaded into memory all at once, or the data itself arrives sequentially.
5.3.1. K-Means Algorithm
The K-means algorithm assumes the number of clusters K is known. It works iteratively as follows, where we use t to denote the iteration number:
(1) Randomly select $K$ objects as the initial centers of these $K$ clusters, denoted as $x_1^t, x_2^t, \ldots, x_K^t$; $t = 0$;
(2) For object $i$, calculate $s(x_i, x_j^t)$ or $d(x_i, x_j^t)$, $j = 1, 2, \ldots, K$. Assign a cluster label to object $i$ by
$$CL_i^t = \arg\max_j \left[\, s(x_i, x_j^t),\ j = 1, 2, \ldots, K \,\right]$$
or
$$CL_i^t = \arg\min_j \left[\, d(x_i, x_j^t),\ j = 1, 2, \ldots, K \,\right], \quad i = 1, 2, \ldots, N \tag{5.13}$$
where $CL_i^t$ stands for the cluster label of object $i$ at iteration $t$;
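To make the iteration concrete, the following is a minimal sketch in Python/NumPy. Steps (1) and (2) follow the listing above, using Euclidean distance as the dissimilarity $d(\cdot,\cdot)$; the center-update and stopping steps, which continue the algorithm beyond the excerpt shown here, are the standard ones (each center is recomputed as the mean of its assigned objects until no label changes). The function name k_means and the parameters max_iter and seed are illustrative, not from the text.

import numpy as np

def k_means(X, K, max_iter=100, seed=0):
    """Sketch of the K-means iteration, assuming Euclidean distance as d.

    X : (N, p) array of N objects with p features; K : number of clusters.
    """
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    # Step (1): randomly select K objects as the initial cluster centers.
    centers = X[rng.choice(N, size=K, replace=False)].copy()
    labels = np.full(N, -1)  # dummy labels so the first pass never "converges"
    for t in range(max_iter):
        # Step (2): CL_i = argmin_j d(x_i, x_j^t) for every object i,
        # computing all N x K pairwise distances in one broadcast.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Standard continuation (not shown in the excerpt above): stop when
        # no label changes; otherwise recompute each center as the mean of
        # the objects currently assigned to it.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        for j in range(K):
            if np.any(labels == j):  # keep the old center if cluster j is empty
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

Because K-means is sensitive to the random initial centers in step (1), a common practice is to run the sketch with several seeds and keep the assignment with the smallest total within-cluster distance.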