5.3. Clustering Algorithms
A wide variety of clustering algorithms has been devised to cover different applications. Currently there is no single clustering algorithm that performs well on all datasets. Which clustering algorithm should be chosen depends on the definition of clusters and the size of the dataset in a specific application. For instance, in image segmentation, density-based methods such as the DBSCAN and GDBSCAN algorithms [6, 7] are popular. Some other clustering algorithms, such as CLARA and CLARANS, are devised to cluster large datasets. Wei et al. [31] give a good review of these large-dataset clustering algorithms. For other clustering algorithms, Han and Kamber [13] give good general references.
In this section, for the sake of space, we introduce only four basic but widely used clustering algorithms: the K-means, E-M, hierarchical clustering, and self-organizing map (SOM) algorithms. The K-means algorithm is one of the most popular clustering algorithms. The E-M algorithm is a method widely used in multivariate statistics for missing-value estimation. If we take the label of each object as the missing value in clustering analysis, the E-M algorithm applies. We introduce the E-M algorithm here because it is analogous to the K-means algorithm and provides the statistical background for K-means. We introduce the hierarchical clustering algorithm as one example of heuristic algorithms. The K-means, E-M, and hierarchical clustering algorithms, in the most common case, load all the data into computer memory at the same time for clustering analysis. In contrast, in the SOM algorithm the data enter computer memory one object at a time. This gives it an advantage in handling large datasets or data streams, where either the dataset is too large to be loaded into memory all at once, or the data itself arrives sequentially.
5.3.1. K-Means Algorithm
The K-means algorithm assumes the number of clusters K is known. It works iteratively as follows, where we use t to denote the iteration number:
(1) Randomly select $K$ objects as the initial centers of these $K$ clusters, denoted as $x_1^t, x_2^t, \ldots, x_K^t$; $t = 0$;
(2) For object $i$, calculate $s(x_i, x_j^t)$ or $d(x_i, x_j^t)$, $j = 1, 2, \ldots, K$. Assign a cluster label to object $i$ by
$$CL_i^t = \arg\max_j \left[\, s(x_i, x_j^t),\ j = 1, 2, \ldots, K \,\right]$$
or
$$CL_i^t = \arg\min_j \left[\, d(x_i, x_j^t),\ j = 1, 2, \ldots, K \,\right], \quad i = 1, 2, \ldots, N \tag{5.13}$$
where $CL_i^t$ stands for the cluster label of object $i$ at iteration $t$;
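To make the iteration concrete, the following is a minimal sketch in Python/NumPy. Steps (1) and (2) follow the listing above, using Euclidean distance as the dissimilarity $d(\cdot,\cdot)$; the center-update and stopping steps, which continue the algorithm beyond the excerpt shown here, are the standard ones (each center is recomputed as the mean of its assigned objects until no label changes). The function name k_means and the parameters max_iter and seed are illustrative, not from the text.

import numpy as np

def k_means(X, K, max_iter=100, seed=0):
    """Sketch of the K-means iteration, assuming Euclidean distance as d.

    X : (N, p) array of N objects with p features; K : number of clusters.
    """
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    # Step (1): randomly select K objects as the initial cluster centers.
    centers = X[rng.choice(N, size=K, replace=False)].copy()
    labels = np.full(N, -1)  # dummy labels so the first pass never "converges"
    for t in range(max_iter):
        # Step (2): CL_i = argmin_j d(x_i, x_j^t) for every object i,
        # computing all N x K pairwise distances in one broadcast.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Standard continuation (not shown in the excerpt above): stop when
        # no label changes; otherwise recompute each center as the mean of
        # the objects currently assigned to it.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        for j in range(K):
            if np.any(labels == j):  # keep the old center if cluster j is empty
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

Because K-means is sensitive to the random initial centers in step (1), a common practice is to run the sketch with several seeds and keep the assignment with the smallest total within-cluster distance.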