Statistical Clustering Analysis: An Introduction - Clustering Challenges in Biological Network


analysis. For example, in image segmentation, clusters are regions in the image, each of which is considered to "be homogeneous with respect to some image property of interest such as intensity, color or texture" [16]. In variable clustering, a cluster is usually a group of sequences that are associated with each other. For instance, clustering spike train data (sequences of spike time stamps) from multiple brain neurons identifies the associations among those neurons. In some other applications, a cluster may be regarded as a sample from an underlying probability distribution. In the fish data example, the objects in a cluster can be regarded as a random sample from a multivariate distribution.

The goal of clustering analysis also varies with the application. In image processing, clustering analysis is mostly used for detecting the edges of objects [28] and for image segmentation. Image segmentation is a common problem in image processing. It involves taking an image and identifying particular features, such as human figures or a vehicle, for a further purpose such as movement tracking. If properly implemented, clustering analysis can automatically divide an image into regions whose pixels are similar. In some other applications, the goal of clustering analysis may be to infer the underlying distributions that generate the clusters, such as the number of underlying distributions and the parameters of each distribution.
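As a rough illustration of how clustering can divide an image into similar regions, the sketch below groups the pixel intensities of a tiny grayscale image into two clusters. It uses a plain k-means procedure as a stand-in, which is an assumption on our part and not a method taken from the text; the image values, the function name `kmeans_1d`, and the evenly spread initial centers are all hypothetical choices made for the example.

```python
def kmeans_1d(values, k=2, iters=20):
    """Cluster scalar values with basic k-means (k >= 2); returns (centers, labels)."""
    lo, hi = min(values), max(values)
    # Deterministic start (an assumption): spread the k centers evenly over the range.
    centers = [lo + (hi - lo) * j / (k - 1) for j in range(k)]
    labels = [0] * len(values)
    for _ in range(iters):
        # Assignment step: each value goes to its nearest center.
        labels = [min(range(k), key=lambda j: abs(v - centers[j])) for v in values]
        # Update step: each center moves to the mean of its current members.
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            if members:
                centers[j] = sum(members) / len(members)
    return centers, labels


# Hypothetical 4x4 grayscale image: a bright square on a dark background.
image = [
    [10, 12, 11, 13],
    [12, 200, 205, 11],
    [13, 198, 210, 12],
    [11, 10, 12, 14],
]

pixels = [v for row in image for v in row]
centers, labels = kmeans_1d(pixels, k=2)

# Reshape the flat label list back into the image layout.
width = len(image[0])
segmentation = [labels[r * width:(r + 1) * width] for r in range(len(image))]
for row in segmentation:
    print(row)
```

Each pixel ends up with a cluster label, so the bright square and the dark background form two regions. Real segmentation would typically also use color, texture, or pixel position, which this sketch leaves out.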

Readers should note the difference between clustering and classification. Classification is a form of supervised learning: given a collection of labeled objects, we derive a discrimination model, which is later used to assign a class label to a new, unlabeled object. Clustering, a form of unsupervised learning, groups a collection of unlabeled objects into meaningful clusters. After clustering, objects in the same cluster are given the same label, and objects in different clusters are given different labels.
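The contrast between the two tasks can be sketched on one toy dataset. In the first half, class labels are available and a simple nearest-centroid rule (our stand-in for a discrimination model, not one prescribed by the text) labels a new object. In the second half, the labels are withheld and a basic k-means loop groups the same objects. The data, the class names, and the naive choice of initial centers are hypothetical.

```python
import math


def nearest(point, centers):
    """Index of the center closest to point in Euclidean distance."""
    return min(range(len(centers)), key=lambda j: math.dist(point, centers[j]))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))


# Classification (supervised): class labels come with the training objects.
labeled = {
    "small": [(1.0, 1.1), (0.9, 1.0), (1.2, 0.8)],
    "large": [(5.0, 5.2), (4.8, 5.1), (5.3, 4.9)],
}
class_names = list(labeled)
class_centers = [mean(labeled[name]) for name in class_names]
new_object = (4.5, 5.0)
predicted = class_names[nearest(new_object, class_centers)]
print("predicted class:", predicted)

# Clustering (unsupervised): the same objects, with the labels withheld.
unlabeled = [p for pts in labeled.values() for p in pts]
centers = [unlabeled[0], unlabeled[-1]]  # naive initial centers (an assumption)
for _ in range(10):
    # Assign every object to its nearest center, then recompute the centers.
    assignment = [nearest(p, centers) for p in unlabeled]
    for j in range(len(centers)):
        members = [p for p, a in zip(unlabeled, assignment) if a == j]
        if members:
            centers[j] = mean(members)
print("cluster labels:", assignment)
```

The cluster labels are arbitrary integers: clustering recovers the grouping, but unlike classification it has no way of naming a group "small" or "large".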

In this chapter, we introduce clustering analysis mostly from the perspective of multivariate statistics. For the convenience of readers, we also introduce some heuristic methods, which may be needed in applications where it is not appropriate to assume a multivariate probability distribution. We focus on two basic aspects of clustering analysis: clustering itself and determining the number of clusters.

As the definition of clustering analysis indicates, the measure of similarity (or dissimilarity) plays an important role. Before turning to those two topics, we describe measures of similarity (or dissimilarity) between two objects.
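To make the idea of a similarity or dissimilarity measure concrete, the sketch below computes two widely used ones: Euclidean distance (a dissimilarity) and Pearson correlation (a similarity). These are chosen only as familiar examples and are not necessarily the measures the chapter goes on to define; the function names and the three sample objects are hypothetical.

```python
import math


def euclidean_distance(x, y):
    """Dissimilarity: 0 for identical objects, larger for more different ones."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson_correlation(x, y):
    """Similarity in [-1, 1]; undefined (division by zero) for constant vectors."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Hypothetical objects, each described by three features.
a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]  # same shape as a, twice the magnitude
c = [3.0, 2.0, 1.0]  # reversed trend compared with a

print(euclidean_distance(a, b))   # nonzero: a and b differ in magnitude
print(pearson_correlation(a, b))  # close to 1: a and b follow the same trend
print(pearson_correlation(a, c))  # close to -1: a and c follow opposite trends
```

The two measures can disagree: `a` and `b` are far apart in Euclidean distance yet perfectly correlated, so the choice of measure determines which objects a clustering method treats as alike.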

Before moving forward, we introduce the notation that will be used in the remainder of this chapter.

We denote the dataset to be clustered as X, which is an N × P matrix, where P is the number of features (variables) and N is the number of observations. Here, X^T stands for the transpose of X. In observation clustering, observation i is characterized by the i-th row of X, denoted as x_i,
