9.1. Introduction
Clustering is the task of finding groups of data points such that the points in each
group are similar to each other according to their attributes (dimensions). Clustering
has been studied extensively by both the statistics and database communities
because of its numerous applications in problems such as classification, machine
learning, data mining, indexing, trend analysis, customer segmentation, and pattern
recognition. Most well-known clustering algorithms proposed in recent years,
such as CLARANS [12], DBSCAN [7], and CURE [9], cluster data points based
on a distance function that uses all dimensions of the data. They have been shown
to work well on low-dimensional datasets. However, as the dimensionality grows,
these algorithms lose efficiency and accuracy because of the inherent sparsity of
the data [3]. It is shown in [5] that the distance of
a point to its nearest neighbor approaches the distance to its farthest neighbor as
dimensionality increases, which means that distances computed over the full set of
dimensions cease to be meaningful in high-dimensional space. In fact, natural clusters
may exist only in subspaces; that is, data points may be close to each other
in a few dimensions, but not in all of them. Furthermore, data points in different
clusters may be correlated with respect to different subsets of dimensions. Feature
selection [11] is one technique proposed to deal with high-dimensional data:
it reduces the dimensionality of the space before clustering. However, by choosing
a single subset of dimensions ahead of time it may lose information, and it cannot
select different subsets of dimensions for different clusters. Another solution
is dimension reduction. One widely used
approach is principal component analysis [6], which projects all data points onto
the same lower-dimensional subspace so as to minimize the loss of information. One
drawback is that the reduced dimensions can be difficult to interpret, whereas
most applications require simple descriptions of the clusters. Moreover, it does not
handle well the case where clusters exist in different subspaces of the full dimensions.
In Fig. 9.1, we illustrate two projections on different dimensions for a
set of points in 3-dimensional space. There is clearly no cluster in Fig. 9.1(a).
However, when the data points are projected onto the x-z plane, clusters P and
Q are discovered in Fig. 9.1(b), and clusters R and S are found in Fig. 9.1(c) when
the points are projected onto the y-z plane.
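The idea behind Fig. 9.1 can be sketched with a small example. The sketch below (the cluster, coordinates, and parameters are hypothetical, chosen only for illustration, not taken from the figure) generates a cluster that is tight in the x and z dimensions but noisy in y, so its average pairwise distance is small in the x-z projection yet large over all three dimensions:

```python
import random
import math

random.seed(0)

def spread(points, dims):
    """Average pairwise Euclidean distance using only the given dimensions."""
    total, pairs = 0.0, 0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            total += math.sqrt(sum((points[i][d] - points[j][d]) ** 2
                                   for d in dims))
            pairs += 1
    return total / pairs

# Hypothetical cluster P: tight in x and z, but y is wide uniform noise,
# so P only looks like a cluster when projected onto the x-z plane.
P = [(random.gauss(2.0, 0.1),    # x: tight around 2.0
      random.uniform(-50, 50),   # y: pure noise
      random.gauss(5.0, 0.1))    # z: tight around 5.0
     for _ in range(40)]

print(spread(P, (0, 2)))     # x-z projection: small spread, cluster visible
print(spread(P, (0, 1, 2)))  # full 3-D distance: dominated by the noisy y
```

This is why a distance function over all dimensions (as in the algorithms above) would miss such a cluster, while a projection onto the right subspace reveals it.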
Projected clustering has been proposed recently to deal effectively with high
dimensionality. The projected clustering problem is that of finding clusters
and their relevant attributes in a dataset. Instead of projecting the entire dataset
onto a single subspace, as dimension reduction techniques do, projected clustering
focuses on finding a specific projection for each cluster such that the similarity is