9.1. Introduction
Clustering is the task of finding groups of data points such that the points in each
group are similar to each other according to their attributes (dimensions). Clustering
has been studied extensively by both the statistics and database communities
because of its numerous applications in problems such as classification, machine
learning, data mining, indexing, trend analysis, customer segmentation, and pattern
recognition. Most well-known clustering algorithms proposed in recent years,
such as CLARANS [12], DBSCAN [7], and CURE [9], cluster data points based
on a distance function that uses all dimensions of the data. They have been shown
to work well on low-dimensional datasets. However, as the dimensionality grows,
these algorithms lose efficiency and accuracy because of the inherent sparsity of
the data [3]. It is shown in [5] that the distance of
a point to its nearest neighbor approaches the distance to its farthest neighbor as
dimensionality increases, which means that distances computed over the full set of
dimensions cease to be meaningful in high-dimensional space. In fact, natural clusters
may exist only in subspaces; that is, data points may be close to each other
in a few dimensions, but not in all of them. Furthermore, data points in different
clusters may be correlated with respect to different subsets of dimensions. Feature
selection [11] is one technique proposed to deal with high-dimensional data:
it reduces the dimensionality of the space before clustering. However, by choosing
a single subset of dimensions ahead of time it may lose information, and it cannot
select different subsets of dimensions for different clusters. Another solution
is dimension reduction. One widely used
approach is principal component analysis [6], which projects all data points onto
the same lower-dimensional subspace so as to minimize the loss of information. One
drawback is that the reduced dimensions can be difficult to interpret, whereas
most applications require simple descriptions of the clusters. Moreover, it does not
handle well the case where clusters exist in different subspaces of the full dimensions.
In Fig. 9.1, we illustrate two projections on different dimensions for a
set of points in 3-dimensional space. There is clearly no cluster in Fig. 9.1(a).
However, when the data points are projected onto the x-z plane, clusters P and
Q are discovered in Fig. 9.1(b), and clusters R and S are found in Fig. 9.1(c) when
the points are projected onto the y-z plane.
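The idea behind Fig. 9.1 can be sketched with a small example. The sketch below (the cluster, coordinates, and parameters are hypothetical, chosen only for illustration, not taken from the figure) generates a cluster that is tight in the x and z dimensions but noisy in y, so its average pairwise distance is small in the x-z projection yet large over all three dimensions:

```python
import random
import math

random.seed(0)

def spread(points, dims):
    """Average pairwise Euclidean distance using only the given dimensions."""
    total, pairs = 0.0, 0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            total += math.sqrt(sum((points[i][d] - points[j][d]) ** 2
                                   for d in dims))
            pairs += 1
    return total / pairs

# Hypothetical cluster P: tight in x and z, but y is wide uniform noise,
# so P only looks like a cluster when projected onto the x-z plane.
P = [(random.gauss(2.0, 0.1),    # x: tight around 2.0
      random.uniform(-50, 50),   # y: pure noise
      random.gauss(5.0, 0.1))    # z: tight around 5.0
     for _ in range(40)]

print(spread(P, (0, 2)))     # x-z projection: small spread, cluster visible
print(spread(P, (0, 1, 2)))  # full 3-D distance: dominated by the noisy y
```

This is why a distance function over all dimensions (as in the algorithms above) would miss such a cluster, while a projection onto the right subspace reveals it.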
Projected clustering has been proposed recently to deal effectively with high
dimensionality. The projected clustering problem is that of finding clusters
and their relevant attributes in a dataset. Instead of projecting the entire dataset
onto a single subspace, as dimension reduction techniques do, projected clustering
focuses on finding a specific projection for each cluster such that the similarity is