Advanced Cluster Analysis - Data Mining: Concepts and Techniques

Databases Reference

In-Depth Information

us to obtain a succinct representation of nonredundant

-pCluster is

maximal if no more rows or columns can be added into the cluster while maintaining the

-pClusters. A

-pCluster property. To avoid redundancy, instead of finding all

-pClusters, we only

need to compute all maximal

-pClusters.

MaPle is an algorithm that enumerates all maximal

-pClusters. It systematically

enumerates every combination of conditions using a set enumeration tree and a depth-

first search. This enumeration framework is the same as the pattern-growth methods

for frequent pattern mining (Chapter 6). Consider gene expression data. For each con-

dition combination, J , MaPle finds the maximal subsets of genes, I , such that I J is

a

-pCluster. If I J is not a submatrix of another

-pCluster, then I J is a maximal

-pCluster.

There may be a huge number of condition combinations. MaPle prunes many

unfruitful combinations using the monotonicity of

-pClusters. For a condition com-

bination, J , if there does not exist a set of genes, I , such that I J is a

-pCluster, then

we do not need to consider any superset of J . Moreover, we should consider I J as a

candidate of a

-subset J 0 of J , I J 0 is a

-pCluster.

MaPle also employs several pruning techniques to speed up the search while retaining

the completeness of returning all maximal

-pCluster only if for every

.j J j1

/

-pClusters. For example, when examining a

current

-pCluster, I J , MaPle collects all the genes and conditions that may be added

to expand the cluster. If these candidate genes and conditions together with I and J form

a submatrix of a

-pCluster that has already been found, then the search of I J and any

superset of J can be pruned. Interested readers may refer to the bibliographic notes for

additional information on the MaPle algorithm (Section 11.7).

An interesting observation here is that the search for maximal

-pClusters in MaPle is

somewhat similar to mining frequent closed itemsets. Consequently, MaPle borrows the

depth-first search framework and ideas from the pruning techniques of pattern-growth

methods for frequent pattern mining. This is an example where frequent pattern mining

and cluster analysis may share similar techniques and ideas.

An advantage of MaPle and the other algorithms that enumerate all biclusters is that

they guarantee the completeness of the results and do not miss any overlapping biclus-

ters. However, a challenge for such enumeration algorithms is that they may become very

time consuming if a matrix becomes very large, such as a customer-purchase matrix of

hundreds of thousands of customers and millions of products.

11.2.4 DimensionalityReductionMethodsandSpectral

Clustering

Subspace clustering methods try to find clusters in subspaces of the original data

space. In some situations, it is more effective to construct a new space instead of using

subspaces of the original data. This is the motivation behind dimensionality reduction

methods for clustering high-dimensional data.

Example11.14 Clustering in a derived space. Consider the three clusters of points in Figure 11.9. It is

not possible to cluster these points in any subspace of the original space, X Y , because

Data Mining: Concepts and Techniques

Search WWH ::

Custom Search

Home