Advanced Cluster Analysis - Data Mining: Concepts and Techniques

Databases Reference

In-Depth Information

maximal

-bicluster of the largest size is computationally costly. Therefore, we can use

a heuristic greedy search method to obtain a local optimal cluster. The algorithm works

in two phases.

In the deletion phase , we start from the whole matrix. While the mean-squared

residue of the matrix is over

, we iteratively remove rows and columns. At each

iteration, for each row i , we compute the mean-squared residue as

j 2 J .

j J j

2 .

e ij e iJ e Ij C e IJ /

(11.21)

Moreover, for each column j , we compute the mean-squared residue as

i 2 I .

j I j

2 .

e ij e iJ e Ij C e IJ /

(11.22)

We remove the row or column of the largest mean-squared residue. At the end of this

phase, we obtain a submatrix I J that is a

-bicluster. However, the submatrix may

not be maximal.

In the addition phase , we iteratively expand the

-bicluster I J obtained in the dele-

tion phase as long as the

-bicluster requirement is maintained. At each iteration, we

consider rows and columns that are not involved in the current bicluster I J by cal-

culating their mean-squared residues. A row or column of the smallest mean-squared

residue is added into the current

-bicluster.

-bicluster only. To find multiple biclusters that

do not have heavy overlaps, we can run the algorithm multiple times. After each execu-

tion where a

This greedy algorithm can find one

-bicluster is output, we can replace the elements in the output bicluster

by random numbers. Although the greedy algorithm may find neither the optimal

biclusters nor all biclusters, it is very fast even on large matrices.

EnumeratingAllBiclustersUsingMaPle

As mentioned, a submatrix I J is a bicluster with coherent values if and only if for any

i 1 , i 2 2 I and j 1 , j 2 2 J , e i 1 j 1 e i 2 j 1 D e i 1 j 2 e i 2 j 2 . For any 22 submatrix of I J , we can

define a p-score as

e i 1 j 1

e i 1 j 2

p -score

Dj.

e i 1 j 1 e i 2 j 1 /.

e i 1 j 2 e i 2 j 2 /j.

(11.23)

e i 2 j 1

e i 2 j 2

-pCluster (for p attern-based cluster ) if the p -score of every

22 submatrix of I J is at most

A submatrix I J is a

0 is a threshold specifying a user's

tolerance of noise against a perfect bicluster. Here, the p -score controls the noise on

every element in a bicluster, while the mean-squared residue captures the average noise.

An interesting property of

, where

-pCluster is that if I J is a

-pCluster, then every

x y

x , y 2

submatrix of I J is also a

-pCluster. This monotonicity enables

Data Mining: Concepts and Techniques

Search WWH ::

Custom Search

Home