micro-array experiment and the literature informa-
tion outperforms conventional clustering.
Most algorithms for semi-supervised clustering
model potentially incomplete side-information
by a set of constraints. Must-link constraints are
established between pairs of objects which should
be clustered together, for example two genes
which are known to be functionally related from
literature, and cannot-link constraints are established between objects which should be assigned
to different clusters. This set of constraints is then
incorporated into a modified clustering algorithm.
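As an illustration, such pairwise constraints can be represented simply as sets of index pairs; the following sketch (the helper name and data are illustrative, not from any of the cited papers) counts how many constraints a given cluster assignment violates:

```python
def violations(labels, must_link, cannot_link):
    """Count how many pairwise constraints a cluster assignment breaks.

    labels[i] is the cluster index of object i; must_link and cannot_link
    are lists of (i, j) index pairs.
    """
    v = sum(1 for i, j in must_link if labels[i] != labels[j])
    v += sum(1 for i, j in cannot_link if labels[i] == labels[j])
    return v

labels = [0, 0, 1, 1]
must_link = [(0, 1)]     # objects 0 and 1 should share a cluster
cannot_link = [(1, 2)]   # objects 1 and 2 should be separated
# violations(labels, must_link, cannot_link) -> 0
```

A constrained clustering algorithm then either enforces such constraints strictly during assignment or penalizes their violation in its objective function.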
Following the classification of clustering algo-
rithms introduced in the background section, we
can assign most algorithms for semi-supervised
clustering into the classes of iterative partitioning
and hierarchical density-based algorithms.
The first iterative partitioning algorithm for
semi-supervised clustering is COP-K-means, introduced in (Wagstaff et al. 2001). COP-K-means
extends K-means with must-link and cannot-link
constraints formulated as described above. The
cluster assignment step of the basic K-means
algorithm is modified as follows: Objects with
cannot-link constraints must be assigned to differ-
ent clusters. If this is not possible, i.e. if no cluster
exists to host an object, the algorithm aborts. Note
that supervision is only used for the cluster assign-
ment of the objects for which side information is
available, also often called the labeled objects,
and not for the other objects, called the un-labeled
objects. Shental et al. (2003) propose a constrained
version of the EM algorithm. Must-link constraints
can be respected by a modification of the E-step:
For two objects with a must-link constraint,
only the proportion of probability respecting the
constraint is considered. The incorporation of
cannot-link constraints is more complex since
they are not transitive. To incorporate both types
of constraints into EM, a Markov network is ap-
plied. Because of the probabilistic nature of the
EM algorithm, supervision not only affects the
clustering of the labeled objects, but implicitly
also the clustering of the un-labeled objects.
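The constrained assignment step of COP-K-means described above can be sketched as follows, assuming squared Euclidean distance and fixed centers for one iteration (function and variable names are illustrative, not from the original paper):

```python
def violates(i, cluster, assign, must_link, cannot_link):
    """Would placing object i in `cluster` break a constraint, given the
    assignments already made in this pass?"""
    for a, b in must_link:
        if i in (a, b):
            other = b if i == a else a
            if assign.get(other) is not None and assign[other] != cluster:
                return True
    for a, b in cannot_link:
        if i in (a, b):
            other = b if i == a else a
            if assign.get(other) == cluster:
                return True
    return False

def assign_step(points, centers, must_link, cannot_link):
    """One COP-K-means assignment pass: each object goes to the nearest
    feasible center; if no feasible center exists, the algorithm aborts."""
    assign = {}
    for i, p in enumerate(points):
        order = sorted(range(len(centers)),
                       key=lambda c: sum((x - y) ** 2
                                         for x, y in zip(p, centers[c])))
        for c in order:
            if not violates(i, c, assign, must_link, cannot_link):
                assign[i] = c
                break
        else:
            raise RuntimeError("no feasible cluster; COP-K-means aborts")
    return assign
```

Note that the result depends on the order in which objects are processed, one of the known limitations of this strict enforcement scheme.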
The MPCK-means (Metric Pairwise Constraint
K-means) algorithm proposed in (Bilenko et al.
2004) integrates constraints and metric learning
into K-means and thereby explicitly extends the
influence of supervision to the un-labeled objects.
For each cluster, Euclidean distance is parameter-
ized with a weight matrix which is updated in each
iteration of the algorithm. The update rule for the
weight matrix considers the constraint violations
inside a cluster proportionally to their severity. If, for example, the cluster contains two cannot-link objects which are very close together, the metric needs to be altered more drastically than if these points were already far away from each other. In
the second case it is likely that the objects are
assigned to different clusters in the next iteration
even without any metric change. Although each
cluster has its own associated weight matrix, in the
update step global metric learning is performed
by the linear transformation best representing
the metric changes in all clusters. In (Basu et al.
2004) this idea is theoretically linked to Hidden Markov Random Fields (HMRF) and extended to Bregman divergences, a wide range of
distance functions including cosine similarity for
text data. The foundation on HMRF allows defin-
ing a kernel for semi-supervised graph clustering
(Kulis et al. 2005).
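The effect of such a learned weight matrix can be illustrated with a parameterized squared distance; the update rule itself is omitted here, and the matrix values below are arbitrary examples, not learned ones:

```python
def weighted_dist(x, y, W):
    """Parameterized squared distance (x - y)^T W (x - y), with one
    weight matrix W per cluster."""
    d = [a - b for a, b in zip(x, y)]
    return sum(d[i] * W[i][j] * d[j]
               for i in range(len(d)) for j in range(len(d)))

# With the identity matrix this reduces to squared Euclidean distance.
I = [[1.0, 0.0], [0.0, 1.0]]
# weighted_dist([0, 0], [3, 4], I) -> 25.0

# Up-weighting the first dimension stretches distances along it, so a
# close cannot-link pair can be pushed apart in the learned metric.
W = [[4.0, 0.0], [0.0, 1.0]]
# weighted_dist([0, 0], [3, 4], W) -> 52.0
```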
Due to their different notion of a cluster, hierarchical and density-based approaches to semi-supervised clustering integrate supervision information in
different ways. The first algorithm in this cat-
egory is the CCL (Constrained Complete Link)
algorithm proposed in (Klein et al. 2002), which
integrates constraints into Complete Link cluster-
ing. Before clustering, the constraints are used to
modify the distance matrix between objects: the
distance between must-link objects is set to zero.
To preserve the metric property of the data space,
distances between other objects are adjusted by
a shortest-path algorithm. The distance between
cannot-link objects is set to a value larger than
the maximal distance occurring in the data set.
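This preprocessing can be sketched as follows, assuming a symmetric distance matrix given as nested lists (the function name is illustrative; Floyd-Warshall stands in for any all-pairs shortest-path algorithm):

```python
def ccl_adjust(dist, must_link, cannot_link):
    """Sketch of the CCL distance-matrix preprocessing: zero out must-link
    distances, restore metricity by all-pairs shortest paths, then set
    cannot-link distances beyond the maximum occurring distance."""
    n = len(dist)
    D = [row[:] for row in dist]
    for i, j in must_link:                  # must-link pairs collapse to distance 0
        D[i][j] = D[j][i] = 0.0
    for k in range(n):                      # Floyd-Warshall shortest paths
        for i in range(n):
            for j in range(n):
                if D[i][k] + D[k][j] < D[i][j]:
                    D[i][j] = D[i][k] + D[k][j]
    big = max(max(row) for row in D) + 1.0  # larger than any occurring distance
    for i, j in cannot_link:
        D[i][j] = D[j][i] = big
    return D
```

The shortest-path step also propagates must-link information: an object close to one member of a must-link pair becomes close to the other member as well.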
Cannot-link constraints are implicitly propagated