micro-array experiment and the literature informa-
tion outperforms conventional clustering.
Most algorithms for semi-supervised clustering
model potentially incomplete side-information
by a set of constraints. Must-link constraints are
established between pairs of objects which should
be clustered together, for example two genes
which are known to be functionally related from
literature, and cannot-link constraints are established between objects which should be assigned
to different clusters. This set of constraints is then
incorporated into a modified clustering algorithm.
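As an illustration, such pairwise constraints can be represented simply as sets of index pairs; the following sketch (the helper name and data are illustrative, not from any of the cited papers) counts how many constraints a given cluster assignment violates:

```python
def violations(labels, must_link, cannot_link):
    """Count how many pairwise constraints a cluster assignment breaks.

    labels[i] is the cluster index of object i; must_link and cannot_link
    are lists of (i, j) index pairs.
    """
    v = sum(1 for i, j in must_link if labels[i] != labels[j])
    v += sum(1 for i, j in cannot_link if labels[i] == labels[j])
    return v

labels = [0, 0, 1, 1]
must_link = [(0, 1)]     # objects 0 and 1 should share a cluster
cannot_link = [(1, 2)]   # objects 1 and 2 should be separated
# violations(labels, must_link, cannot_link) -> 0
```

A constrained clustering algorithm then either enforces such constraints strictly during assignment or penalizes their violation in its objective function.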
Following the classification of clustering algo-
rithms introduced in the background section, we
can assign most algorithms for semi-supervised
clustering into the classes of iterative partitioning
and hierarchical density-based algorithms.
The first iterative partitioning algorithm for
semi-supervised clustering is COP-K-means, introduced in (Wagstaff et al. 2001). COP-K-means
extends K-means with must-link and cannot-link
constraints formulated as described above. The
cluster assignment step of the basic K-means
algorithm is modified as follows: Objects with
cannot-link constraints must be assigned to differ-
ent clusters. If this is not possible, i.e. if no cluster
exists to host an object, the algorithm aborts. Note
that supervision is only used for the cluster assign-
ment of the objects for which side information is
available, also often called the labeled objects,
and not for the other objects, called the un-labeled
objects. Shental et al. (2003) propose a constrained
version of the EM algorithm. Must-link constraints
can be respected by a modification of the E-step:
For two objects with a must-link constraint,
only the proportion of probability respecting the
constraint is considered. The incorporation of
cannot-link constraints is more complex since
they are not transitive. To incorporate both types
of constraints into EM, a Markov network is ap-
plied. Because of the probabilistic nature of the
EM algorithm, supervision not only affects the
clustering of the labeled objects, but implicitly
also the clustering of the un-labeled objects.
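The constrained assignment step of COP-K-means described above can be sketched as follows, assuming squared Euclidean distance and fixed centers for one iteration (function and variable names are illustrative, not from the original paper):

```python
def violates(i, cluster, assign, must_link, cannot_link):
    """Would placing object i in `cluster` break a constraint, given the
    assignments already made in this pass?"""
    for a, b in must_link:
        if i in (a, b):
            other = b if i == a else a
            if assign.get(other) is not None and assign[other] != cluster:
                return True
    for a, b in cannot_link:
        if i in (a, b):
            other = b if i == a else a
            if assign.get(other) == cluster:
                return True
    return False

def assign_step(points, centers, must_link, cannot_link):
    """One COP-K-means assignment pass: each object goes to the nearest
    feasible center; if no feasible center exists, the algorithm aborts."""
    assign = {}
    for i, p in enumerate(points):
        order = sorted(range(len(centers)),
                       key=lambda c: sum((x - y) ** 2
                                         for x, y in zip(p, centers[c])))
        for c in order:
            if not violates(i, c, assign, must_link, cannot_link):
                assign[i] = c
                break
        else:
            raise RuntimeError("no feasible cluster; COP-K-means aborts")
    return assign
```

Note that the result depends on the order in which objects are processed, one of the known limitations of this strict enforcement scheme.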
The MPCK-means (Metric Pairwise Constraint
K-means) algorithm proposed in (Bilenko et al.
2004) integrates constraints and metric learning
into K-means and thereby explicitly extends the
influence of supervision to the un-labeled objects.
For each cluster, Euclidean distance is parameter-
ized with a weight matrix which is updated in each
iteration of the algorithm. The update rule for the
weight matrix considers the constraint violations
inside a cluster proportionally to their severity. If, for example, the cluster contains two cannot-link objects which are very close together, the metric needs to be altered more drastically than if these points were already far away from each other. In
the second case it is likely that the objects are
assigned to different clusters in the next iteration
even without any metric change. Although each
cluster has its own associated weight matrix, in the
update step global metric learning is performed
by the linear transformation best representing
the metric changes in all clusters. In (Basu et al.
2004) this idea is theoretically linked to Hidden Markov Random Fields (HMRF) and extended to Bregman divergences, a wide range of
distance functions including cosine similarity for
text data. The foundation on HMRF allows defin-
ing a kernel for semi-supervised graph clustering
(Kulis et al. 2005).
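The effect of such a learned weight matrix can be illustrated with a parameterized squared distance; the update rule itself is omitted here, and the matrix values below are arbitrary examples, not learned ones:

```python
def weighted_dist(x, y, W):
    """Parameterized squared distance (x - y)^T W (x - y), with one
    weight matrix W per cluster."""
    d = [a - b for a, b in zip(x, y)]
    return sum(d[i] * W[i][j] * d[j]
               for i in range(len(d)) for j in range(len(d)))

# With the identity matrix this reduces to squared Euclidean distance.
I = [[1.0, 0.0], [0.0, 1.0]]
# weighted_dist([0, 0], [3, 4], I) -> 25.0

# Up-weighting the first dimension stretches distances along it, so a
# close cannot-link pair can be pushed apart in the learned metric.
W = [[4.0, 0.0], [0.0, 1.0]]
# weighted_dist([0, 0], [3, 4], W) -> 52.0
```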
Due to their different notion of a cluster, hierarchical and density-based approaches to semi-supervised clustering integrate supervision information in
different ways. The first algorithm in this cat-
egory is the CCL (Constrained Complete Link)
algorithm proposed in (Klein et al. 2002), which
integrates constraints into Complete Link cluster-
ing. Before clustering, the constraints are used to
modify the distance matrix between objects: the
distance between must-link objects is set to zero.
To preserve the metric property of the data space,
distances between other objects are adjusted by
a shortest-path algorithm. The distance between
cannot-link objects is set to a value larger than
the maximal distance occurring in the data set.
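This preprocessing can be sketched as follows, assuming a symmetric distance matrix given as nested lists (the function name is illustrative; Floyd-Warshall stands in for any all-pairs shortest-path algorithm):

```python
def ccl_adjust(dist, must_link, cannot_link):
    """Sketch of the CCL distance-matrix preprocessing: zero out must-link
    distances, restore metricity by all-pairs shortest paths, then set
    cannot-link distances beyond the maximum occurring distance."""
    n = len(dist)
    D = [row[:] for row in dist]
    for i, j in must_link:                  # must-link pairs collapse to distance 0
        D[i][j] = D[j][i] = 0.0
    for k in range(n):                      # Floyd-Warshall shortest paths
        for i in range(n):
            for j in range(n):
                if D[i][k] + D[k][j] < D[i][j]:
                    D[i][j] = D[i][k] + D[k][j]
    big = max(max(row) for row in D) + 1.0  # larger than any occurring distance
    for i, j in cannot_link:
        D[i][j] = D[j][i] = big
    return D
```

The shortest-path step also propagates must-link information: an object close to one member of a must-link pair becomes close to the other member as well.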
Cannot-link constraints are implicitly propagated