Constrained Partitional Clustering of Text Data: An Overview - Text Mining: Classification, Clustering, and Applications

Database Reference

In-Depth Information

7.2 Uses of Constraints

The typical supervised learning situations involves having a label associ-

ated with each instance. The semi-supervised learning situation is when only

a small subset of instances have labels. If the available labeled data repre-

sent all the relevant categories, then semi-supervised classification algorithms

can be readily used for data categorization. For details see the various algo-

rithms in the surveys (42; 49). However in many domains, knowledge of the

relevant categories is incomplete. Moreover, pairwise constraints are often a

more naturally available form of supervision than labels in certain cluster-

ing tasks. Moreover, in an interactive learning setting, a user who is not a

domain expert can sometimes provide feedback in the form of must-link and

cannot-link constraints (12; 14) more easily than class labels, since providing

constraints does not require the user to have significant prior knowledge about

the categories in the dataset.

Constraints have typically been used in clustering algorithms in two ways.

Constraints can be used to modify the cluster assignment stage of the cluster

algorithm, to enforce satisfaction of the constraints as much as possible. Alter-

natively, the distance function of the clustering algorithm can also be trained

either before or after the clustering actually occurs using the constraints. In

all of these cases, constraints can also be used in the initialization phase,

where the initial clusters are formed such that must-linked instances are in

the same clusters and cannot-linked instances are in different clusters. Based

on this categorization, existing methods for constrained clustering can be put

into two general approaches that we call constraint-based and distance-based

methods.

7.2.1 Constraint-Based Methods

In constraint-based approaches, the clustering algorithm itself is modified

so that the available labels or constraints are used to bias the search for an

appropriate clustering of the data. The pairwise constraints specify whether

two instances should be in the same cluster (must-link) or in different clus-

ters (cannot-link). Constraint-based clustering has been done using several

techniques:

•

modifying the clustering objective function so that it includes a term

for satisfying specified constraints (17)

•

clustering using side-information from conditional distributions in an

auxiliary space (44)

•

enforcing constraints to be satisfied during the cluster assignment in the

clustering process (47)

Search WWH ::

Custom Search

Home