Chapter 7
Constrained Partitional Clustering of Text Data: An Overview
Sugato Basu and Ian Davidson
7.1 Introduction
7.2 Uses of Constraints
7.3 Text Clustering
7.4 Partitional Clustering with Constraints
7.5 Learning Distance Function with Constraints
7.6 Satisfying Constraints and Learning Distance Functions
7.7 Experiments
7.8 Conclusions
7.1 Introduction
Clustering is ubiquitously used in data mining as a method of discovering
novel and actionable subsets within a set of data. Given a set of data X,
the typical aim of partitional clustering is to form a k-block set partition Π_k
of the data. The process of clustering is important since, being completely
unsupervised, it allows the addition of structure to previously unstructured
items such as free-form text documents. For example, Cohn et al. (12) discuss
a problem faced by Yahoo!, namely that one is given very large corpora of text
documents/papers/articles and asked to create a useful taxonomy so that
similar documents are closer in the taxonomy. Once the taxonomy is formed, the
documents can be efficiently browsed and accessed. Unconstrained clustering
is ideal for this initial situation, since in this case little domain expertise
exists to begin with. However, as data mining progresses into more demanding
areas, the chance that unconstrained clustering finds actionable patterns
consistent with background knowledge and expectations is limited.
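To make the notion of a k-block set partition concrete, the following is a minimal sketch in Python. It assumes the standard definition of a set partition: exactly k non-empty blocks, pairwise disjoint, whose union is the whole data set X. The function name `is_k_block_partition` and the toy document identifiers are hypothetical and used only for illustration; they do not come from the chapter.

```python
from itertools import combinations


def is_k_block_partition(blocks, data, k):
    """Return True if `blocks` is a k-block set partition of `data`.

    Assumed (standard) conditions: exactly k blocks, every block non-empty,
    blocks pairwise disjoint, and the union of the blocks equal to `data`.
    """
    blocks = [set(b) for b in blocks]
    if len(blocks) != k:
        return False
    if any(len(b) == 0 for b in blocks):
        return False
    # No item may appear in two different blocks.
    if any(a & b for a, b in combinations(blocks, 2)):
        return False
    # Every item of the data set must be assigned to some block.
    return set().union(*blocks) == set(data)


# Hypothetical toy corpus: only document identifiers are needed here.
X = ["doc1", "doc2", "doc3", "doc4", "doc5"]

good = [{"doc1", "doc4"}, {"doc2"}, {"doc3", "doc5"}]
overlapping = [{"doc1", "doc4"}, {"doc2", "doc4"}, {"doc3", "doc5"}]
incomplete = [{"doc1"}, {"doc2"}, {"doc3", "doc5"}]

print(is_k_block_partition(good, X, 3))
print(is_k_block_partition(overlapping, X, 3))
print(is_k_block_partition(incomplete, X, 3))
```

Only the first candidate satisfies all the conditions: the second assigns doc4 to two blocks, and the third leaves doc4 unassigned. A partitional clustering algorithm can be viewed as a search over valid partitions of this kind for one that scores well under some objective.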
Clustering with constraints, or semi-supervised clustering, is an emerging
area of great importance to data mining that allows the incorporation of
background domain expertise. Work so far has incorporated this knowledge
into clustering in the form of instance-level constraints. The two types of