Constrained Partitional Clustering of Text Data: An Overview - Text Mining: Classification, Clustering, and Applications

Chapter 7

Constrained Partitional Clustering of Text Data: An Overview

Sugato Basu and Ian Davidson

7.1 Introduction
7.2 Uses of Constraints
7.3 Text Clustering
7.4 Partitional Clustering with Constraints
7.5 Learning Distance Function with Constraints
7.6 Satisfying Constraints and Learning Distance Functions
7.7 Experiments
7.8 Conclusions

7.1 Introduction

Clustering is ubiquitously used in data mining as a method of discovering
novel and actionable subsets within a set of data. Given a set of data X,
the typical aim of partitional clustering is to form a k-block set partition Π_k
of the data. The process of clustering is important since, being completely
unsupervised, it allows the addition of structure to previously unstructured
items such as free-form text documents. For example, Cohn et al. (12) discuss
a problem faced by Yahoo!, namely that one is given very large corpora of text
documents, papers, and articles and asked to create a useful taxonomy so that
similar documents are closer in the taxonomy. Once the taxonomy is formed, the
documents can be efficiently browsed and accessed. Unconstrained clustering
is ideal for this initial situation, since in this case little domain expertise
exists to begin with. However, as data mining progresses into more demanding
areas, the chance of finding actionable patterns consistent with background
knowledge and expectation is limited.
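
To make the notion of a k-block set partition concrete, the following minimal sketch checks whether a candidate grouping of documents is a valid partition: exactly k non-empty blocks that are pairwise disjoint and together cover the data. The function name `is_k_block_partition` and the document identifiers are illustrative assumptions, not part of the chapter.

```python
from itertools import chain


def is_k_block_partition(blocks, data, k):
    """Return True if `blocks` splits `data` into exactly k non-empty,
    pairwise disjoint blocks whose union is the whole data set.

    Assumes the items in `data` are distinct and hashable (hypothetical
    document identifiers in the example below).
    """
    # Exactly k blocks, none of them empty.
    if len(blocks) != k or any(len(block) == 0 for block in blocks):
        return False
    members = list(chain.from_iterable(blocks))
    # Disjoint: no item appears in two blocks. Covering: union equals data.
    return len(members) == len(set(members)) and set(members) == set(data)


docs = ["d1", "d2", "d3", "d4", "d5"]

# A valid 3-block partition of the five documents.
valid = [{"d1", "d2"}, {"d3"}, {"d4", "d5"}]
print(is_k_block_partition(valid, docs, k=3))

# Not a partition: "d2" appears in two blocks.
overlapping = [{"d1", "d2"}, {"d2", "d3"}, {"d4", "d5"}]
print(is_k_block_partition(overlapping, docs, k=3))
```

A partitional clustering algorithm can be viewed as a search over groupings that pass this check, guided by some objective over the blocks.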

Clustering with constraints, or semi-supervised clustering, is an emerging
area of great importance to data mining that allows the incorporation of
background domain expertise. Work so far has incorporated this knowledge
into clustering in the form of instance-level constraints. The two types of
