Constrained Partitional Clustering of Text Data: An Overview - Text Mining: Classification, Clustering, and Applications

Database Reference

In-Depth Information

constraints introduced by Wagstaff (46) are must-link denoted by c = ( x, y )

and cannot-link denoted by c = ( x, y ), meaning that two instance x and y must

be in the same cluster or cannot be in the same cluster respectively. Must-

link and cannot-link constraints, though apparently simple, share interesting

properties. Must-link constraints are an example of an equivalence relation

and hence are symmetrical, reflexive and transitive; this means that c = ( x, y )

and c = ( y, z )

c = ( x, z ) such that x, y, z form a connected component, i.e.,

each is connected to the other via an explicit or implied must-link constraint.

Similarly, multiple connected components of must-link constraints can give

rise to entailed cannot-link constraints, between pairs of instances in different

components.

Though apparently simple, must-link and cannot-link constraints are pow-

erful. In sucient numbers they can shatter the training set X and specify

any set partition of X . These constraints can be used to improve clustering

in different ways, which are outlined in the next section. Let us consider some

real-world examples where constraints are useful in text clustering.

Content Management : In content-management tasks (routinely per-

formed by companies like Google, Interwoven or Verity), the goal is to au-

tomatically categorize large amounts (often in the order of millions) of text

documents into groups or clusters. In this case, constraints can be obtained

from multiple auxiliary sources, e.g., the co-occurrence of two documents in a

directory can be used to infer a must-link constraint between the documents,

two documents in different categories of the Open Directory Project 1 hier-

archy can be considered as cannot-linked, etc. Using these constraints from

the auxiliary data sources, one can customize the clustering output for the

particular task, e.g., make a document hierarchy that is close to the input

directory structure in which the documents are placed.

Web mining : Constrained clustering is quite useful in post processing

search results, as performed by companies like Vivisimo. 2 Here, the goal is

to automatically cluster the results of ambiguous search-engine queries like

“jaguar” into clusters of URLs that refer to concepts like “Jaguar cars,”

“Jaguar animal” or “Jaguar Mac OS”. In this case, constraints can be mined

from query sessions in web logs - one can get valuable information regarding

which websites are visited together, by analyzing co-occurrence of url's within

the same user session. Clustering using this auxiliary data can help in biasing

the search result clustering towards the preferences of the user.

⇒

1 www.dmoz.org

2 www.vivisimo.com

Search WWH ::

Custom Search

Home