Database Reference
In-Depth Information
constraints introduced by Wagstaff (46) are must-link denoted by c = ( x, y )
and cannot-link denoted by c = ( x, y ), meaning that two instance x and y must
be in the same cluster or cannot be in the same cluster respectively. Must-
link and cannot-link constraints, though apparently simple, share interesting
properties. Must-link constraints are an example of an equivalence relation
and hence are symmetrical, reflexive and transitive; this means that c = ( x, y )
and c = ( y, z )
c = ( x, z ) such that x, y, z form a connected component, i.e.,
each is connected to the other via an explicit or implied must-link constraint.
Similarly, multiple connected components of must-link constraints can give
rise to entailed cannot-link constraints, between pairs of instances in different
components.
Though apparently simple, must-link and cannot-link constraints are pow-
erful. In sucient numbers they can shatter the training set X and specify
any set partition of X . These constraints can be used to improve clustering
in different ways, which are outlined in the next section. Let us consider some
real-world examples where constraints are useful in text clustering.
Content Management : In content-management tasks (routinely per-
formed by companies like Google, Interwoven or Verity), the goal is to au-
tomatically categorize large amounts (often in the order of millions) of text
documents into groups or clusters. In this case, constraints can be obtained
from multiple auxiliary sources, e.g., the co-occurrence of two documents in a
directory can be used to infer a must-link constraint between the documents,
two documents in different categories of the Open Directory Project 1 hier-
archy can be considered as cannot-linked, etc. Using these constraints from
the auxiliary data sources, one can customize the clustering output for the
particular task, e.g., make a document hierarchy that is close to the input
directory structure in which the documents are placed.
Web mining : Constrained clustering is quite useful in post processing
search results, as performed by companies like Vivisimo. 2 Here, the goal is
to automatically cluster the results of ambiguous search-engine queries like
“jaguar” into clusters of URLs that refer to concepts like “Jaguar cars,”
“Jaguar animal” or “Jaguar Mac OS”. In this case, constraints can be mined
from query sessions in web logs - one can get valuable information regarding
which websites are visited together, by analyzing co-occurrence of url's within
the same user session. Clustering using this auxiliary data can help in biasing
the search result clustering towards the preferences of the user.
1 www.dmoz.org
2 www.vivisimo.com
Search WWH ::




Custom Search