Database Reference
In-Depth Information
3.5.2 Other Applications
In this chapter, we focused on the application of Opossum and Clusion to
market basket data. The framework is more general and can be applied to a
variety of other domains, such as text document or Web session clustering.
For each domain, the similarity metric s and the object weights have to
be adopted. For text clustering, similarity of two pages can be defined as
the cosine of their term frequency vectors. Page weight can be chosen to be
proportional to the document length. In Web session clustering, the similarity
can be based on the cosine of the visited pages vector and the weight can be
based on the time spent on a particular Web page. For details and results,
see [3.13] for document clustering and [3.22] for clustering Web sessions.
3.6 System Issues
3.6.1 Synergy Between OPOSSUMandCLUSION
The visualization and clustering techniques presented in this work need to
be considered together, not in isolation. This is because
Clusion
is particu-
larly suited to viewing the output of
. First, the similarity matrix
is already computed during the clustering step, so no extra computation is
needed, except for permuting this matrix, which can be done in O(n) time
because the size and seriation order of each partition are known. Second,
because Metis involves boundary Kernighan-Lin refinement, clusters that are
similar appear closer in the seriation order. Thus it is no coincidence that
clusters
Opossum
C 15 appear contiguous in Fig. 3.4(e). Finally, one can exper-
iment with different similarity measures for
C 14 and
Opossum
and quickly get visual
feedback regarding their effectiveness using
Clusion
.
3.6.2 Scalability
The computational bottleneck in the overall process lies in calculating the
similarity matrix, which involves O(n 2 d) operations, because similarity needs
to be computed between each pair of data points and involves all the dimen-
sions. 7 However, once this matrix is computed, any subsequent clustering
routine does not depend on d at all! Metis is very fast, almost linear in the
number of vertices for reasonably sparse graphs, as has been shown over nu-
merous experiments [3.11]. Finally, the reordering of the similarity matrix for
visualization is O(n). Thus the overall method is linear in d.
The quadratic complexity with respect to the number of objects, n,is
problematic for large data sets. Note that any clustering algorithm that com-
pares an object with all others (e.g., agglomerative, all relationship-based
7 By exploiting sparsity, computation of a single similarity value can be reduced
from O(d)toO(numberofnonzeros in d).
Search WWH ::




Custom Search