Clustering and Visualization of Retail Market Baskets - Advanced Techniques in Knowledge Discovery and Data Mining

Database Reference

In-Depth Information

3.5.2 Other Applications

In this chapter, we focused on the application of Opossum and Clusion to

market basket data. The framework is more general and can be applied to a

variety of other domains, such as text document or Web session clustering.

For each domain, the similarity metric s and the object weights have to

be adopted. For text clustering, similarity of two pages can be defined as

the cosine of their term frequency vectors. Page weight can be chosen to be

proportional to the document length. In Web session clustering, the similarity

can be based on the cosine of the visited pages vector and the weight can be

based on the time spent on a particular Web page. For details and results,

see [3.13] for document clustering and [3.22] for clustering Web sessions.

3.6 System Issues

3.6.1 Synergy Between OPOSSUMandCLUSION

The visualization and clustering techniques presented in this work need to

be considered together, not in isolation. This is because

Clusion

is particu-

larly suited to viewing the output of

. First, the similarity matrix

is already computed during the clustering step, so no extra computation is

needed, except for permuting this matrix, which can be done in O(n) time

because the size and seriation order of each partition are known. Second,

because Metis involves boundary Kernighan-Lin refinement, clusters that are

similar appear closer in the seriation order. Thus it is no coincidence that

clusters

Opossum

C 15 appear contiguous in Fig. 3.4(e). Finally, one can exper-

iment with different similarity measures for

C 14 and

Opossum

and quickly get visual

feedback regarding their effectiveness using

Clusion

.

3.6.2 Scalability

The computational bottleneck in the overall process lies in calculating the

similarity matrix, which involves O(n 2 d) operations, because similarity needs

to be computed between each pair of data points and involves all the dimen-

sions. 7 However, once this matrix is computed, any subsequent clustering

routine does not depend on d at all! Metis is very fast, almost linear in the

number of vertices for reasonably sparse graphs, as has been shown over nu-

merous experiments [3.11]. Finally, the reordering of the similarity matrix for

visualization is O(n). Thus the overall method is linear in d.

The quadratic complexity with respect to the number of objects, n,is

problematic for large data sets. Note that any clustering algorithm that com-

pares an object with all others (e.g., agglomerative, all relationship-based

7 By exploiting sparsity, computation of a single similarity value can be reduced

from O(d)toO(numberofnonzeros in d).

Advanced Techniques in Knowledge Discovery and Data Mining

Search WWH ::

Custom Search

Home