Clustering and Visualization of Retail Market Baskets - Advanced Techniques in Knowledge Discovery and Data Mining

Database Reference

In-Depth Information

The permutation matrix then turns out to be

⎛

⎝

⎞

⎠

0010

0100

0001

1000

P =

.

(3.9)

The permuted similarity matrix S and the corresponding label vector λ and

data matrix X are:

S = PSP † ,λ = P λ, X = PX .

(3.10)

So in our example

⎛

⎝

⎞

⎠

1

2

3

λ =

is now serialized (through row swapping) and the corresponding similarity

matrix S contains the entries shown in

Clusion

(through row and column

swapping).

For a “good” clustering algorithm and k

n, this is related to sparse

matrix reordering, for this results in the generation of a “banded matrix”

where high entries should all fall near the diagonal line from the upper left

to the lower right of the matrix. Because Eq. 3.10 is essentially a partial

ordering operation we also refer to it as coarse seriation, a phrase used in

disciplines such as anthropology and archaeology to describe the reordering

of the primary data matrix so that similar structures (e.g., genetic sequences)

are brought closer [3.20], [3.21].

→

3.4.2 Visualization

The seriation of the similarity matrix, S , is very useful for visualization. Be-

cause the similarity matrix is two dimensional, it can be readily visualized

as a gray-level image where a white (black) pixel corresponds to minimal

(maximal) similarity of 0 (1). The darkness (gray-level value) of the pixel

at row a and column b increases with the similarity between the samples

x a and x b . When looking at the image it is useful to consider the similar-

ity s as a random variable taking values from 0 to 1. The similarity within

cluster is thus represented by the average intensity within a square region

with side length n around the main diagonal of the matrix. The off-diagonal

rectangular areas visualize the relationships between clusters. The brightness

distribution in the rectangular areas yields insight toward the quality of the

clustering and possible improvements. To make these regions apparent, thin

Advanced Techniques in Knowledge Discovery and Data Mining

Search WWH ::

Custom Search

Home