[Fig. 16.8 image: SCY-tree with cell-count nodes per dimension, y dimension (0:10, 2:4, 1:13) and x dimension (0:2, 1:5, 2:4, 3:2), plus marker nodes '1' and '2']
Fig. 16.8 SCY-tree index for depth-first mining of subspace clusters. Nodes contain cell counts as
in frequent itemset mining. Levels correspond to different dimensions, and additional marker nodes
indicate that a border is non-empty and that cells need to be merged during mining. For example,
the gray shaded node labeled '2' at the bottom corresponds to the non-empty border of cell '2' in
dimension x in Fig. 16.7
algorithm yet independent of the cluster model. In the first scenario, we can regard
subspace search as a global identification of 'interesting' subspaces—subspaces in
which we expect clusters to exist—and hence as a restriction of the search space. In
the second scenario, we observe a local identification of 'interesting' subspaces. A
typical use case of these 'locally interesting' subspaces is to adapt distance measures
locally, that is, for different clusters, different measures of similarity are applied.
Global Subspace Search ENCLUS [ 21 ] is based on an assessment of the subspace
as a whole, i.e., a subspace search step precedes the actual subspace clustering. In
order to determine interesting subspaces, Shannon Entropy [ 73 ] is used. Entropy
measures the uncertainty in a random variable, where a high value means a high
level of uncertainty. A uniform distribution implies greatest uncertainty, so a low
entropy value (below some threshold) is used as an indication of subspace clusters.
Similar to CLIQUE, the data are discretized into equal-width cells before entropy
assessment. Monotonicity is based on the fact that an additional attribute can only
increase the uncertainty and thereby the Shannon Entropy:
T has low entropy ⇒ ∀ S ⊆ T : S has low entropy.     (16.5)
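The monotonicity in (16.5) can be checked numerically. The sketch below (the function name and bin count are my own choices, not part of ENCLUS) discretizes the data into equal-width grid cells, computes the Shannon entropy of the cell-count distribution, and verifies that a projection never has higher entropy than its superspace:

```python
import numpy as np

def subspace_entropy(data, dims, bins=4):
    """Shannon entropy (bits) of the data projected onto `dims`,
    after equal-width discretization into `bins` cells per dimension."""
    proj = data[:, dims]
    cell_ids = []
    for i in range(proj.shape[1]):
        edges = np.linspace(proj[:, i].min(), proj[:, i].max(), bins + 1)
        # Use interior edges only, so cell indices fall in 0..bins-1.
        cell_ids.append(np.digitize(proj[:, i], edges[1:-1]))
    # One flat cell id per point, then the empirical cell distribution.
    flat = np.ravel_multi_index(cell_ids, (bins,) * proj.shape[1])
    _, counts = np.unique(flat, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 3))

h_x = subspace_entropy(data, [0])
h_xy = subspace_entropy(data, [0, 1])
# Monotonicity (16.5): the subspace has at most the entropy of its superspace.
assert h_x <= h_xy
```

Because the grid edges for each dimension are computed independently, the one-dimensional cell distribution is exactly the marginal of the joint cell distribution, which is what makes the inequality hold.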
Besides this Apriori bottom-up part of the algorithm, an additional mutual infor-
mation criterion is used for top-down pruning. Interesting subspaces in this sense are
those with an entropy that is lower (by some threshold) than the sum of the entropy
of each of its one-dimensional subspaces. Using both criteria, the most interesting
subspaces for subspace clustering according to ENCLUS are located neither at the
top nor at the bottom of the subspace search space, but at some medium dimension-
ality. This resembles the concept of borders in frequent itemset mining (Sect. 2.3 ).
While borders there are used to derive a condensed representation of the result set,
here the result set is restricted in order to reduce the redundancy of too many clusters.
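The top-down criterion can be sketched as follows; the helper and function names are illustrative, and the grid discretization mirrors the equal-width cells described above. The interest of a subspace is the sum of its one-dimensional entropies minus its joint entropy, which is large exactly when the dimensions are correlated:

```python
import numpy as np

def grid_entropy(proj, bins=4):
    """Shannon entropy (bits) over an equal-width grid on `proj`."""
    ids = [np.digitize(proj[:, i],
                       np.linspace(proj[:, i].min(), proj[:, i].max(),
                                   bins + 1)[1:-1])
           for i in range(proj.shape[1])]
    _, counts = np.unique(np.ravel_multi_index(ids, (bins,) * proj.shape[1]),
                          return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def interest(data, dims, bins=4):
    """ENCLUS-style top-down criterion: sum of one-dimensional entropies
    minus the joint entropy of the subspace."""
    return (sum(grid_entropy(data[:, [d]], bins) for d in dims)
            - grid_entropy(data[:, dims], bins))

rng = np.random.default_rng(1)
x = rng.normal(size=2000)
data = np.column_stack([x,                                # dim 0
                        x + 0.1 * rng.normal(size=2000),  # dim 1: correlated
                        rng.normal(size=2000)])           # dim 2: independent
# Correlated dimensions yield a clearly higher interest value.
assert interest(data, [0, 1]) > interest(data, [0, 2])
```

A subspace passes the top-down test when this value exceeds some threshold, complementing the low-entropy requirement of the bottom-up part.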
For RIS (Ranking Interesting Subspaces) [ 47 ], subspaces are 'interesting' if they
have a large number of points in the neighborhoods of core points (i.e., points with a
high local point density according to some thresholds), normalized by the expected
number of points assuming uniform distribution. While this criterion adopts a density-
based notion [ 51 ] of 'interesting', it is not tied to a specific clustering algorithm.
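A minimal sketch of such a density-based quality measure is given below. This is my simplification, not the exact measure of RIS [47]: it sums neighborhood counts over core points in the subspace and normalizes by the count expected if the points were uniformly distributed over the subspace's bounding box.

```python
import numpy as np
from math import gamma, pi

def ris_quality(data, dims, eps=0.1, min_pts=10):
    """Sketch of an RIS-style subspace quality: total neighborhood count
    over core points, normalized by its expectation under uniformity."""
    proj = data[:, dims]
    n, k = proj.shape
    # Pairwise distances in the subspace (O(n^2); fine for a sketch).
    dist = np.linalg.norm(proj[:, None, :] - proj[None, :, :], axis=-1)
    neigh = (dist <= eps).sum(axis=1)   # includes the point itself
    core = neigh >= min_pts             # core points: dense neighborhoods
    if not core.any():
        return 0.0
    observed = neigh[core].sum()
    # Expected neighbors per point under uniformity:
    # n * (volume of the eps-ball / volume of the bounding box).
    ball = pi ** (k / 2) / gamma(k / 2 + 1) * eps ** k
    box = float(np.prod(proj.max(axis=0) - proj.min(axis=0)))
    expected = core.sum() * n * ball / box
    return observed / expected

rng = np.random.default_rng(2)
clustered = np.column_stack([0.5 + 0.02 * rng.normal(size=(300, 2)),
                             rng.uniform(size=(300, 1))])
noise = rng.uniform(size=(100, 3))
data = np.vstack([clustered, noise])
# The subspace {0, 1} containing a cluster scores higher than the
# uniformly distributed dimension {2}.
assert ris_quality(data, [0, 1]) > ris_quality(data, [2])
```

A score near one indicates a roughly uniform subspace, while values well above one indicate the density concentrations that make a subspace worth clustering.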