[Fig. 16.5 data, reconstructed from the figure residue:
Transactions — 1: {a, c}; 2: {a, c, d}; 3: {a, d}; 4: {a, b, c}; 5: {c, d}.
Item frequencies — a:4, b:1, c:4, d:3.
FP-tree paths with node counts — a:4 → c:3 → d:1; a:4 → c:3 → b:1;
a:4 → d:1; c:1 → d:1]
Fig. 16.5 FP-tree example: the tree nodes store items and their counts;
paths correspond to combinations of itemsets and their respective counts.
The index is built in just two scans over the data, and the frequent
itemset mining algorithm FP-Growth works exclusively on the index. Once
individual item frequencies are established, the second scan updates
counts for each transaction or creates new nodes where necessary.
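The two-scan construction described in the caption can be sketched as
follows. This is an illustrative sketch, not FP-Growth itself or code from
any particular library; the `Node` class and `build_fp_tree` function are
assumed names, and the transactions are those of Fig. 16.5.

```python
from collections import defaultdict

class Node:
    """One FP-tree node: an item, its count, a parent link, and children."""
    def __init__(self, item, parent):
        self.item, self.count, self.parent = item, 1, parent
        self.children = {}

def build_fp_tree(transactions, min_support):
    # First scan: establish individual item frequencies.
    freq = defaultdict(int)
    for t in transactions:
        for item in t:
            freq[item] += 1
    # Frequent items in descending frequency order (ties alphabetically).
    order = {item: rank for rank, (item, _) in enumerate(
        sorted(((i, c) for i, c in freq.items() if c >= min_support),
               key=lambda p: (-p[1], p[0])))}
    root = Node(None, None)
    root.count = 0
    # Second scan: each transaction's frequent items, in the global order,
    # either update counts along an existing path or create new nodes.
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in order), key=order.get):
            if item in node.children:
                node.children[item].count += 1
            else:
                node.children[item] = Node(item, node)
            node = node.children[item]
    return root, dict(freq)

# The five transactions of Fig. 16.5.
transactions = [set("ac"), set("acd"), set("ad"), set("abc"), set("cd")]
root, freq = build_fp_tree(transactions, min_support=1)
```

The resulting tree reproduces the node counts shown in the figure: the
path a:4 → c:3 branches into d:1 and b:1, a:4 has a second child d:1, and
a separate path c:1 → d:1 covers the transaction {c, d}.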
common (and therefore not really interesting) itemsets, the frequent
itemsets will usually be abundant and therefore, as a result of data
exploration, not useful either. In frequent pattern mining, this
phenomenon is known as the pattern explosion. Owing to the exponential
number of possible subspaces and the type of interestingness measures
used, subspace clustering inherited this problem along with the
techniques transferred from frequent pattern mining. For non-trivial
thresholds, huge sets of subspace clusters are usually discovered, and
these are typically quite redundant.
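The pattern explosion can be made concrete with a brute-force miner on a
deliberately dense toy dataset (the code and names are illustrative, not
from the text): with only 12 items shared by every transaction, every one
of the 2^12 − 1 non-empty item subsets is frequent.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Naive enumeration of all frequent itemsets with their supports."""
    items = sorted({i for t in transactions for i in t})
    found = []
    for k in range(1, len(items) + 1):
        for cand in combinations(items, k):
            support = sum(1 for t in transactions if set(cand) <= set(t))
            if support >= min_support:
                found.append((cand, support))
    return found

# Every transaction contains the same 12 items, so every non-empty
# subset of the items is frequent: 2**12 - 1 = 4095 itemsets.
dense = [set(range(12)) for _ in range(10)]
print(len(frequent_itemsets(dense, min_support=10)))  # 4095
```

Even at the maximum possible support threshold, the result set already
has thousands of members, and all but the 12-item set are redundant.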
Different means have been studied to condense the result set of patterns or to
restrict the search space further in the first place.
One approach among others is mining or keeping only those itemsets that
cannot be extended further without dropping below the threshold, i.e.,
the maximal frequent itemsets [16]. An alternative approach uses borders
to represent a lossless compression of the result set of frequent
patterns, named closed frequent itemsets [68]. Another branch of
summarization is that of picking or creating a number of representative
results. Yan et al. [78] choose a subset of results such that the error
of predicting the frequencies in the complete result set is minimized.
Mampaey et al. [57] give an information-theoretic approach to identifying
the subset of results by which the frequencies in either the complete
result set, or the data in general, can best be approximated. To this
end, they define a maximum entropy model for data objects, given
knowledge about itemset frequencies. The resulting models capture the
general structure of the data very well, without redundancy.
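Returning to the toy transactions of Fig. 16.5, a brute-force sketch
(illustrative code, not the algorithms of the cited papers) can contrast
plain frequent itemsets with their maximal and closed condensations:

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Map each frequent itemset (as a frozenset) to its support."""
    items = sorted({i for t in transactions for i in t})
    freq = {}
    for k in range(1, len(items) + 1):
        for cand in combinations(items, k):
            support = sum(1 for t in transactions if set(cand) <= set(t))
            if support >= min_support:
                freq[frozenset(cand)] = support
    return freq

def closed_and_maximal(freq):
    # Closed: no proper superset has the same support (lossless summary).
    closed = {s for s, sup in freq.items()
              if all(freq[t] < sup for t in freq if s < t)}
    # Maximal: no proper superset is frequent at all (lossy summary).
    maximal = {s for s in freq if not any(s < t for t in freq)}
    return closed, maximal

transactions = [set("ac"), set("acd"), set("ad"), set("abc"), set("cd")]
freq = frequent_itemsets(transactions, min_support=2)
closed, maximal = closed_and_maximal(freq)
```

With a support threshold of 2 there are six frequent itemsets, but only
three maximal ones, {a,c}, {a,d}, and {c,d}, from which all six can be
recovered (though their exact supports cannot, which is what the closed
itemsets preserve).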
Just as the basic techniques of frequent pattern mining were carried
over, these ideas for condensing the result set and for restricting the
search space have also found corresponding solutions to the problem of
redundant results in subspace clustering.