[Fig. 16.5 data, reconstructed from the figure residue:
Transactions — 1: {a, c}; 2: {a, c, d}; 3: {a, d}; 4: {a, b, c}; 5: {c, d}.
Item frequencies — a:4, b:1, c:4, d:3.
FP-tree paths with node counts — a:4 → c:3 → d:1; a:4 → c:3 → b:1;
a:4 → d:1; c:1 → d:1]
Fig. 16.5 FP-tree example: the tree nodes store items and their counts;
paths correspond to combinations of itemsets and their respective counts.
The index is built in just two scans over the data, and the frequent
itemset mining algorithm FP-Growth works exclusively on the index. Once
individual item frequencies are established, the second scan updates
counts for each transaction or creates new nodes where necessary.
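The two-scan construction described in the caption can be sketched as
follows. This is an illustrative sketch, not FP-Growth itself or code from
any particular library; the `Node` class and `build_fp_tree` function are
assumed names, and the transactions are those of Fig. 16.5.

```python
from collections import defaultdict

class Node:
    """One FP-tree node: an item, its count, a parent link, and children."""
    def __init__(self, item, parent):
        self.item, self.count, self.parent = item, 1, parent
        self.children = {}

def build_fp_tree(transactions, min_support):
    # First scan: establish individual item frequencies.
    freq = defaultdict(int)
    for t in transactions:
        for item in t:
            freq[item] += 1
    # Frequent items in descending frequency order (ties alphabetically).
    order = {item: rank for rank, (item, _) in enumerate(
        sorted(((i, c) for i, c in freq.items() if c >= min_support),
               key=lambda p: (-p[1], p[0])))}
    root = Node(None, None)
    root.count = 0
    # Second scan: each transaction's frequent items, in the global order,
    # either update counts along an existing path or create new nodes.
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in order), key=order.get):
            if item in node.children:
                node.children[item].count += 1
            else:
                node.children[item] = Node(item, node)
            node = node.children[item]
    return root, dict(freq)

# The five transactions of Fig. 16.5.
transactions = [set("ac"), set("acd"), set("ad"), set("abc"), set("cd")]
root, freq = build_fp_tree(transactions, min_support=1)
```

The resulting tree reproduces the node counts shown in the figure: the
path a:4 → c:3 branches into d:1 and b:1, a:4 has a second child d:1, and
a separate path c:1 → d:1 covers the transaction {c, d}.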
common (and therefore not really interesting) itemsets, the frequent
itemsets will usually be abundant and therefore, as a result of data
exploration, not useful either. In frequent pattern mining, this
phenomenon is known as the pattern explosion. Owing to the exponential
number of possible subspaces and the type of interestingness measures
used, subspace clustering inherited this problem along with the
techniques transferred from frequent pattern mining. For non-trivial
thresholds, huge sets of subspace clusters are usually discovered, and
these are typically quite redundant.
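The pattern explosion can be made concrete with a brute-force miner on a
deliberately dense toy dataset (the code and names are illustrative, not
from the text): with only 12 items shared by every transaction, every one
of the 2^12 − 1 non-empty item subsets is frequent.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Naive enumeration of all frequent itemsets with their supports."""
    items = sorted({i for t in transactions for i in t})
    found = []
    for k in range(1, len(items) + 1):
        for cand in combinations(items, k):
            support = sum(1 for t in transactions if set(cand) <= set(t))
            if support >= min_support:
                found.append((cand, support))
    return found

# Every transaction contains the same 12 items, so every non-empty
# subset of the items is frequent: 2**12 - 1 = 4095 itemsets.
dense = [set(range(12)) for _ in range(10)]
print(len(frequent_itemsets(dense, min_support=10)))  # 4095
```

Even at the maximum possible support threshold, the result set already
has thousands of members, and all but the 12-item set are redundant.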
Different means have been studied to condense the result set of patterns or to
restrict the search space further in the first place.
One approach among others is mining or keeping only those itemsets that
cannot be extended further without dropping below the threshold, i.e.,
the maximal frequent itemsets [16]. An alternative approach uses borders
to represent a lossless compression of the result set of frequent
patterns, named closed frequent itemsets [68]. Another branch of
summarization is that of picking or creating a number of representative
results. Yan et al. [78] choose a subset of results such that the error
of predicting the frequencies in the complete result set is minimized.
Mampaey et al. [57] give an information-theoretic approach to identifying
the subset of results by which the frequencies in either the complete
result set, or the data in general, can best be approximated. To this
end, they define a maximum entropy model for data objects, given
knowledge about itemset frequencies. The resulting models capture the
general structure of the data very well, without redundancy.
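Returning to the toy transactions of Fig. 16.5, a brute-force sketch
(illustrative code, not the algorithms of the cited papers) can contrast
plain frequent itemsets with their maximal and closed condensations:

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Map each frequent itemset (as a frozenset) to its support."""
    items = sorted({i for t in transactions for i in t})
    freq = {}
    for k in range(1, len(items) + 1):
        for cand in combinations(items, k):
            support = sum(1 for t in transactions if set(cand) <= set(t))
            if support >= min_support:
                freq[frozenset(cand)] = support
    return freq

def closed_and_maximal(freq):
    # Closed: no proper superset has the same support (lossless summary).
    closed = {s for s, sup in freq.items()
              if all(freq[t] < sup for t in freq if s < t)}
    # Maximal: no proper superset is frequent at all (lossy summary).
    maximal = {s for s in freq if not any(s < t for t in freq)}
    return closed, maximal

transactions = [set("ac"), set("acd"), set("ad"), set("abc"), set("cd")]
freq = frequent_itemsets(transactions, min_support=2)
closed, maximal = closed_and_maximal(freq)
```

With a support threshold of 2 there are six frequent itemsets, but only
three maximal ones, {a,c}, {a,d}, and {c,d}, from which all six can be
recovered (though their exact supports cannot, which is what the closed
itemsets preserve).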
Just as the basic techniques of frequent pattern mining were carried
over, these ideas for condensing the result set and for restricting the
search space have also found corresponding solutions to the problem of
redundant results in subspace clustering.