Table 16.2  Translating frequent pattern mining concepts to subspace search

    Frequent pattern mining    Subspace search
    Item                       Dimension (attribute)
    Itemset                    Subspace (set of attributes)
    Frequent itemset           'Interesting' subspace
measure of 'interestingness' (see Table 16.2 for a summary). How to measure this
'interestingness' in a way that satisfies anti-monotonicity is the crucial question,
and the answer differs from approach to approach. Note that many methods follow the
general idea of candidate elimination in subspace search without adhering to a
strictly anti-monotone criterion; instead, they rely on the observation that
anti-monotonicity of their criterion 'usually' holds.
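To make the pruning principle concrete, the sketch below shows how an anti-monotone criterion enables candidate elimination: once a subspace has been rejected, all of its supersets can be skipped without evaluation. The function name and the abstract is_interesting predicate are illustrative placeholders, not taken from any particular method.

```python
# Sketch of superset pruning under an anti-monotone 'interestingness' criterion.
# 'is_interesting' stands in for whatever measure a concrete method defines.

def prune_candidates(candidate_subspaces, is_interesting):
    """Evaluate candidate subspaces (frozensets of dimensions) from small to large,
    skipping every superset of an already-rejected subspace."""
    interesting, rejected = [], []
    for subspace in sorted(candidate_subspaces, key=len):
        # Anti-monotonicity: a superset of an uninteresting subspace cannot be interesting.
        if any(r <= subspace for r in rejected):
            continue
        if is_interesting(subspace):
            interesting.append(subspace)
        else:
            rejected.append(subspace)
    return interesting
```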
2.2 Count Indexes
Generalized monotonicity is a very useful property for pruning the search space
in both frequent itemset mining and subspace clustering. As part of the Apriori
algorithm, however, candidate itemsets or subspaces have to be generated. For large
sets of items and for high-dimensional subspaces (i.e., subspaces with very many
attributes), this candidate generation can become a performance bottleneck [37].
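As a rough illustration of where this cost comes from, the following sketch shows a candidate-generation step in the style of Apriori: frequent k-itemsets are joined into (k+1)-candidates, and only candidates whose k-element subsets are all frequent are kept. The function name and the representation of itemsets as frozensets are assumptions for illustration, not the original formulation; the nested join over all pairs of frequent itemsets is what becomes expensive when there are many of them.

```python
from itertools import combinations

def apriori_gen(frequent_k):
    """Join frequent k-itemsets (frozensets) into (k+1)-candidates, discarding any
    candidate that has an infrequent k-subset (anti-monotone pruning)."""
    frequent_k = set(frequent_k)
    candidates = set()
    for a in frequent_k:
        for b in frequent_k:
            union = a | b
            if len(union) == len(a) + 1:  # a and b differ in exactly one element
                if all(frozenset(s) in frequent_k
                       for s in combinations(union, len(union) - 1)):
                    candidates.add(union)
    return candidates
```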
Taking a different approach, the so-called FP-Growth algorithm uses a specialized
index structure, the FP-tree, to maintain frequency counts of itemsets [37]. As
illustrated in Fig. 16.5, a node in this count index corresponds to the frequency
count of a particular item, and following a path from an item's node to the root
corresponds to the frequency count of a particular combination of items (an itemset).
The index can be constructed in two scans over the data: the first finds all frequent
items, and the second creates nodes and updates counts for each transaction.
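A minimal sketch of this two-scan construction is given below. It assumes transactions are given as lists of items and uses an absolute support threshold; the class and function names are illustrative, not part of the original algorithm description.

```python
from collections import defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, min_support):
    # Scan 1: count item occurrences and keep only frequent items.
    support = defaultdict(int)
    for t in transactions:
        for item in set(t):
            support[item] += 1
    frequent = {i: c for i, c in support.items() if c >= min_support}

    root = FPNode(None, None)
    header = defaultdict(list)  # item -> all tree nodes labelled with that item
    # Scan 2: insert each transaction's frequent items in a fixed
    # (descending-support) order, so common prefixes share nodes and counts.
    for t in transactions:
        items = sorted((i for i in set(t) if i in frequent),
                       key=lambda i: (-frequent[i], i))
        node = root
        for item in items:
            if item not in node.children:
                node.children[item] = FPNode(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += 1
    return root, header, frequent
```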
The FP-Growth algorithm is a depth-first approach. Starting from the most
frequent item, the corresponding combinations with other items are 'grown' by
recursively extracting the corresponding paths, until the index has been reduced
to a single path. The advantage of this method is that only frequent itemsets are
generated, and that only two scans over the data are needed to do so.
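Building on build_fp_tree above, the following sketch indicates how this recursive growth can proceed: for each frequent item, the prefix paths leading to its nodes form a conditional pattern base, which is mined recursively as a smaller tree. This is a simplified, assumed variant (it replicates paths according to node counts instead of carrying weights and omits the single-path shortcut), not the original implementation.

```python
def fp_growth(transactions, min_support, suffix=()):
    """Return a dict mapping each frequent itemset (as a sorted tuple) to its support."""
    root, header, frequent = build_fp_tree(transactions, min_support)
    patterns = {}
    for item, count in frequent.items():  # processing order does not affect the result
        itemset = tuple(sorted(suffix + (item,)))
        patterns[itemset] = count
        # Conditional pattern base: the prefix path of every node labelled 'item',
        # replicated according to that node's count.
        conditional = []
        for node in header[item]:
            path, parent = [], node.parent
            while parent.item is not None:
                path.append(parent.item)
                parent = parent.parent
            if path:
                conditional.extend([path] * node.count)
        if conditional:
            patterns.update(fp_growth(conditional, min_support, suffix + (item,)))
    return patterns
```

For example, fp_growth([['a', 'b'], ['b', 'c'], ['a', 'b', 'c']], min_support=2) would report the support of every itemset that occurs in at least two of the three transactions.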
As we will detail in the next section, this idea of compactly representing interesting
combinations in a count index and of traversing the search space depth-first has
also been applied to subspace clustering. The application is not straightforward,
because both the relevant subspace regions and a notion of similarity between
adjacent regions have to be defined; these concepts have no one-to-one counterparts
in frequent pattern mining.
2.3 Pattern Explosion and Redundancy
The downside to the frequency criterion and its monotonicity in frequent itemset
mining is that with a threshold low enough to avoid exclusion of all but the most