Table 16.2  Translating frequent pattern mining concepts to subspace search

    Frequent pattern mining    Subspace search
    Item                       Dimension (attribute)
    Itemset                    Subspace (set of attributes)
    Frequent itemset           'Interesting' subspace
measure of 'interestingness' (see Table 16.2 for a summary). How to measure this
'interestingness' in a way that satisfies anti-monotonicity is the crucial question,
and the answer differs from approach to approach. Note that many methods follow the
general idea of candidate elimination in subspace search without adhering to a
strictly anti-monotone criterion; instead, they rely on the observation that
anti-monotonicity of their criterion 'usually' holds.
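To make the pruning principle concrete, the sketch below shows how an anti-monotone criterion enables candidate elimination: once a subspace has been rejected, all of its supersets can be skipped without evaluation. The function name and the abstract is_interesting predicate are illustrative placeholders, not taken from any particular method.

```python
# Sketch of superset pruning under an anti-monotone 'interestingness' criterion.
# 'is_interesting' stands in for whatever measure a concrete method defines.

def prune_candidates(candidate_subspaces, is_interesting):
    """Evaluate candidate subspaces (frozensets of dimensions) from small to large,
    skipping every superset of an already-rejected subspace."""
    interesting, rejected = [], []
    for subspace in sorted(candidate_subspaces, key=len):
        # Anti-monotonicity: a superset of an uninteresting subspace cannot be interesting.
        if any(r <= subspace for r in rejected):
            continue
        if is_interesting(subspace):
            interesting.append(subspace)
        else:
            rejected.append(subspace)
    return interesting
```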
2.2 Count Indexes
Generalized monotonicity is a very useful property for pruning the search space
in both frequent itemset mining and subspace clustering. As part of the Apriori
algorithm, however, candidate itemsets or subspaces have to be generated. For large
sets of items and for high-dimensional subspaces (i.e., subspaces with very many
attributes), this candidate generation can become a performance bottleneck [37].
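As a rough illustration of where this cost comes from, the following sketch shows a candidate-generation step in the style of Apriori: frequent k-itemsets are joined into (k+1)-candidates, and only candidates whose k-element subsets are all frequent are kept. The function name and the representation of itemsets as frozensets are assumptions for illustration, not the original formulation; the nested join over all pairs of frequent itemsets is what becomes expensive when there are many of them.

```python
from itertools import combinations

def apriori_gen(frequent_k):
    """Join frequent k-itemsets (frozensets) into (k+1)-candidates, discarding any
    candidate that has an infrequent k-subset (anti-monotone pruning)."""
    frequent_k = set(frequent_k)
    candidates = set()
    for a in frequent_k:
        for b in frequent_k:
            union = a | b
            if len(union) == len(a) + 1:  # a and b differ in exactly one element
                if all(frozenset(s) in frequent_k
                       for s in combinations(union, len(union) - 1)):
                    candidates.add(union)
    return candidates
```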
Taking a different approach, the so-called FP-Growth algorithm uses a specialized
index structure, the FP-tree, to maintain frequency counts of itemsets [37]. As
illustrated in Fig. 16.5, a node in this count index corresponds to the frequency
count of a particular item, and following a path from an item's node to the root
corresponds to the frequency count of a particular combination of items (an itemset).
The index can be constructed in two scans over the data: the first finds all frequent
items, and the second creates nodes and updates counts for each transaction.
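A minimal sketch of this two-scan construction is given below. It assumes transactions are given as lists of items and uses an absolute support threshold; the class and function names are illustrative, not part of the original algorithm description.

```python
from collections import defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, min_support):
    # Scan 1: count item occurrences and keep only frequent items.
    support = defaultdict(int)
    for t in transactions:
        for item in set(t):
            support[item] += 1
    frequent = {i: c for i, c in support.items() if c >= min_support}

    root = FPNode(None, None)
    header = defaultdict(list)  # item -> all tree nodes labelled with that item
    # Scan 2: insert each transaction's frequent items in a fixed
    # (descending-support) order, so common prefixes share nodes and counts.
    for t in transactions:
        items = sorted((i for i in set(t) if i in frequent),
                       key=lambda i: (-frequent[i], i))
        node = root
        for item in items:
            if item not in node.children:
                node.children[item] = FPNode(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += 1
    return root, header, frequent
```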
The FP-Growth algorithm is a depth-first approach. Starting from the most
frequent item, the corresponding combinations with other items are 'grown' by
recursively extracting the corresponding paths, until the index has been reduced
to a single path. The advantage of this method is that only frequent itemsets are
generated, and that only two scans over the data are needed to do so.
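Building on build_fp_tree above, the following sketch indicates how this recursive growth can proceed: for each frequent item, the prefix paths leading to its nodes form a conditional pattern base, which is mined recursively as a smaller tree. This is a simplified, assumed variant (it replicates paths according to node counts instead of carrying weights and omits the single-path shortcut), not the original implementation.

```python
def fp_growth(transactions, min_support, suffix=()):
    """Return a dict mapping each frequent itemset (as a sorted tuple) to its support."""
    root, header, frequent = build_fp_tree(transactions, min_support)
    patterns = {}
    for item, count in frequent.items():  # processing order does not affect the result
        itemset = tuple(sorted(suffix + (item,)))
        patterns[itemset] = count
        # Conditional pattern base: the prefix path of every node labelled 'item',
        # replicated according to that node's count.
        conditional = []
        for node in header[item]:
            path, parent = [], node.parent
            while parent.item is not None:
                path.append(parent.item)
                parent = parent.parent
            if path:
                conditional.extend([path] * node.count)
        if conditional:
            patterns.update(fp_growth(conditional, min_support, suffix + (item,)))
    return patterns
```

For example, fp_growth([['a', 'b'], ['b', 'c'], ['a', 'b', 'c']], min_support=2) would report the support of every itemset that occurs in at least two of the three transactions.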
As we will detail in the next section, this idea of compactly representing interesting
combinations in a count index and of traversing the search space depth-first has
also been applied to subspace clustering. The application is not straightforward,
because both the relevant subspace regions and a notion of similarity between
adjacent regions have to be defined; these concepts have no one-to-one counterparts
in frequent pattern mining.
2.3 Pattern Explosion and Redundancy
The downside to the frequency criterion and its monotonicity in frequent itemset
mining is that with a threshold low enough to avoid exclusion of all but the most