Frequent Pattern Mining Algorithms: A Survey - Frequent Pattern Mining

Database Reference

In-Depth Information

be lower bounded as follows:

sup ( A ∪ B )

≥ sup ( A )

+ sup ( B )

− sup ( A ∩ B )

(2.1)

This condition follows directly from set-theoretic considerations. Thus, the support

of ( k

1)-candidates can be lower bounded in terms of the (already computed) support

values of itemsets of length k or less. If the computed value on the right-hand side

is greater than the required minimum support, then the counting of the candidate

does not need to be performed explicitly, and therefore considerable savings can be

achieved. An example of a method which uses this kind of pruning is the Apriori_LB

method [ 10 ].

Another interesting rule is that if the support of an itemset X is the same as that

of X

Y , then for any superset X ⊇

∪

X , it is the case that the support of the itemset

X is the same as that of X ∪

Y . This rule can be shown directly as a corollary of the

equation above. This is very useful in a variety of frequent pattern mining algorithms.

For example, once the support of X

has been shown to be the same as that of

X , then, for any superset X of X , it is no longer necessary to explicitly compute

the support of X ∪{

∪{

}

, after the support of X has already been computed. Such

optimizations have been shown to be quite effective in the context of many frequent

pattern mining algorithms [ 13 , 51 , 17 ]. As discussed later, this trick is not exclusive

to join-based algorithms, and is often used effectively in tree-based algorithms such

as MaxMiner , and MAFIA .

}

2.5

Hypercube Decomposition

One feasible way to reduce the computation cost of support counting is to find support

of multiple frequent patterns at one time. LCM [ 66 ] devise a technique referred to as

hypercube decomposition in this purpose. The multiple itemsets obtained at one time,

comprise a hypercube in the itemset lattice. Suppose that P is a frequent pattern,

tidset ( P ) contains the transactions that P is part of, and tail( P ) denotes the latest

item extension to the itemset P . H ( P ) is the set of items e satisfying e>tail ( P )

and tidset ( P )

tidset ( P

∪

e ). The set H ( P ) is referred to as the hypercube set.

Then, for any P

is frequent. The work in [ 66 ] uses this property in the candidate generation phase.

For two itemsets P and P

P )

⊆

H ( P ), tidset ( P

∪

tidset ( P ) is true, and P

∪

P , we say that P is between P and P

P if

∪

P . In the phase with respect to P , we output all P between P

⊆

∪

and P

∪

H ( P ). This technique saves significant time in counting.

Tree-Based Algorithms

The tree-based algorithm is based on set-enumeration concepts. The candidates can

be explored with the use of a subgraph of the lattice of itemsets (see Fig. 2.2 ), which

is also referred to as the lexicographic tree or enumeration tree [ 5 ]. These terms will,

Frequent Pattern Mining

Search WWH ::

Custom Search

Home