Essentially, compared to the earlier level-wise algorithm, this algorithm traverses the search space in a different order, in which some long patterns are already considered before some shorter patterns are evaluated. As a result, optimizations based on the fact that short patterns have been seen before long patterns cannot be used. In practice, however, these algorithms can be more efficient. The reason for this is that most implementations take care to maintain data structures that allow for incremental constraint evaluation: for instance, to calculate the support of a pattern, they do not traverse the whole dataset, but only consider those transactions covered by the pattern's parent in the search tree. As depth-first algorithms do not need to maintain a large number of candidates, maintaining such additional data structures is feasible. A well-known data structure in this context is the FP-tree (see the chapter on pattern growth for more details) [27].
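To make this concrete, the following minimal sketch (all names, such as mine, are ours and not from the chapter) shows incremental support evaluation in a depth-first miner: each node of the search tree keeps the list of transactions covered by its pattern, so evaluating an extension only requires scanning the parent's cover, never the full dataset.

```python
def mine(prefix, cover, candidates, min_support, results):
    # prefix: the current itemset; cover: the transactions (sets of items)
    # that contain it. Every call reports one pattern (including the empty
    # pattern at the root) together with its support.
    results.append((prefix, len(cover)))
    for i, item in enumerate(candidates):
        # Incremental constraint evaluation: intersect with the parent's
        # cover only, instead of re-scanning the whole dataset.
        new_cover = [t for t in cover if item in t]
        if len(new_cover) >= min_support:  # anti-monotonic pruning
            mine(prefix | {item}, new_cover, candidates[i + 1:],
                 min_support, results)

# Toy dataset: find all itemsets occurring in at least 2 transactions.
D = [{1, 2, 3}, {1, 2}, {2, 3}, {1, 3}, {2}]
results = []
mine(frozenset(), D, [1, 2, 3], 2, results)
```

Passing candidates[i + 1:] to the recursive call ensures every itemset is enumerated exactly once; roughly speaking, an FP-tree plays the same role as the explicit cover lists here, but in a compressed form.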
Note that the above algorithm works for any pattern language, including graphs,
strings and trees, as long as we know that the constraint ϕ is anti-monotonic.
When some constraints are not anti-monotonic, a basic approach for dealing with
them, as in the case of breadth-first search, is to ignore them during the search
and post-process the output of the above algorithm. A similar trick can be used for
boundable constraints. In this case, the anti-monotonic bound is used during the
depth-first search, and each pattern found is finally evaluated using the original
constraint in a post-processing step.
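As a hedged sketch of these two generic strategies (reusing the mine routine from the sketch above; the helper name and the example constraint are our own), one can run the depth-first search with only the anti-monotonic constraint, or with the anti-monotonic bound of a boundable constraint, and filter the output afterwards:

```python
def mine_then_filter(D, items, min_support, constraint):
    # Step 1: search using only the anti-monotonic (bounding) constraint.
    found = []
    mine(frozenset(), D, items, min_support, found)
    # Step 2: post-process, evaluating the original constraint exactly.
    return [(p, s) for (p, s) in found if constraint(p, s)]

# Example: "size is exactly 2" is not anti-monotonic, so it is checked
# only in the post-processing step.
pairs = mine_then_filter(D, [1, 2, 3], 2,
                         lambda pattern, support: len(pattern) == 2)
```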
Many studies have explored the possibilities for deriving more efficient algorithms for constraints more complex than anti-monotonic ones. Most of these studies have focused on the pattern language of itemsets, as additional pruning appears to be easiest to derive for this language. We will discuss these approaches in a generic way in the next section, inspired by the work of Bucila et al. and Guns et al. [8, 13].
4.2 Constraint-based Itemset Mining
The key idea in efficient depth-first constraint-based itemset mining algorithms is to maintain four sets in each node of the search tree, which are upper and lower bounds on the itemsets and transaction sets that can still be found:

• I_U, the largest itemset we believe we can still find;
• I_L, the smallest itemset we believe we can still find;
• T_U, the largest transaction set we believe we can still find;
• T_L, the smallest transaction set we believe we can still find.
For some constraints, not all four of these sets need to be maintained, but in the most generic setting all four are maintained.
During the search, any modification of one of these four sets may be a reason to modify another of them as well. We will refer to this process of modifying one set based on the modification of another set as propagation. Approaches differ in the algorithms and data structures they use to perform propagation.
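As an example of what propagation can look like for a minimum-support constraint (a sketch under our own naming; the chapter does not fix an implementation): adding an item to I_L forces T_U to shrink, which in turn may force items out of I_U, and may prune the branch entirely.

```python
def propagate(I_L, I_U, T_U, D, min_support):
    # D maps transaction ids to itemsets. T_L is omitted here: the
    # minimum-support constraint does not need it.
    # T_U may only contain transactions covering every item in I_L.
    T_U = {t for t in T_U if I_L <= D[t]}
    if len(T_U) < min_support:
        return None  # no itemset in this branch can reach min_support
    # Items of I_U that are infrequent within the reduced T_U can never
    # be added to the pattern anymore, so they are removed from I_U.
    I_U = {i for i in I_U
           if i in I_L or sum(1 for t in T_U if i in D[t]) >= min_support}
    return I_L, I_U, T_U

# After adding item 1 to I_L, T_U shrinks to the transactions containing 1.
D = {0: {1, 2, 3}, 1: {1, 2}, 2: {2, 3}, 3: {1, 3}}
state = propagate({1}, {1, 2, 3}, set(D), D, 2)
```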
An overview of the generic algorithm is given in Algorithm 3, in which initially I_L = T_L = ∅, I_U = I, and T_U = D. Line 1 performs the propagation for the constraints.
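Algorithm 3 itself is not reproduced here; the following skeleton is a sketch consistent with the description above, assuming the propagate routine from the previous sketch. It starts from the initial bounds, propagates in each node (the counterpart of line 1), and branches on one undecided item at a time.

```python
def search(I_L, I_U, T_U, D, min_support, results):
    state = propagate(I_L, I_U, T_U, D, min_support)  # cf. line 1
    if state is None:
        return  # propagation proved this branch empty
    I_L, I_U, T_U = state
    if I_L == I_U:
        # All items are decided: report the pattern and its cover.
        results.append((I_L, T_U))
        return
    # Branch on an undecided item: include it (left) or exclude it (right).
    item = next(iter(I_U - I_L))
    search(I_L | {item}, I_U, T_U, D, min_support, results)
    search(I_L, I_U - {item}, T_U, D, min_support, results)

# Initial bounds as in the text: I_L = T_L = ∅, I_U = I, T_U = D.
D = {0: {1, 2, 3}, 1: {1, 2}, 2: {2, 3}, 3: {1, 3}}
results = []
search(set(), {1, 2, 3}, set(D), D, 2, results)
```

Each itemset satisfying the constraint is reached exactly once, since every branch fixes the status of one item permanently.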