1 Introduction
Since the introduction of association mining in [3], there have been many studies on efficient and scalable frequent pattern mining algorithms. A milestone in these studies is the development of an Apriori-based, level-wise mining method for associations [1, 23], which has sparked the development of various kinds of Apriori-like association mining algorithms, as well as extensions of the approach to mining correlations [8], causality [34], sequential patterns [2], episodes [24], max-patterns [5], constraint-based mining [15, 20, 25, 36], associative classification [22], cyclic association rules [26], ratio rules [19], iceberg queries and iceberg cubes [7, 13], partial periodicity [16], emerging patterns [12], and many other kinds of patterns.
There is an important common ground among all these methods: the use of the anti-monotone Apriori property of frequent patterns [1]: if any length-k pattern is not frequent in the database, none of its length-(k + 1) super-patterns can be frequent. This property enables powerful pruning of the set of itemsets to be examined in the search for longer frequent patterns, based on the frequent patterns already found.
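To make the pruning concrete, the following Python sketch (the data representation and function name are illustrative choices of ours, not from the original text) rejects any length-k candidate that has an infrequent (k − 1)-subset, since by the Apriori property such a candidate cannot itself be frequent.

from itertools import combinations

def violates_apriori(candidate, frequent):
    """Return True if some (k-1)-subset of `candidate` is not in `frequent`.

    By the anti-monotone Apriori property, such a candidate cannot itself
    be frequent and can be pruned without consulting the database.
    """
    k = len(candidate)
    return any(frozenset(s) not in frequent
               for s in combinations(candidate, k - 1))

# If {a, b} is not frequent, the candidate {a, b, c} is pruned immediately.
frequent = {frozenset({"a"}), frozenset({"b"}), frozenset({"c"}),
            frozenset({"a", "c"}), frozenset({"b", "c"})}
print(violates_apriori(frozenset({"a", "b", "c"}), frequent))   # prints True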
Besides applying the Apriori property, most of the developed methods adopt a level-wise, candidate generation-and-test approach, which scans the database multiple times (although many techniques have been developed to reduce the number of database scans). The first scan finds all of the length-1 frequent patterns. The kth scan (for k > 1) starts with a seed set of the length-(k − 1) frequent patterns found in the previous pass and generates new potential length-k patterns, called candidate patterns. The kth scan of the database finds the support of every length-k candidate pattern. The candidates that pass the minimum support threshold are identified as frequent patterns and become the seed set for the next pass. The computation terminates when no frequent pattern is found in a pass or no candidate pattern can be generated.
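The level-wise procedure just described can be sketched as follows in Python. This is a simplified illustration of the generic candidate generation-and-test scheme, using our own naming and data representation; it is not a faithful reproduction of any particular published implementation.

from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise candidate generation-and-test over a list of transactions.

    `min_support` is an absolute count; returns a dict mapping each
    frequent itemset (a frozenset) to its support.
    """
    transactions = [frozenset(t) for t in transactions]

    # First scan: find all length-1 frequent patterns.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {p: c for p, c in counts.items() if c >= min_support}
    all_frequent = dict(frequent)

    k = 2
    while frequent:
        # Generate length-k candidates from the length-(k-1) seed set,
        # pruning any candidate with an infrequent (k-1)-subset.
        seeds = list(frequent)
        candidates = set()
        for i in range(len(seeds)):
            for j in range(i + 1, len(seeds)):
                union = seeds[i] | seeds[j]
                if len(union) == k and all(
                    frozenset(s) in frequent
                    for s in combinations(union, k - 1)
                ):
                    candidates.add(union)

        # kth scan: count the support of every surviving candidate.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1

        frequent = {p: c for p, c in counts.items() if c >= min_support}
        all_frequent.update(frequent)
        k += 1

    return all_frequent

# Usage: frequent itemsets with support >= 2 in a toy database.
db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
print(apriori(db, min_support=2))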
The candidate generation approach achieves good performance by reducing the number of candidates to be generated. However, when the minimum support threshold is low or the patterns to be generated are long, a candidate generation-based algorithm may still incur the following nontrivial costs, independent of the detailed implementation techniques.
1. The number of candidates to be generated may still be huge, especially when the length of the patterns to be generated is long. For example, to generate one frequent pattern of length 100, such as $\{a_1, a_2, \ldots, a_{100}\}$, the number of candidates that has to be generated is at least $\sum_{i=1}^{100} \binom{100}{i} = 2^{100} - 1 \approx 10^{30}$.
2. Each scan of the database examines the entire database against the whole set of current candidates, which is quite costly when the database is large and the number of candidates to be examined is large.
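As a quick, purely illustrative sanity check on the figure in item 1, the sum of binomial coefficients can be evaluated directly:

import math

# Every non-empty subset of a 100-item pattern appears as a candidate at
# some level of the level-wise search.
total = sum(math.comb(100, i) for i in range(1, 101))
assert total == 2**100 - 1
print(f"{total:.3e}")   # about 1.268e+30, i.e. on the order of 10^30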