Database Reference
In-Depth Information
2.1
Apriori Method
The most basic join-based algorithm is the Apriori method [ 1 ]. The Apriori approach
uses a level-wise approach in which all frequent itemsets of length k are generated
before those of length ( k
1). The main observation which is used for the Apriori
algorithm is that every subset of a frequent pattern is also frequent. Therefore, can-
didates for frequent patterns of length ( k
+
1) can be generated from known frequent
patterns of length k with the use of joins. A join is defined by pairs of frequent k -
patterns that have at least ( k
+
1) items in common. Specifically, consider a frequent
pattern
that is frequent, but has not yet been discovered because only
itemsets of length 3 have been discovered so far. In this case, because the patterns
{
{
i 1 , i 2 , i 3 , i 4 }
i 1 , i 2 , i 3 }
and
{
i 1 , i 2 , i 4 }
are frequent, they will be present in the set
F 3 of all frequent
patterns with length k
2 items in
common. By performing a join on this pair, it is possible to create the candidate pat-
tern
=
3. Note that this particular pair also has k
1
=
. This pattern is referred to as a candidate because it might possibly
be frequent, and one most either rule it in or rule it out by support counting. There-
fore, this candidate is then validated against the transaction database by counting its
support. Clearly, the design of an efficient support counting method plays a critical
role in the overall efficiency of the process. Furthermore, it is important to note that
the same candidate can be produced by joining multiple frequent patterns. For ex-
ample, one might join
{
i 1 , i 2 , i 3 , i 4 }
to achieve the same result. Therefore,
in order to avoid duplication in candidate generation, two itemsets are joined only
whether first ( k
{
i 1 , i 2 , i 3 }
and
{
i 2 , i 3 , i 4 }
1) items are the same, based on a lexicographic ordering imposed
on the items. This provides all the ( k +
1)-candidates in a non-redundant way.
It should be pointed out that some candidates can be pruned out in an efficient way,
without validating them against the transaction database. For any ( k
1)-candidates,
it is checked whether all its k subsets are frequent. Although it is already known that
two of its subsets contributing to the join are frequent, it is not known whether its
remaining subsets are frequent. If all its subsets are not frequent, then the candidate
can be pruned from consideration because of the downward closure property. This is
known as the Apriori pruning trick. For example, in the previous case, if the itemset
{
+
does not exist in the set of frequent 3-itemsets which have already been
found, then the candidate itemset
i 1 , i 3 , i 4 }
can be pruned from consideration with
no further computational effort. This greatly speeds up the overall algorithm. The
generation of 1-itemsets and 2-itemsets is usually performed in a specialized way
with more efficient techniques.
Therefore, the basic Apriori algorithm can be described recursively in level-wise
fashion. the overall algorithm comprises of three steps that are repeated over and
over again, for different values of k , where k is the length of the pattern generated in
the current iteration. The four steps are those of (i) generation of candidate patterns
C k + 1 by using joins on the patterns in
{
i 1 , i 2 , i 3 , i 4 }
F k , (ii) the pruning of candidates from
C k + 1 ,
F k , and (iii) the validation of the patterns in
C k + 1
for which all subsets to not lie in
against the transaction database
C k + 1 which is truly
frequent. The algorithm is terminated, when the set of frequent k -patterns
T
, to determine the subset of
F k in a
given iteration is empty. The pseudo-code of the overall procedure is presented in
Fig. 2.3 .
Search WWH ::




Custom Search