• Because each newly created transaction in T_k is much shorter, this makes subsequent support counting more efficient.
• In some cases, no candidate may be a subset of the transaction. Such a transaction can be dropped from the database because it does not contribute to the counting of support values.
• In other cases, more than one candidate may be a subset of the transaction, which will actually increase the overhead of the algorithm. Clearly, this is not a desirable scenario (see the sketch after this list).
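To make this concrete, the following is a minimal Python sketch of the transformed representation described by these points (not the original implementation; the function name and data layout are illustrative assumptions): each transaction is replaced by the list of candidate itemsets it contains, and a transaction that contains no candidate is dropped.

    # Sketch: rewrite each transaction as the list of candidate itemsets it
    # contains; transactions that contain no candidate are dropped.
    def transform_database(transactions, candidates):
        """transactions: list of item sets; candidates: list of frozensets."""
        transformed = []
        for t in transactions:
            contained = [c for c in candidates if c <= t]  # candidates that are subsets of t
            if contained:  # an empty list means the transaction can be dropped
                transformed.append(contained)
        return transformed

    # With candidates {a,b} and {b,c}, the transaction {a,b,c} matches both
    # (the overhead case), while {a,d} matches neither and is dropped.
    db = [{"a", "b", "c"}, {"a", "d"}]
    cands = [frozenset({"a", "b"}), frozenset({"b", "c"})]
    print(transform_database(db, cands))  # [[frozenset({'a', 'b'}), frozenset({'b', 'c'})]]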
Thus, the first two factors improve the efficiency of the new representation, whereas the last factor worsens it. Typically, the impact of the last factor is greater in the early iterations, whereas the impact of the first two factors is greater in the later iterations. Therefore, to maximize overall efficiency, a natural approach is to skip this optimization in the early iterations and apply it only in the later ones. This variation is referred to as the AprioriHybrid algorithm [1]. Another optimization, proposed in [9], is that the support of many patterns can be inferred from those of key patterns in the data; this is used to significantly enhance the efficiency of the approach.
Numerous other techniques have been proposed to optimize the original implementation of the Apriori algorithm. For example, the methods in [1] and [44] share a number of similarities but differ at the implementation level. A work that combines ideas from these different pieces of work is presented in [3].
2.2 DHP Algorithm
The DHP algorithm, also known as the Direct Hashing and Pruning method [50], was proposed soon after the Apriori method. It proposes two main optimizations to speed up the algorithm: the first is to prune the candidate itemsets in each iteration, and the second is to trim the transactions to make the support-counting process more efficient.
To prune the itemsets, the algorithm tracks partial information about candidate (k + 1)-itemsets while explicitly counting the support of candidate k-itemsets. During the counting of candidate k-itemsets, all (k + 1)-subsets of each transaction are found and hashed into a table that maintains the counts of the number of subsets hashed into each entry. During the phase of counting (k + 1)-itemsets, the count in the hash table is retrieved for each candidate itemset. Clearly, these counts are overestimates because of possible collisions in the hash table. Those itemsets whose counts are below the user-specified support level are then pruned from consideration.
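The following is a minimal Python sketch of this hashing scheme (the table size and hash function are illustrative assumptions, not the choices made in [50]; minsup is an absolute count): while counting candidate k-itemsets, every (k + 1)-subset of each transaction is hashed into a fixed-size table of counters, and a candidate (k + 1)-itemset is retained only if its bucket count reaches minsup.

    from itertools import combinations

    TABLE_SIZE = 1024  # illustrative; the real table size is a design parameter

    def bucket(itemset):
        # Any order-independent hash works; hashing a frozenset is one simple choice.
        return hash(frozenset(itemset)) % TABLE_SIZE

    def build_hash_table(transactions, k):
        """While counting k-itemsets, hash all (k+1)-subsets of each transaction."""
        table = [0] * TABLE_SIZE
        for t in transactions:
            for subset in combinations(sorted(t), k + 1):
                table[bucket(subset)] += 1
        return table

    def prune_candidates(candidates, table, minsup):
        """Bucket counts overestimate true supports (collisions only add counts),
        so pruning buckets below minsup never discards a frequent itemset."""
        return [c for c in candidates if table[bucket(c)] >= minsup]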
A second optimization proposed in DHP is that of transaction trimming. A key observation here is that if an item does not appear in at least k frequent itemsets in F_k, then no frequent itemset in F_{k+1} will contain that item. This follows from the fact that each frequent pattern in F_{k+1} that contains the item has at least k (immediate) k-subsets that also contain the item, and each of these subsets must itself be frequent. Such items can therefore be trimmed from the transactions, which shortens them for subsequent support counting.
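A minimal Python sketch of this trimming rule follows (the helper names are my own): count how many itemsets of F_k each item occurs in, and keep only items that occur in at least k of them.

    from collections import Counter

    def trim_transactions(transactions, frequent_k_itemsets, k):
        """Remove items that occur in fewer than k itemsets of F_k; such items
        cannot appear in any frequent (k+1)-itemset."""
        occurrences = Counter(item for itemset in frequent_k_itemsets
                              for item in itemset)
        keep = {item for item, count in occurrences.items() if count >= k}
        trimmed = [t & keep for t in transactions]
        return [t for t in trimmed if t]  # transactions emptied by trimming are dropped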