passes comes at an increased computational cost. Therefore, it was proposed in [1]
that the approach should be used only for later passes, when the number of candidates
has already been reduced significantly. This reduces the likelihood of the number of
candidates blowing up with this approach.
7.2 Sampling Tricks
A number of sampling tricks can be used to greatly improve the efficiency of the
frequent pattern mining process. Most sampling methods require two passes over
the data, the first of which is used for sampling. An interesting two-pass sampling
approach is discussed in [65]. This method generates approximately frequent patterns
from a sample of the data. False negatives can be reduced by lowering the minimum
support level appropriately, so that probabilistic bounds can be placed on the
likelihood of missing a true frequent pattern. False positives are then removed with
a second pass over the data. The major drawback of the approach is that the reduction
in the minimum support level needed to control false negatives can be significant,
which in turn reduces the computational efficiency of the approach. On the other
hand, the method requires only two passes over the data: the first pass is used to
create the sample, and the second pass is used to remove the false positives.
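The following minimal sketch illustrates this two-pass idea in Python. All names
and parameters (sample_frac, the slack factor on the support level) are illustrative
assumptions, and the brute-force level-wise miner stands in for whatever frequent
pattern mining algorithm is applied to the sample:

import random
from collections import Counter

def mine_frequent(transactions, minsup):
    # Brute-force level-wise (Apriori-style) miner; returns all itemsets
    # whose relative support is at least minsup.
    n = len(transactions)
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = set()
    k = 1
    while candidates:
        counts = Counter()
        for t in transactions:
            tset = set(t)
            for cand in candidates:
                if cand.issubset(tset):
                    counts[cand] += 1
        level = {c for c in candidates if counts[c] / n >= minsup}
        frequent |= level
        k += 1
        candidates = {a | b for a in level for b in level if len(a | b) == k}
    return frequent

def sample_and_verify(transactions, minsup, sample_frac=0.1, slack=0.8):
    # Pass 1: mine a random sample at a lowered support (minsup * slack),
    # which reduces the chance of false negatives at the cost of extra
    # candidates. The slack factor here is an illustrative choice.
    size = max(1, int(sample_frac * len(transactions)))
    sample = random.sample(transactions, size)
    candidates = mine_frequent(sample, minsup * slack)
    # Pass 2: count the candidates exactly over the full database and
    # discard the false positives.
    n = len(transactions)
    counts = Counter()
    for t in transactions:
        tset = set(t)
        for cand in candidates:
            if cand.issubset(tset):
                counts[cand] += 1
    return {c for c in candidates if counts[c] / n >= minsup}

Even with the lowered support level, false negatives remain possible; the analysis
in [65] chooses the reduction so that their probability is bounded.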
An interesting approach proposed in [57] divides the disk-resident database into
smaller memory-resident partitions. For each partition, more efficient algorithms
can be used, because of the memory-resident nature of the partition. The key
observation is that any itemset that is frequent over the entire database must be
frequent in at least one partition. Therefore, the union of the locally frequent
itemsets over the different partitions provides a superset of the true frequent
patterns. A postprocessing phase is then used to filter out the spurious itemsets,
by counting this candidate set against the transaction database. As long as the
partitions are reasonably large, the superset found approximates the true frequent
patterns very well, and therefore the additional time spent counting irrelevant
candidates is relatively small. The main advantage of this approach is that it
requires only two passes over the database. Therefore, such an approach is
particularly effective when the data is resident on disk.
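A minimal sketch of this two-phase strategy is shown below, assuming an in-memory
list of transactions stands in for the disk-resident database and reusing the
mine_frequent helper from the previous sketch as the per-partition mining
algorithm; the partition count is an illustrative parameter:

def partition_mine(transactions, minsup, num_partitions=4):
    n = len(transactions)
    size = (n + num_partitions - 1) // num_partitions
    # Phase 1: mine each memory-resident partition at the same relative
    # support level. Any itemset frequent over the full database must be
    # locally frequent in at least one partition, so the union of the
    # local results is a superset of the true frequent patterns.
    candidates = set()
    for start in range(0, n, size):
        candidates |= mine_frequent(transactions[start:start + size], minsup)
    # Phase 2: a single counting pass over the full database removes the
    # spurious (locally but not globally frequent) candidates.
    counts = Counter()
    for t in transactions:
        tset = set(t)
        for cand in candidates:
            if cand.issubset(tset):
                counts[cand] += 1
    return {c for c in candidates if counts[c] / n >= minsup}

Because phase 1 touches each partition once and phase 2 is a single counting scan,
the procedure reads the database exactly twice, which is what makes it attractive
for disk-resident data.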
The Dynamic Itemset Counting (DIC) algorithm [15] divides the database into
intervals, and begins counting a longer candidate as soon as its subsets are
already known to be frequent. These candidates are then validated over the
database. Such an approach can reduce the number of passes over the data, because
it implicitly combines the processes of candidate generation and counting.
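The following simplified sketch conveys the flavor of this interleaving, assuming
the database fits in a list; the interval size, the cyclic scanning, and the
suspected/confirmed bookkeeping are illustrative simplifications of the full
algorithm in [15]:

from itertools import combinations

def dic(transactions, minsup_count, interval=100):
    # Simplified Dynamic Itemset Counting: scan the database cyclically in
    # fixed-size intervals, and start counting a longer candidate as soon
    # as all of its subsets are suspected or confirmed to be frequent,
    # instead of waiting for the end of a full pass.
    n = len(transactions)
    # Each active candidate tracks [support_count, transactions_seen].
    counting = {frozenset([item]): [0, 0] for t in transactions for item in t}
    suspected = set()               # passed the threshold, count unfinished
    frequent, infrequent = set(), set()
    pos = 0
    while counting:
        block = [set(transactions[(pos + j) % n]) for j in range(interval)]
        for cand, state in list(counting.items()):
            for t in block:
                if state[1] >= n:   # already counted over the full database
                    break
                state[1] += 1
                if cand <= t:
                    state[0] += 1
            if state[0] >= minsup_count:
                suspected.add(cand)
            if state[1] >= n:       # finalize after one complete cyclic pass
                del counting[cand]
                suspected.discard(cand)
                (frequent if state[0] >= minsup_count else infrequent).add(cand)
        # Candidate generation is interleaved with counting: start counting
        # any extension whose subsets are all suspected or confirmed frequent.
        known = frequent | suspected
        for a in known:
            for b in known:
                cand = a | b
                if (len(cand) == len(a) + 1
                        and cand not in counting and cand not in frequent
                        and cand not in infrequent
                        and all(frozenset(s) in known
                                for s in combinations(cand, len(cand) - 1))):
                    counting[cand] = [0, 0]
        pos = (pos + interval) % n
    return frequent

With an interval much smaller than the database, candidates of different lengths
are counted concurrently, so most itemsets finish their counts within far fewer
scans than a strict level-wise algorithm would require.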