Information Technology Reference
In-Depth Information
5.
GL
=
{
X
CG
|
X
minsup
×
|
D
|,
X
minsup
×
|
D
|}
compute
k
i
k
.
sup
.
sup(
i
i
for all
i
,1≤
i
n
n
L
= G 1
GL
6. return
k
i
=
k i
( )
12.7.3 DIC-based algorithm
The DIC algorithm proposed by Sergey Brin and others is a generalization of
Apriori (Sergey, 1997). The database is divided into p equal-sized partitions so
that each partition fits in memory. For partition 1, DIC gathers the supports of
single items. Items found to be locally frequent (only in this partition) generate
candidate 2-itemsets. Then DIC reads partition 2 and obtains supports for all
current candidates. That is, the single items and the candidate 2-itemsets. This
process repeats for the remaining partitions. DIC starts counting candidate
k-itemsets while processing partition k in the first database scan. After the last
partition p has been processed, the processing wraps around to partition 1 again.
A candidate's global support is known once the processing wraps around the
database and reaches the partition where it was first generated.
DIC is effective in reducing the number of database scans if most partitions
are homogeneous. If data is not homogeneous, DIC might generate many false
positives (itemsets that are locally frequent but not globally frequent) and scan
the database more than Apriori does. DIC proposes a random partitioning
technique to reduce the data-partition skew.
Cheung and colleagues have proposed the Asynchronous Parallel Mining
algorithm, which is based on DIC(Cheung et al, 1998). APM uses FDM's
global-pruning technique to decrease the size of candidate 2-itemsets. This
pruning is most effective when there is high data skew among the partitions.
However, DIC requires that the partitions be homogeneous.
APM addresses this problem by treating the first iteration separately. APM
logically divides the database into many small, equal-sized virtual partitions. The
number of virtual partitions l is independent of the number of processors p, but
usually l ² p. Let m be the number of items. APM gathers the local counts of the
mitems in each partition. This forms an l × mdata set, with l item support vectors
in an m-dimensional space. APM groups these l vectors into k clusters,
maximizing intercluster distance and minimizing intracluster distance. Thus, the
Search WWH ::




Custom Search