Association Rules - Advanced Artificial Intelligence

Information Technology Reference

In-Depth Information

5.

GL

=

{

X

∈

CG

|

X

≥

minsup

×

|

D

|,

X

≥

minsup

×

|

D

|}

compute

k

i

k

.

sup

.

sup(

i

for all

i

,1≤

i ≤

n

L

= G 1

GL

6. return

k

i

=

k i

( )

12.7.3 DIC-based algorithm

The DIC algorithm proposed by Sergey Brin and others is a generalization of

Apriori (Sergey, 1997). The database is divided into p equal-sized partitions so

that each partition fits in memory. For partition 1, DIC gathers the supports of

single items. Items found to be locally frequent (only in this partition) generate

candidate 2-itemsets. Then DIC reads partition 2 and obtains supports for all

current candidates. That is, the single items and the candidate 2-itemsets. This

process repeats for the remaining partitions. DIC starts counting candidate

k-itemsets while processing partition k in the first database scan. After the last

partition p has been processed, the processing wraps around to partition 1 again.

A candidate's global support is known once the processing wraps around the

database and reaches the partition where it was first generated.

DIC is effective in reducing the number of database scans if most partitions

are homogeneous. If data is not homogeneous, DIC might generate many false

positives (itemsets that are locally frequent but not globally frequent) and scan

the database more than Apriori does. DIC proposes a random partitioning

technique to reduce the data-partition skew.

Cheung and colleagues have proposed the Asynchronous Parallel Mining

algorithm, which is based on DIC(Cheung et al, 1998). APM uses FDM's

global-pruning technique to decrease the size of candidate 2-itemsets. This

pruning is most effective when there is high data skew among the partitions.

However, DIC requires that the partitions be homogeneous.

APM addresses this problem by treating the first iteration separately. APM

logically divides the database into many small, equal-sized virtual partitions. The

number of virtual partitions l is independent of the number of processors p, but

usually l ² p. Let m be the number of items. APM gathers the local counts of the

mitems in each partition. This forms an l × mdata set, with l item support vectors

in an m-dimensional space. APM groups these l vectors into k clusters,

maximizing intercluster distance and minimizing intracluster distance. Thus, the

Advanced Artificial Intelligence

Search WWH ::

Custom Search

Home