Information Technology Reference
In-Depth Information
3. {scan one more interval on D i and count the supports of the itemsets on the
trie;
4. found out the locally large itemsets among the itemsets on the trie for the interval
scanned;
5. generate new candidates from these locally large itemsets;
6. perform pruning on these candidates and insert the survivors into the trie;
7. remove globally small itemsets on the trie;
8. }
9. }
Every processor in APM performs the dynamic generation and counting. The
new candidates generated by each process go through a pruning similar to the
partition pruning before the survivors are inserted into the trie. Before a
processor starts a new round of counting on the next interval,it traverses the trie
to remove itemsets which are found to be globally small. This will keep the size
of the trie minimum. If all the processors have counted all the itemsets on the trie,
then APM terminates.
12.7.4 Data skewness and workload balance
In a database partition, two data distribution characteristics, data skewness and
workload balance, have orthogonal effects on prunings and hence performance of
FPM. Intuitively, the data skewness of a partitioned database is high if the
supports of most large itemsets are clustered in a few partitions. It is low if the
supports of most large itemsets are distributed evenly across the processors.
David Cheung and Yongqiao Xiao have developed a skewness metric based
on the well established notion of entropy. Given a random variable
, it's entropy
is a measurement on how even or uneven its probability distribution is over its
values. If a database is partitioned over n processors, the value
X
X sup i
.
( )
px i
( )
=
X sup
.
can be regarded as the probability of occurrence of an itemset
X
in partition
D i ,
n
i
Ã
is a measurement of the
1≤
i
n.
The entropy
H(X)
= −
1 (
px i
( )
×
log(
px i
( )))
=
distribution of the local supports of
X
over the partitions For example, if
X
is
skewed completely into a single partition
D
k , (1≤
k
n
) , i.e., it only occurs in Dk,
then
p x (
k
) =1 and
p
x (
i
) =0, ∀ i
¬
k
. The value of H(
X
) = 0 is the minimal in this
Search WWH ::




Custom Search