In CD, each process is required to process only |T|/p transactions, while in DD, each process is responsible for only 1/p of the candidate itemsets but must process all |T| transactions. Since each transaction processed will generate the same number of k-size candidate itemsets, the computational cost for each algorithm at iteration k can be modeled as follows:
Cost_CD = (|T|/p) × f(M_k), and Cost_DD = |T| × f(M_k/p).
In order for the DD algorithm to be as computationally efficient as the CD algorithm, f(M_k/p) must be less than f(M_k) by a factor of p. However, Han et al. [19] showed that f(M_k/p) > f(M_k) × 1/p; thus, the DD algorithm introduces redundant computation.
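The requirement follows directly from the two cost expressions above; dividing the comparison through by |T| makes it explicit:

    Cost_DD ≤ Cost_CD
    ⟺  |T| × f(M_k/p) ≤ (|T|/p) × f(M_k)
    ⟺  f(M_k/p) ≤ f(M_k) × 1/p,

and it is exactly this last inequality that Han et al. showed does not hold.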
The authors proposed several optimizations to the distribution and processing of transactions that address these issues. When applied to the DD algorithm, these optimizations are referred to as Intelligent Data Distribution (IDD). In IDD, candidate itemsets are distributed to processes based on their prefix, following a lexicographic ordering of items. Each process is assigned a subset of I as the prefixes it is responsible for. A process can quickly check whether a transaction will generate any candidate itemsets that start with items from its subset of I, and skip non-matching transactions. Therefore, a process only traverses its hash tree for those candidate itemsets it is responsible for.
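To make the prefix-based pruning concrete, the following is a minimal Python sketch (not the authors' implementation) of the counting step at one process. It assumes candidates are tuples of lexicographically sorted item ids, a plain dictionary of counts stands in for the per-process hash tree, and owned_first_items is the process's assigned subset of I:

    from itertools import combinations

    def count_owned_candidates(transactions, owned_first_items, candidate_counts, k):
        # Count support only for candidates whose first item belongs to this process.
        for t in transactions:
            items = sorted(t)
            # Only the first len(items) - k + 1 items can start a k-itemset.
            for j, first in enumerate(items[:len(items) - k + 1]):
                if first not in owned_first_items:
                    continue  # prefix owned by another process: nothing to do here
                # Enumerate the k-itemsets of t that begin with `first`.
                for rest in combinations(items[j + 1:], k - 1):
                    cand = (first,) + rest
                    if cand in candidate_counts:  # stand-in for the hash-tree lookup
                        candidate_counts[cand] += 1
        return candidate_counts

A transaction whose eligible leading items all belong to other processes is skipped outright, which is the source of IDD's savings over DD.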
IDD uses a bin-packing algorithm [41] to ensure that each process receives approximately the same number of candidate itemsets and that all candidate itemsets assigned to a particular process begin with items from its subset of I. During each iteration, the number of candidate itemsets starting with each frequent item is counted. The candidate itemsets are then partitioned into p bins such that the sums of these per-item counts are roughly equal across bins. If the initial partitioning does not lead to evenly packed bins, longer prefixes are considered until the bins are evenly packed. The authors show, both theoretically and experimentally, that IDD addresses the memory bottleneck of the CD algorithm as well as the redundant work introduced by the DD algorithm.
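A simple greedy heuristic illustrates the balancing step (this is only a sketch; [41] describes the actual bin-packing method used). Given the number of candidate itemsets that start with each frequent item, the heaviest prefixes are assigned first, always to the currently lightest bin:

    import heapq

    def partition_prefixes(prefix_counts, p):
        # prefix_counts: item -> number of k-candidates starting with that item.
        # Returns p lists of items whose candidate counts are roughly balanced.
        bins = [(0, b, []) for b in range(p)]   # (load, bin id, assigned items)
        heapq.heapify(bins)
        for item, count in sorted(prefix_counts.items(), key=lambda kv: -kv[1]):
            load, b, assigned = heapq.heappop(bins)   # lightest bin so far
            assigned.append(item)
            heapq.heappush(bins, (load + count, b, assigned))
        return [assigned for _, _, assigned in sorted(bins, key=lambda x: x[1])]

If the resulting loads are still too uneven, the same idea can be applied to longer prefixes, as described above.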
In the Intelligent Data Distributed TreeProjection algorithm, Agarwal et al. [4] distributed the lexicographic tree by assigning to each process the subtrees associated with specific first items. The lexicographic tree is distributed such that the sum of the supports of the subtrees assigned to each process is as balanced as possible.
Shintani and Kitsuregawa [52] also recognized the redundant work inherent in the DD algorithm. Unlike the approach in IDD, their Hash Partitioned Apriori (HPA) algorithm uses a hash function to determine the process responsible for each subset of candidate itemsets. Each process scans its portion of the transaction set and generates k-size candidate itemsets. The algorithm hashes each itemset to determine the process responsible for computing its support. If an itemset hashes to the process's own id, the process increments the support count for the itemset in its own hash tree. Otherwise, the candidate itemset is sent to the process to which it was hashed, so that its support count can be incremented there.
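The following Python sketch (details assumed, not taken from [52]) illustrates the local counting pass at one process. Items are assumed to be integer ids so that hashing a candidate tuple is consistent across processes; my_candidates stands in for the process's portion of the hash tree, and outgoing collects the itemsets that would be sent to their owning processes:

    from collections import defaultdict
    from itertools import combinations

    def hpa_counting_pass(local_transactions, my_candidates, k, my_rank, num_procs):
        local_counts = defaultdict(int)   # support counts for locally owned candidates
        outgoing = defaultdict(list)      # itemsets destined for other processes
        for t in local_transactions:
            for itemset in combinations(sorted(t), k):
                owner = hash(itemset) % num_procs      # hash picks the responsible process
                if owner == my_rank:
                    if itemset in my_candidates:       # check against the local hash tree
                        local_counts[itemset] += 1
                else:
                    outgoing[owner].append(itemset)    # counted at the receiving process
        return local_counts, outgoing

Itemsets received from other processes would be checked and counted against the receiver's own candidates in the same way.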
 