In CD, each process is required to process only |T|/p transactions, while in DD, each process is responsible for only 1/p of the candidate itemsets but must process all |T| transactions. Since each transaction processed will generate the same number of k-size candidate itemsets, the computational cost for each algorithm at iteration k can be modeled as follows:
Cost_CD = (|T|/p) × f(M_k), and Cost_DD = |T| × f(M_k/p).
In order for the DD algorithm to be as computationally efficient as the CD algorithm, f(M_k/p) must be less than f(M_k) by a factor of p. However, Han et al. [19] showed that f(M_k/p) > f(M_k) × 1/p; thus, the DD algorithm introduces redundant computation.
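The requirement follows directly from the two cost expressions above; dividing the comparison through by |T| makes it explicit:

    Cost_DD ≤ Cost_CD
    ⟺  |T| × f(M_k/p) ≤ (|T|/p) × f(M_k)
    ⟺  f(M_k/p) ≤ f(M_k) × 1/p,

and it is exactly this last inequality that Han et al. showed does not hold.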
The authors proposed several optimizations to the distribution and processing of transactions that address these issues. When applied to the DD algorithm, these optimizations are referred to as Intelligent Data Distribution (IDD). In IDD, candidate itemsets are distributed to processes based on their prefix, following a lexicographic ordering of items. Each process is assigned a subset of I as the prefixes it is responsible for. A process can quickly check whether a transaction will generate any candidate itemsets that start with items from its subset of I, and skip non-matching transactions. Therefore, a process only traverses its hash tree for those candidate itemsets it is responsible for.
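To make the prefix-based pruning concrete, the following is a minimal Python sketch (not the authors' implementation) of the counting step at one process. It assumes candidates are tuples of lexicographically sorted item ids, a plain dictionary of counts stands in for the per-process hash tree, and owned_first_items is the process's assigned subset of I:

    from itertools import combinations

    def count_owned_candidates(transactions, owned_first_items, candidate_counts, k):
        # Count support only for candidates whose first item belongs to this process.
        for t in transactions:
            items = sorted(t)
            # Only the first len(items) - k + 1 items can start a k-itemset.
            for j, first in enumerate(items[:len(items) - k + 1]):
                if first not in owned_first_items:
                    continue  # prefix owned by another process: nothing to do here
                # Enumerate the k-itemsets of t that begin with `first`.
                for rest in combinations(items[j + 1:], k - 1):
                    cand = (first,) + rest
                    if cand in candidate_counts:  # stand-in for the hash-tree lookup
                        candidate_counts[cand] += 1
        return candidate_counts

A transaction whose eligible leading items all belong to other processes is skipped outright, which is the source of IDD's savings over DD.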
IDD uses a bin-packing algorithm [41] to ensure that each process receives approximately the same number of candidate itemsets and that all candidate itemsets assigned to a particular process begin with items from its subset of I. During each iteration, the number of candidate itemsets starting with each frequent item is counted. The candidate itemsets are then partitioned into p bins such that the sums of these per-item counts are roughly equal across bins. If the initial partitioning does not lead to evenly packed bins, longer prefixes are considered until the bins are evenly packed. The authors show, both theoretically and experimentally, that IDD addresses the memory bottleneck of the CD algorithm as well as the redundant work introduced by the DD algorithm.
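A simple greedy heuristic illustrates the balancing step (this is only a sketch; [41] describes the actual bin-packing method used). Given the number of candidate itemsets that start with each frequent item, the heaviest prefixes are assigned first, always to the currently lightest bin:

    import heapq

    def partition_prefixes(prefix_counts, p):
        # prefix_counts: item -> number of k-candidates starting with that item.
        # Returns p lists of items whose candidate counts are roughly balanced.
        bins = [(0, b, []) for b in range(p)]   # (load, bin id, assigned items)
        heapq.heapify(bins)
        for item, count in sorted(prefix_counts.items(), key=lambda kv: -kv[1]):
            load, b, assigned = heapq.heappop(bins)   # lightest bin so far
            assigned.append(item)
            heapq.heappush(bins, (load + count, b, assigned))
        return [assigned for _, _, assigned in sorted(bins, key=lambda x: x[1])]

If the resulting loads are still too uneven, the same idea can be applied to longer prefixes, as described above.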
In the Intelligent Data Distributed TreeProjection algorithm, Agarwal et al. [4] distributed the lexicographic tree by assigning to each process the subtrees associated with specific first items. The lexicographic tree is distributed such that the sum of the supports of the subtrees assigned to each process is as balanced as possible.
Shintani and Kitsuregawa [52] also recognized the redundant work inherent in the DD algorithm. Unlike the approach in IDD, their Hash Partitioned Apriori (HPA) algorithm uses a hash function to determine the process responsible for each subset of candidate itemsets. Each process scans its portion of the transaction set and generates k-size candidate itemsets. The algorithm hashes each itemset to determine the process responsible for computing its support. If an itemset hashes to the process's own id, the process increments the support count for the itemset in its own hash tree. Otherwise, the candidate itemset is sent to the process to which it was hashed, so that its support count can be incremented there.
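The following Python sketch (details assumed, not taken from [52]) illustrates the local counting pass at one process. Items are assumed to be integer ids so that hashing a candidate tuple is consistent across processes; my_candidates stands in for the process's portion of the hash tree, and outgoing collects the itemsets that would be sent to their owning processes:

    from collections import defaultdict
    from itertools import combinations

    def hpa_counting_pass(local_transactions, my_candidates, k, my_rank, num_procs):
        local_counts = defaultdict(int)   # support counts for locally owned candidates
        outgoing = defaultdict(list)      # itemsets destined for other processes
        for t in local_transactions:
            for itemset in combinations(sorted(t), k):
                owner = hash(itemset) % num_procs      # hash picks the responsible process
                if owner == my_rank:
                    if itemset in my_candidates:       # check against the local hash tree
                        local_counts[itemset] += 1
                else:
                    outgoing[owner].append(itemset)    # counted at the receiving process
        return local_counts, outgoing

Itemsets received from other processes would be checked and counted against the receiver's own candidates in the same way.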
 