the processes, in round-robin fashion. Guralnik and Karypis introduced several
task-parallel formulations, discussed in the next section, which partition the
problem into independent sub-problems via database projections.
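The round-robin assignment mentioned above can be sketched in a few lines of Python. This is an illustrative example only (the function name and data are hypothetical, not from any cited implementation): each sequence in the database is dealt out to the next process in turn.

```python
def round_robin_partition(sequences, num_procs):
    """Assign each sequence to a process in round-robin order."""
    partitions = [[] for _ in range(num_procs)]
    for i, seq in enumerate(sequences):
        partitions[i % num_procs].append(seq)
    return partitions

# Toy sequence database: each inner list is one sequence of items.
db = [["A", "B"], ["B", "C"], ["A"], ["C", "D"], ["B"]]
parts = round_robin_partition(db, 2)
# parts[0] receives sequences 0, 2, 4; parts[1] receives sequences 1, 3
```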
Input partitioning is not inherently necessary for shared memory or MapReduce
distributed systems. In shared memory systems, the input data is assumed to fit in
the aggregate system memory, where it can be read by all processes. However,
care must be taken to ensure that processes do not simultaneously attempt to read
from or write to the same block of memory.
Zaki [65] extended his serial FSM algorithm (SPADE)
to the shared memory parallel architecture, creating pSPADE. Input data is assumed
to reside on shared hard drive space, stored in the vertical database format. The author
proposed two data parallel formulations in pSPADE that partition the input space such that
different processes are responsible for reading different sections of the input data
(id-lists). Processes are either assigned id-lists for a subset of sequences, or portions of
all id-lists associated with a range of sequences in D. Then, processes collaborate to
expand each node in the itemset lattice. The author found that these formulations lead
to poor performance due to high synchronization and memory overheads. He then
proposed two task distribution schemes, discussed in the following sections, which
avoid read/write conflicts through independent assignment of search space
sub-lattices.
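The second data parallel formulation can be illustrated with a short Python sketch. The data layout and function names below are assumptions for exposition, not pSPADE's actual implementation: a horizontal database is converted into vertical id-lists, and each process receives the portion of every id-list whose sequence ids fall in its assigned range.

```python
from collections import defaultdict

def build_idlists(db):
    """Convert a horizontal sequence database into vertical id-lists:
    item -> list of (sequence_id, position) occurrences."""
    idlists = defaultdict(list)
    for sid, seq in enumerate(db):
        for pos, item in enumerate(seq):
            idlists[item].append((sid, pos))
    return idlists

def split_by_sid_range(idlists, num_procs, num_seqs):
    """Give each process the slice of every id-list whose sequence
    ids fall in that process's contiguous range."""
    bounds = [(p * num_seqs) // num_procs for p in range(num_procs + 1)]
    parts = []
    for p in range(num_procs):
        lo, hi = bounds[p], bounds[p + 1]
        parts.append({item: [(s, t) for (s, t) in occs if lo <= s < hi]
                      for item, occs in idlists.items()})
    return parts

# Toy database: sequence 0 = <A B A>, sequence 1 = <B C>, etc.
db = [["A", "B", "A"], ["B", "C"], ["A", "C"], ["C"]]
idl = build_idlists(db)
parts = split_by_sid_range(idl, 2, len(db))
# Process 0 holds occurrences from sequences 0-1, process 1 from sequences 2-3.
```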
The HDFS distributed file system provided by the Hadoop MapReduce framework
ensures adequate space exists for a large input dataset. However, data elements are
replicated across several processing nodes in the system and repeated scans of the data
will incur severe hard drive and/or network I/O penalties. Qiao et al. [44] found that a
straightforward extension of GSP to a MapReduce distributed environment (DGSP)
performed poorly due to repeated scans of the input data. Instead, the authors used
the CD algorithm for 1-length candidate sequence counting in an algorithm called
PartSpan, followed by a projection-based task partitioning for solving the remainder
of the problem. Similar strategies were followed by Qiao et al. [45] in PLUTE and
Miliaraki et al. [38] in MG-FSM.
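The 1-length candidate counting step can be illustrated with a minimal map/reduce-style sketch in plain Python. The function names and data here are hypothetical, and a real MapReduce job would shuffle the emitted pairs across the network rather than concatenate lists in memory; the point is only the counting pattern: each mapper emits every distinct item once per sequence, and the reducer sums counts and keeps items meeting the support threshold.

```python
from collections import Counter
from itertools import chain

def map_phase(partition):
    """Map: emit (item, 1) once per distinct item per sequence,
    since support counts sequences, not occurrences."""
    return [(item, 1) for seq in partition for item in set(seq)]

def reduce_phase(pairs, min_sup):
    """Reduce: sum partial counts and keep items meeting min_sup."""
    counts = Counter()
    for item, c in pairs:
        counts[item] += c
    return {item: n for item, n in counts.items() if n >= min_sup}

# Two input partitions, as if distributed across two mapper nodes.
partitions = [[["A", "B"], ["A", "C"]], [["B", "A"], ["C"]]]
pairs = list(chain.from_iterable(map_phase(p) for p in partitions))
frequent_1 = reduce_phase(pairs, min_sup=2)
# A appears in 3 sequences, B in 2, C in 2; all meet min_sup = 2.
```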
5.2.2 Work Partitioning
The FSM solution space is inherently irregular. Some processes may be assigned
subsections of the input with only short frequent sequences and in effect have less
work to complete than the rest of the processes. A number of authors have introduced
work partitioning schemes designed to combat these potential problems.
pSPADE [65] recursively decomposes the frequent sequence search space into
suffix-based equivalence classes, similar to those shown in Sect. 2. A node's sub-forest
can be computed independently using only the node's input data (id-lists). Zaki
proposed a static task partitioning scheme, Static Load Balancing (SLB), in which
nodes in the first level of the class-tree and their associated id-lists are assigned to
processes in round-robin fashion, in reverse order of the number of elements in each
node's class. This scheme cannot accurately gauge the amount of work needed to
mine each node's class and may still lead to load imbalance.
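SLB's assignment step can be sketched as follows. This is a simplified illustration, assuming "reverse order" means largest classes first; the class names and member counts are invented, and member count stands in for the (unknowable) true mining cost, which is exactly the limitation noted above.

```python
def static_load_balance(classes, num_procs):
    """SLB sketch: sort equivalence classes by decreasing size, then
    deal them out round-robin so large classes spread across processes."""
    ordered = sorted(classes.items(), key=lambda kv: len(kv[1]), reverse=True)
    assignment = {p: [] for p in range(num_procs)}
    for i, (name, _members) in enumerate(ordered):
        assignment[i % num_procs].append(name)
    return assignment

# Hypothetical first-level classes; list length approximates class size.
classes = {"A": [1, 2, 3, 4], "B": [1, 2, 3], "C": [1, 2], "D": [1]}
assign = static_load_balance(classes, 2)
# Process 0 gets classes A and C; process 1 gets classes B and D.
```

Because class size only approximates mining cost, two classes of equal size can still require very different work, which motivates the dynamic schemes discussed next.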