the processes, in round-robin fashion. Guralnik and Karypis introduced several
task-parallel formulations, discussed in the next section, which partition the
problem into independent sub-problems via database projections.
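The round-robin assignment mentioned above can be sketched in a few lines of Python. This is an illustrative example only (the function name and data are hypothetical, not from any cited implementation): each sequence in the database is dealt out to the next process in turn.

```python
def round_robin_partition(sequences, num_procs):
    """Assign each sequence to a process in round-robin order."""
    partitions = [[] for _ in range(num_procs)]
    for i, seq in enumerate(sequences):
        partitions[i % num_procs].append(seq)
    return partitions

# Toy sequence database: each inner list is one sequence of items.
db = [["A", "B"], ["B", "C"], ["A"], ["C", "D"], ["B"]]
parts = round_robin_partition(db, 2)
# parts[0] receives sequences 0, 2, 4; parts[1] receives sequences 1, 3
```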
Input partitioning is not inherently necessary for shared memory or MapReduce
distributed systems. In shared memory systems, the input data is assumed to fit in
the aggregate system memory, where it can be read by all processes. However,
care must be taken to ensure that processes do not simultaneously attempt to read
from or write to the same block of memory.
Zaki [65] extended his serial FSM algorithm (SPADE)
to the shared memory parallel architecture, creating pSPADE. Input data is assumed
to reside on shared hard drive space, stored in the vertical database format. The author
proposed two data parallel formulations in pSPADE that partition the input space such that
different processes are responsible for reading different sections of the input data
(id-lists). Processes are either assigned id-lists for a subset of sequences, or portions of
all id-lists associated with a range of sequences in D. Then, processes collaborate to
expand each node in the itemset lattice. The author found that these formulations lead
to poor performance due to high synchronization and memory overheads. He then
proposed two task distribution schemes, discussed in the following sections, which
avoid read/write conflicts through independent assignment of search space
sub-lattices.
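The second data parallel formulation can be illustrated with a short Python sketch. The data layout and function names below are assumptions for exposition, not pSPADE's actual implementation: a horizontal database is converted into vertical id-lists, and each process receives the portion of every id-list whose sequence ids fall in its assigned range.

```python
from collections import defaultdict

def build_idlists(db):
    """Convert a horizontal sequence database into vertical id-lists:
    item -> list of (sequence_id, position) occurrences."""
    idlists = defaultdict(list)
    for sid, seq in enumerate(db):
        for pos, item in enumerate(seq):
            idlists[item].append((sid, pos))
    return idlists

def split_by_sid_range(idlists, num_procs, num_seqs):
    """Give each process the slice of every id-list whose sequence
    ids fall in that process's contiguous range."""
    bounds = [(p * num_seqs) // num_procs for p in range(num_procs + 1)]
    parts = []
    for p in range(num_procs):
        lo, hi = bounds[p], bounds[p + 1]
        parts.append({item: [(s, t) for (s, t) in occs if lo <= s < hi]
                      for item, occs in idlists.items()})
    return parts

# Toy database: sequence 0 = <A B A>, sequence 1 = <B C>, etc.
db = [["A", "B", "A"], ["B", "C"], ["A", "C"], ["C"]]
idl = build_idlists(db)
parts = split_by_sid_range(idl, 2, len(db))
# Process 0 holds occurrences from sequences 0-1, process 1 from sequences 2-3.
```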
The HDFS distributed file system provided by the Hadoop MapReduce framework
ensures adequate space exists for a large input dataset. However, data elements are
replicated across several processing nodes in the system and repeated scans of the data
will incur severe hard drive and/or network I/O penalties. Qiao et al. [44] found that a
straightforward extension of GSP to a MapReduce distributed environment (DGSP)
performed poorly due to repeated scans of the input data. Instead, the authors used
the CD algorithm for 1-length candidate sequence counting in an algorithm called
PartSpan, followed by a projection-based task partitioning for solving the remainder
of the problem. Similar strategies were followed by Qiao et al. [45] in PLUTE and
Miliaraki et al. [38] in MG-FSM.
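The 1-length candidate counting step can be illustrated with a minimal map/reduce-style sketch in plain Python. The function names and data here are hypothetical, and a real MapReduce job would shuffle the emitted pairs across the network rather than concatenate lists in memory; the point is only the counting pattern: each mapper emits every distinct item once per sequence, and the reducer sums counts and keeps items meeting the support threshold.

```python
from collections import Counter
from itertools import chain

def map_phase(partition):
    """Map: emit (item, 1) once per distinct item per sequence,
    since support counts sequences, not occurrences."""
    return [(item, 1) for seq in partition for item in set(seq)]

def reduce_phase(pairs, min_sup):
    """Reduce: sum partial counts and keep items meeting min_sup."""
    counts = Counter()
    for item, c in pairs:
        counts[item] += c
    return {item: n for item, n in counts.items() if n >= min_sup}

# Two input partitions, as if distributed across two mapper nodes.
partitions = [[["A", "B"], ["A", "C"]], [["B", "A"], ["C"]]]
pairs = list(chain.from_iterable(map_phase(p) for p in partitions))
frequent_1 = reduce_phase(pairs, min_sup=2)
# A appears in 3 sequences, B in 2, C in 2; all meet min_sup = 2.
```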
5.2.2 Work Partitioning
The FSM solution space is inherently irregular. Some processes may be assigned
subsections of the input with only short frequent sequences and in effect have less
work to complete than the rest of the processes. A number of authors have introduced
work partitioning schemes designed to combat these potential problems.
pSPADE [65] recursively decomposes the frequent sequence search space into
suffix-based equivalence classes, similar to those shown in Sect. 2. A node's sub-forest
can be computed independently using only the node's input data (id-lists). Zaki
proposed a static task partitioning scheme, Static Load Balancing (SLB), in which
nodes in the first level of the class-tree and their associated id-lists are assigned to
processes in round-robin fashion, in reverse order of the number of elements in each
node's class. This scheme cannot accurately gauge the amount of work needed to
mine each node's class and may still lead to load imbalance.
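SLB's assignment step can be sketched as follows. This is a simplified illustration, assuming "reverse order" means largest classes first; the class names and member counts are invented, and member count stands in for the (unknowable) true mining cost, which is exactly the limitation noted above.

```python
def static_load_balance(classes, num_procs):
    """SLB sketch: sort equivalence classes by decreasing size, then
    deal them out round-robin so large classes spread across processes."""
    ordered = sorted(classes.items(), key=lambda kv: len(kv[1]), reverse=True)
    assignment = {p: [] for p in range(num_procs)}
    for i, (name, _members) in enumerate(ordered):
        assignment[i % num_procs].append(name)
    return assignment

# Hypothetical first-level classes; list length approximates class size.
classes = {"A": [1, 2, 3, 4], "B": [1, 2, 3], "C": [1, 2], "D": [1]}
assign = static_load_balance(classes, 2)
# Process 0 gets classes A and C; process 1 gets classes B and D.
```

Because class size only approximates mining cost, two classes of equal size can still require very different work, which motivates the dynamic schemes discussed next.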