support on the local portion of G. Ray and Holder [47] extended this work, proposing SP-SUBDUE-2, in which each process first searches its locally discovered subgraphs before doing expensive isomorphism tests across its partition of G.
Parallelism for pattern growth methods is commonly achieved by exploiting the independent nature of the subtrees before expansion. One of the most straightforward ways to partition work is to assign adjacent subtrees to processes and explore each one independently. This essentially operates as if a separate FGM program were being executed on each process. Meinl et al. [37] took this approach in their shared memory implementations of gSpan and MoFa.
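The coarse-grained partitioning described above can be sketched as follows. This is a toy illustration, not any cited implementation: the `explore` function and its two-letter extension alphabet are assumptions standing in for real pattern growth.

```python
# Hypothetical sketch of coarse-grained pattern-growth parallelism:
# each worker independently expands one subtree of the search space,
# as if a separate FGM program were running per process.
from concurrent.futures import ThreadPoolExecutor

def explore(subtree, max_depth):
    """Recursively enumerate extensions of one search-space subtree."""
    if len(subtree) >= max_depth:
        return [subtree]
    found = [subtree]
    for label in ("a", "b"):          # toy extension alphabet
        found.extend(explore(subtree + (label,), max_depth))
    return found

def parallel_mine(roots, max_depth, workers=4):
    # Adjacent subtrees are assigned to workers and explored with no
    # communication until all workers finish.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda r: explore(r, max_depth), roots)
    return [p for patterns in results for p in patterns]

patterns = parallel_mine([("a",), ("b",)], max_depth=3)
```

Because the workers never exchange work, this scheme is simple but inherits the load imbalance discussed in Sect. 6.2.3.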
Within the shared memory system context, Reinhardt and Karypis [48] created a parallel implementation of VSIGRAM, which works on a single large graph. Unlike SUBDUE and its parallel extensions, it is not an approximate algorithm. Their implementation features both coarse- and fine-grained work partitioning: subgraphs can be extended vertically in parallel, and parallelism can also be exploited in the support count of a single subgraph.
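The fine-grained side of this partitioning can be sketched as parallel support counting: candidate anchor vertices for one subgraph are tested concurrently. The toy graph, the triangle predicate, and all names below are illustrative assumptions, not VSIGRAM's actual data structures.

```python
# Hedged sketch of fine-grained parallelism in support counting on a
# single large graph: each worker tests a slice of candidate vertices,
# and the pattern's support is the number of vertices anchoring an
# embedding. The "triangle" pattern is a stand-in for a real subgraph.
from concurrent.futures import ThreadPoolExecutor

graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}, 3: {4}, 4: {3}}  # adjacency sets

def anchors_triangle(v):
    """Does vertex v anchor a triangle embedding? (toy isomorphism test)"""
    return any(w in graph[u] for u in graph[v] for w in graph[v] if u < w)

def parallel_support(vertices, workers=2):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        hits = pool.map(anchors_triangle, vertices)
    return sum(hits)

support = parallel_support(list(graph))
```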
6.2.3 Dynamic Load Balancing
Load balancing for FGM is a notoriously difficult problem. Not only is the search
space highly irregular, but accurately estimating the workload of a task is very difficult. Di Fatta and Berthold [14] showed that the time required for subtree exploration
in a biological dataset follows a power-law distribution.
Shared memory systems offer good opportunities for cheap dynamic load bal-
ancing. Work queues can be easily accessed by all processes and requesting or
sending work is generally an inexpensive operation. Buehrer and Parthasarathy [8] experimented with several work queue designs inside a shared-memory parallel implementation of gSpan. In their work, a task is defined as a tuple of the subgraph to be mined and the list of graphs in G in which it occurs. They evaluated three dynamic
load balancing schemes: (1) a global work queue, in which all processes cooperate
to enqueue and dequeue tasks; (2) hierarchical queues, in which each process main-
tains a local private queue but enqueues to a shared one if their local queue becomes
full; and (3) distributed queues, in which each process maintains a private queue and
offers any additional work to idle processes if the local queue becomes full. Their
experiments showed that a distributed queueing model offered the most effective
load balancing.
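The distributed-queue scheme that performed best can be sketched as below. This is a sequential simulation for clarity; the threshold, the task model, and the routing to the first idle worker are assumptions, not details from [8].

```python
# Minimal sketch of the "distributed queues" scheme: each worker owns a
# private queue, and once that queue grows past a threshold, surplus
# tasks are offered to an idle worker instead of being enqueued locally.
from collections import deque

class Worker:
    def __init__(self, limit=4):
        self.queue = deque()
        self.limit = limit

    def push(self, task, idle_workers):
        if len(self.queue) >= self.limit and idle_workers:
            # Local queue is full: donate the task to an idle worker.
            idle_workers[0].queue.append(task)
        else:
            self.queue.append(task)

busy, idle = Worker(limit=2), Worker()
for t in range(5):
    busy.push(t, [idle])
```

The appeal of this design is that workers only communicate when one of them is already idle, so the common case stays contention-free.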
Meinl et al. [37] evaluated another scheme for dynamic load balancing in their shared memory implementations of MoFa and gSpan. Each process maintains a local last-in-first-out (LIFO) work queue that can be shared with other processes. When work needs to be distributed to another process, the stack is split in half. The authors identify a heuristic for splitting the work evenly: since the work queue is LIFO, the tasks at the bottom of the stack will often require more work, because they were discovered at the beginning of execution and therefore have more subtrees left to explore. To balance the distribution, each half of the split is assigned every other task. The authors also compared sender-initiated and receiver-initiated
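The alternating split heuristic above can be sketched in a few lines. The task names are illustrative; the point is only that both halves mix old (heavy) and new (light) work.

```python
# Hedged sketch of the LIFO stack-split heuristic: tasks near the
# bottom of the stack tend to be heavier, so instead of handing over a
# contiguous half, the donor keeps every other task and gives the rest
# away, interleaving heavy and light work in both halves.
def split_stack(stack):
    """Split a LIFO work stack by alternating tasks between two halves."""
    kept = stack[0::2]     # donor keeps tasks at even positions
    given = stack[1::2]    # receiver gets tasks at odd positions
    return kept, given

kept, given = split_stack(["t0", "t1", "t2", "t3", "t4"])
```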