Big Data Frequent Pattern Mining - Frequent Pattern Mining

Database Reference

In-Depth Information

boundaries will be missed. The algorithm attempts to minimize these lost subgraphs

by assigning frequent edges large weights before partitioning, leading to fewer cuts

across them.

MapReduce offers a framework for developing parallel programs that operate

gracefully using secondary non-volatile storage. Secondary storage devices, such as

hard disk drives, are almost always larger than the amount of random access memory

available on a machine. MapReduce-based Pattern Finding (MRPF) by Liu et al. [ 34 ]

is a MapReduce framework for the FGM problem that assumes

is a single graph.

It employs a candidate generation scheme that uses two iterations of MapReduce to

explore each level of the search lattice. In the first iteration, the MAP phase generates

size- k candidates. The following REDUCE phase detects duplicates and removes them.

Finally, the second MapReduce pass computes the support of new candidates. Lu

et al. [ 35 ] developed MapReduce based Frequent Subgraph Extraction (MRFSE), a

candidate generation MapReduce implementation for the traditional many-graph

G

.

On iteration k ,aMAP task takes one graph from

and emits size- k candidates. Each

REDUCE task takes all of the appearances of a candidate and computes its support.

MRFSE also stores embedding lists to maintain the location of all appearances of

frequent subgraphs in

G

. By doing so, isomorphism tests are no longer necessary, at

the cost of additional memory usage.

G

6.2.2

Work Partitioning

Candidate generation methods consider a different set of candidates at each itera-

tion. Much like the parallel Apriori methods seen in Sect. 4 , the candidate generation

process can easily be decomposed into parallel tasks. Wang and Parthasarathy de-

veloped the MotifMiner Toolkit for distributed memory systems [ 57 ]. MotifMiner is

designed for both the case when

is a single graph. It

has a similar structure as the Data Distribution method from Sect. 4 .

G

is a set of graphs and when

G

is assumed to

be available to all processes and each process is assigned some subset of candidate

generation to perform. Once all processes have generated their local candidate sets,

each process shares its local candidates with all others.

Many parallel candidate generation methods identify the individual candidates to

expand as independent tasks. MRFSE, a MapReduce algorithm by Lu et al. [ 35 ],

assigns instead each graph in

G

as a task. The authors noted that the simple assignment

of the same number of graphs to each MAP task is not a good work partitioning scheme

because graphs in

G

can vary greatly in size. Instead, they count the edges of each

graph in

G

and assign tasks such that each MAP is assigned a roughly equal number

of edges.

SP-SUBDUE partitions the single graph

across processes in a distributed set-

ting. Each process then runs an independent instance of SUBDUE and support counts

are finalized by broadcasting the frequent subgraphs to all other processes. When

a process receives a new frequent subgraph from another partition, it computes its

G

Frequent Pattern Mining

Search WWH ::

Custom Search

Home