Big Data Frequent Pattern Mining - Frequent Pattern Mining

Database Reference

In-Depth Information

support in a single graph is that some embeddings of a subgraph G could be overlap-

ping. Kuramochi and Karypis [ 30 ] show that, if overlapping subgraphs are counted,

the problem is no longer downward closed. Thus, they only consider edge-disjoint

embeddings, i.e., those embeddings of G that do not have any shared edges in

.

SUBDUE, by Holder et al. [ 24 ], operates on a single graph and was one of the first

developed FGM tools. It uses a greedy candidate generation approach and bounds

the cost of doing isomorphism checks by doing inexact graph matches. Due to its

greedy nature and inexact matches, SUBDUE is an approximate algorithm that does

not find the complete set of frequent subgraphs. Kuramochi and Karypis [ 30 ] also

addressed the scenario of

G

being a single graph. They presented two algorithms,

Horizontal Single Graph Miner (HSIGRAM) and Vertical Single Graph Miner (VSI-

GRAM), which are candidate generation and pattern growth methods, respectively.

HSIGRAM and VSIGRAM are complete FGM methods that also have the ability to

do approximate counting. In order to reduce the time for frequency counting on each

lattice level, HSIGRAM makes use of a modified form of embedding lists: instead of

storing each embedding found in

G

, a single edge called the anchor-edge is stored.

This significantly reduces the memory expense of full embedding lists at the cost of

some recomputation for each embedding.

G

6.2

Parallel Frequent Graph Mining

The exponential cost of subgraph isomorphism inherent in FGM makes parallelism

for this problem vital for Big Data input. We now describe existing parallel FGM

(PFGM) methods, focusing on how they address the issues of memory scalability,

work partitioning, and dynamic load balancing.

6.2.1

Memory Scalability

Many of the PFIM and PFSM techniques for achieving memory scalability are also

applicable in PFGM algorithms. Methods commonly require access to the entirety

of

. This has performance implications for both shared and distributed memory

systems. Meinl et al. [ 37 ] created parallel versions of gSpan and MoFa for a shared

memory system. In both algorithms,

G

is globally accessible to all processes, but

could be split across multiple machines, causing unexpected network delays. Ad-

ditionally, MoFa must be able to keep track of the already generated subgraphs in

order to avoid generating duplicates.

Cook et al. [ 11 ] developed Static Partitioning SUBDUE (SP-SUBDUE), a par-

allel, message-passing implementation of SUBDUE. Like its serial predecessor,

SP-SUBDUE expects

G

to be a single large graph. It uses a weighted graph par-

titioning algorithm [ 28 ] to divide

G

into as many partitions as there are processes.

Each process can then store only their local portion of

G

in memory. The conse-

quence of partitioning

G

is that any frequent subgraphs that exist across partition

Frequent Pattern Mining

Search WWH ::

Custom Search

Home