Big Data Frequent Pattern Mining - Frequent Pattern Mining - page 247

Database Reference

In-Depth Information

e

c

d

c

e

b

a

c

c

d

e

c

e

d

e

a

b

d

b

a

a

d

b

b

c

b

e

a

c

c

a

b

c

Fig. 10.5 Example of the FGM problem using both a set of graphs and a large graph as input.

The embedding of the candidate subgraph in ( a ) is highlighted by bold lines and gray nodes in the

example databases in ( b ) and ( c )

be frequent. This can greatly reduce the number of infrequent subgraphs that need

to be pruned.

Pattern growth methods, which traverse the pattern lattice in depth-first order,

have been quite successful in solving the FGM problem. They have smaller memory

footprints than candidate generation ones, because only subgraphs located on the

path from the start of the lattice to the current pattern being explored need to be

kept in memory. A challenge that pattern growth methods must face, however, is the

added risk of duplicate generation during exploration.

Yan and Han [ 60 ] developed the first pattern growth FGM method, named graph-

based Substructure pattern mining (gSpan). It avoids duplicates by only expanding

subtrees which lie on the rightmost path in the depth-first traversal. A major speedup

in gSpan is obtained by searching for the next extensions to make at the same time

as executing isomorphism tests.

Molecular Fragment miner (MoFa), introduced by Borgelt and Berthold [ 6 ], is

a pattern-growth method that was designed specifically for mining molecular sub-

structures, but is also capable of general FGM. MoFa maintains embedding lists ,

which are pointers to the exact locations within

where frequent subgraphs are

found. Embedding lists trade off a large amount of memory for the need to perform

isomorphism checks across

G

when a new subgraph is explored. To avoid duplicates,

MoFa maintains a hash table of all subgraphs that have been expanded and checks

whether a graph exists in the hash table before considering it as an extension. A graph

G is hashed by first transforming it to a string representation unique to G and graphs

isomorphic to G .

Some FGM algorithms define

G

as a single large graph, instead of the typical set

of graphs. In this formulation, the support of a graph G is defined as the number

of times G appears as a subgraph in

G

. Figure 10.5 portrays the difference between

the alternate problem definitions. A complication that appears when considering

G

Next Page

Frequent Pattern Mining

Search WWH ::

Custom Search

Home