Database Reference
In-Depth Information
e
c
d
c
e
b
a
c
c
d
e
c
e
d
e
a
b
d
b
a
a
d
b
b
c
b
e
a
c
c
a
b
c
Fig. 10.5 Example of the FGM problem using both a set of graphs and a large graph as input.
The embedding of the candidate subgraph in ( a ) is highlighted by bold lines and gray nodes in the
example databases in ( b ) and ( c )
be frequent. This can greatly reduce the number of infrequent subgraphs that need
to be pruned.
Pattern growth methods, which traverse the pattern lattice in depth-first order,
have been quite successful in solving the FGM problem. They have smaller memory
footprints than candidate generation ones, because only subgraphs located on the
path from the start of the lattice to the current pattern being explored need to be
kept in memory. A challenge that pattern growth methods must face, however, is the
added risk of duplicate generation during exploration.
Yan and Han [ 60 ] developed the first pattern growth FGM method, named graph-
based Substructure pattern mining (gSpan). It avoids duplicates by only expanding
subtrees which lie on the rightmost path in the depth-first traversal. A major speedup
in gSpan is obtained by searching for the next extensions to make at the same time
as executing isomorphism tests.
Molecular Fragment miner (MoFa), introduced by Borgelt and Berthold [ 6 ], is
a pattern-growth method that was designed specifically for mining molecular sub-
structures, but is also capable of general FGM. MoFa maintains embedding lists ,
which are pointers to the exact locations within
where frequent subgraphs are
found. Embedding lists trade off a large amount of memory for the need to perform
isomorphism checks across
G
when a new subgraph is explored. To avoid duplicates,
MoFa maintains a hash table of all subgraphs that have been expanded and checks
whether a graph exists in the hash table before considering it as an extension. A graph
G is hashed by first transforming it to a string representation unique to G and graphs
isomorphic to G .
Some FGM algorithms define
G
as a single large graph, instead of the typical set
of graphs. In this formulation, the support of a graph G is defined as the number
of times G appears as a subgraph in
G
. Figure 10.5 portrays the difference between
the alternate problem definitions. A complication that appears when considering
G
Search WWH ::




Custom Search