Full Perfect Extension Pruning for Frequent Subgraph Mining - Mining Complex Data

Information Technology Reference

In-Depth Information

of spanning trees of the considered graphs/fragments (see [1] for more extensive

explanations).

Canonical code words are used in the search as follows: during the mining

process fragments are grown by adding an edge in each step. This edge is char-

acterized by the node from which it starts, by its attribute, and by the node it

leads to. (Note that this does not imply directed edges; the “source” and “des-

tination” node are solely defined by how the extension is done: the node that

is extended is the “source” and the other node, the extension edge is incident

to, is the “destination”.) In addition, the nodes are numbered in the order in

which they are added to the fragment. Hence the search process naturally con-

structs a code word for each grown fragment, namely by simply concatenating

the descriptions of the edge extensions that led to it.

Of course, there are many possible ways of building a fragment by adding

edges, each of which leads to a differentcodeword.However,thereisobvi-

ously only one way that leads to the canonical code word (since a code word

fixes a specific order of the extensions). Hence we may choose to extend only

those fragments further that have been built in such a way that their code word

is canonical. Eliminating all other fragments is called canonical form pruning ,

which obviously eliminates all redundant search: each fragment is considered at

most once. Note that the way in which the search process builds code words also

explains why we can confine ourselves to node numberings (and thus code words)

compatible with traversals of spanning trees (as mentioned above): no other code

words can be constructed by the search.

For the rest of this paper we focus on code words resulting from node number-

ings that are obtained by breadth-first traversals of spanning trees, that is, the

canonical form that is used in MoSS/MoFa [3]. Note, however, that the described

approach is also applicable for code words resulting from node numberings that

are obtained from depth-first traversals of spanning trees, that is, the canonical

form that is used in gSpan [16] or Closegraph [17]. The necessary adaptations of

the procedure are straightforward and thus not described in detail (they mainly

concern the form of edge descriptions).

A breadth-first code word has the general form a ( i s bai d ) m ,where a is node

attribute, b an edge attribute, i s the index (or number) of the source node of

an edge, and i d the index (or number) of the destination node of an edge (by

definition, it is always i s <i d ). The letter m denotes the number of edges of the

fragment. Each parenthesized expression describes one edge.

As an example, consider the left molecule shown in Figure 11.1. If this molecule

is built from left to right, that is, if we choose the left oxygen atom as the root

of a spanning tree, a possible code is O 0-S1 1-C2 2=O3 2-N4 . As can be seen

from this code word, the bond added first is the one from the oxygen atom

(index 0) to the sulfur atom (index 1), the bond added last is the one from

the carbon atom (index 2) to the nitrogen atom (index 4). However, there is

another possibility of building the same fragment, which leads to the code word

O 0-S1 1-C2 2-N3 2=O4 (that is, the last two bonds are added in inverse order).

If these two code words are compared lexicographically, the latter is smaller than

Mining Complex Data

Search WWH ::

Custom Search

Home