Full Perfect Extension Pruning for Frequent Subgraph Mining - Mining Complex Data

Information Technology Reference

In-Depth Information

the former (assuming that single bonds are “smaller” than double bonds, that is,

- < = ). Therefore we can conclude that the first code word is not the canonical

code word, but neither is the second. The canonical code word for this molecule

is actually S 0-C1 0-O2 1-N3 1=O4 (if we use the element order S < C < N < O

and the bond order - < = , as we also do for all following examples in order to

avoid confusion).

Note, however, that the canonical code word is C 0-N1 0-S2 0=O3 2-O4 if we

use the order of the periodic table of elements (that is, C < N < O < S , together

with - < = ), showing that which code word is the canonical one also depends

on the order we choose for the node and edge attributes. Empirical evidence

suggests that it is recommendable to use an order that reflects the frequency

of the attributes in the graph database to mine (less frequent attributes should

precede more frequent ones), as this usually leads to fewer generated fragments

and thus shorter search times.

Note also that (canonical) code words for graph fragments provide a natural

way of ordering the fragments in the search tree: the children of a search tree

node are listed from left to right in the order of lexicographically increasing code

words. This makes precise what we mean by “to the left” and “to the right” of

a search tree branch: “to the left” are fragments with smaller, “to the right” are

fragments with greater (canonical) code words.

Checking whether a given code word is canonical usually requires testing all

possible code words for a fragment (at least w.r.t. all possible node numberings

resulting from traversals of spanning trees) and thus has essentially the same

complexity as a graph isomorphism test. (Pseudo-code for such a canonical form

check can be found, for example, in [1].) Nevertheless, canonical code words are

very effective in pruning the search tree, because they use “global” information in

contrast to only “local” rules, as they were used originally in [3]. These “local” or

“simple” rules, however, can still be applied to support canonical form pruning,

as they specify necessary (though not sucient) prerequisites for code words

to be canonical, which can be tested very eciently and help to avoid a costly

canonical form test in many cases.

For example, if we use a breadth-first (spanning tree traversal) canonical form

(as it was described above), one may not extend a node that has an index smaller

than another node in the fragment, which has already been extended ( maximum

source extension : only nodes with an index no less than the maximum source

index may be extended). The reason is that an extension violating this rule

necessarily leads to a non-canonical code word, as can easily be checked with a

spanning tree rooted at the same node.

As an example consider the fragment S-C-N in the search tree in Figure 11.2:

this fragment may not be extended by an edge from the sulfur atom (index 0)

to an oxygen atom, because an atom with a higher index, namely the car-

bon atom (index 1), has already been extended (by attaching the nitrogen

atom). Indeed, if we add such an edge, the code word of the resulting frag-

ment is S 0-C1 1-N2 0-O3 , while the canonical code word for this fragment is

S 0-C1 0-O2 1-N3 (using again the element order S < C < N < O ).

Mining Complex Data

Search WWH ::

Custom Search

Home