Information Technology Reference
In-Depth Information
the former (assuming that single bonds are “smaller” than double bonds, that is,
- < = ). Therefore we can conclude that the first code word is not the canonical
code word, but neither is the second. The canonical code word for this molecule
is actually S 0-C1 0-O2 1-N3 1=O4 (if we use the element order S < C < N < O
and the bond order - < = , as we also do for all following examples in order to
avoid confusion).
Note, however, that the canonical code word is C 0-N1 0-S2 0=O3 2-O4 if we
use the order of the periodic table of elements (that is, C < N < O < S , together
with - < = ), showing that which code word is the canonical one also depends
on the order we choose for the node and edge attributes. Empirical evidence
suggests that it is recommendable to use an order that reflects the frequency
of the attributes in the graph database to mine (less frequent attributes should
precede more frequent ones), as this usually leads to fewer generated fragments
and thus shorter search times.
Note also that (canonical) code words for graph fragments provide a natural
way of ordering the fragments in the search tree: the children of a search tree
node are listed from left to right in the order of lexicographically increasing code
words. This makes precise what we mean by “to the left” and “to the right” of
a search tree branch: “to the left” are fragments with smaller, “to the right” are
fragments with greater (canonical) code words.
Checking whether a given code word is canonical usually requires testing all
possible code words for a fragment (at least w.r.t. all possible node numberings
resulting from traversals of spanning trees) and thus has essentially the same
complexity as a graph isomorphism test. (Pseudo-code for such a canonical form
check can be found, for example, in [1].) Nevertheless, canonical code words are
very effective in pruning the search tree, because they use “global” information in
contrast to only “local” rules, as they were used originally in [3]. These “local” or
“simple” rules, however, can still be applied to support canonical form pruning,
as they specify necessary (though not sucient) prerequisites for code words
to be canonical, which can be tested very eciently and help to avoid a costly
canonical form test in many cases.
For example, if we use a breadth-first (spanning tree traversal) canonical form
(as it was described above), one may not extend a node that has an index smaller
than another node in the fragment, which has already been extended ( maximum
source extension : only nodes with an index no less than the maximum source
index may be extended). The reason is that an extension violating this rule
necessarily leads to a non-canonical code word, as can easily be checked with a
spanning tree rooted at the same node.
As an example consider the fragment S-C-N in the search tree in Figure 11.2:
this fragment may not be extended by an edge from the sulfur atom (index 0)
to an oxygen atom, because an atom with a higher index, namely the car-
bon atom (index 1), has already been extended (by attaching the nitrogen
atom). Indeed, if we add such an edge, the code word of the resulting frag-
ment is S 0-C1 1-N2 0-O3 , while the canonical code word for this fragment is
S 0-C1 0-O2 1-N3 (using again the element order S < C < N < O ).
 
Search WWH ::




Custom Search