Information Technology Reference
In-Depth Information
of spanning trees of the considered graphs/fragments (see [1] for more extensive
explanations).
Canonical code words are used in the search as follows: during the mining
process fragments are grown by adding an edge in each step. This edge is char-
acterized by the node from which it starts, by its attribute, and by the node it
leads to. (Note that this does not imply directed edges; the “source” and “des-
tination” node are solely defined by how the extension is done: the node that
is extended is the “source” and the other node, the extension edge is incident
to, is the “destination”.) In addition, the nodes are numbered in the order in
which they are added to the fragment. Hence the search process naturally con-
structs a code word for each grown fragment, namely by simply concatenating
the descriptions of the edge extensions that led to it.
Of course, there are many possible ways of building a fragment by adding
edges, each of which leads to a differentcodeword.However,thereisobvi-
ously only one way that leads to the canonical code word (since a code word
fixes a specific order of the extensions). Hence we may choose to extend only
those fragments further that have been built in such a way that their code word
is canonical. Eliminating all other fragments is called canonical form pruning ,
which obviously eliminates all redundant search: each fragment is considered at
most once. Note that the way in which the search process builds code words also
explains why we can confine ourselves to node numberings (and thus code words)
compatible with traversals of spanning trees (as mentioned above): no other code
words can be constructed by the search.
For the rest of this paper we focus on code words resulting from node number-
ings that are obtained by breadth-first traversals of spanning trees, that is, the
canonical form that is used in MoSS/MoFa [3]. Note, however, that the described
approach is also applicable for code words resulting from node numberings that
are obtained from depth-first traversals of spanning trees, that is, the canonical
form that is used in gSpan [16] or Closegraph [17]. The necessary adaptations of
the procedure are straightforward and thus not described in detail (they mainly
concern the form of edge descriptions).
A breadth-first code word has the general form a ( i s bai d ) m ,where a is node
attribute, b an edge attribute, i s the index (or number) of the source node of
an edge, and i d the index (or number) of the destination node of an edge (by
definition, it is always i s <i d ). The letter m denotes the number of edges of the
fragment. Each parenthesized expression describes one edge.
As an example, consider the left molecule shown in Figure 11.1. If this molecule
is built from left to right, that is, if we choose the left oxygen atom as the root
of a spanning tree, a possible code is O 0-S1 1-C2 2=O3 2-N4 . As can be seen
from this code word, the bond added first is the one from the oxygen atom
(index 0) to the sulfur atom (index 1), the bond added last is the one from
the carbon atom (index 2) to the nitrogen atom (index 4). However, there is
another possibility of building the same fragment, which leads to the code word
O 0-S1 1-C2 2-N3 2=O4 (that is, the last two bonds are added in inverse order).
If these two code words are compared lexicographically, the latter is smaller than
 
Search WWH ::




Custom Search