Information Technology Reference
In-Depth Information
which the bonds have been added. Since this code word is not canonical, the
fragment would be pruned and neither extended nor reported.
In order to avoid this, we allow for a (strictly limited) reorganization of code
words as they result from a search tree, which takes care of the fact that perfect
extension edges may have been added earlier than required by the canonical
form. Technically, we split the code word into two parts: the first, fixed part
consists of the (possibly empty) prefix up to and including the last edge that
was added by a non-perfect extension or by a perfect extension with no search
tree branches to the left of it. The second, volatile part consists of the remaining
sux of the code word, which is made up only of perfect extensions edges, which
had search tree branches to the left of it.
Note that we can check for the existence of branches to the left of a perfect
extension branch after minimum support pruning , that is, after eliminating all
fragments that occur in less than the user-specified minimum number of database
graphs. The reason is that we can be sure that extensions leading to infrequent
fragments in branches to the left will also lead to infrequent fragments in the
perfect extension branch or in branches to the right of it and thus need not be
considered in these branches.
The construction of the code word for an extended fragment is modified as
follows: instead of always simply appending the description of the extension edge
to the end of a code word, the description of the new edge may now be inserted
anywhere in or even before the volatile part, but not in the fixed part. We may
imagine this as first appending the new edge description and then shifting it to
the left, as long as this makes the code word lexicographically smaller, but the
new edge description does not enter the fixed part.
Note, however, that “shifting” an edge in the code word can make it necessary
to renumber the nodes. For example, if in the fragment O-S-C-N the bond added
last in the search (that is, the bond from the sulfur atom to the oxygen atom) is
shifted left past the perfect extension bond (that is, the bond from the carbon
atom to the nitrogen atom), the oxygen and the nitrogen atom get new indices.
The reason is that the nodes must be numbered in the order in which they would
be added if the edges were added in the order in which their descriptions are
listed in the (reorganized) code word (see Figure 11.8).
Technically, we achieve this renumbering as follows: instead of actually shifting
the extension edge from right to left, we rebuild the code word from left to right.
First we traverse the fixed part, numbering all nodes in the order in which they
are met. Then we continue with the volatile part until at least one of the two
1. Base fragment: S-C-N
canonical code: S 0-C1 1-N2
2. Extension to O-S-C-N
code: S 0-C1 1-N2 0-O3 (not canonical!)
3. Shift the non-perfect extension
code: S 0-C1 0-O3 1-N2
4. Renumber nodes
canonical code: S 0-C1 0-O2 1-N3
Fig. 11.8. Fixing a fragment's code word by shifting a non-canonical extension over
perfect extensions (marked in gray) to the proper place and renumbering the nodes
 
Search WWH ::




Custom Search