Chemistry Reference
In-Depth Information
same or newer nodes. Nonetheless, many duplicates are generated, with time-consuming
isomorphism tests as a consequence. Two extensions exist for MoFa; the first treats rings
as single units and the other treats chains of arbitrary length as a single unit. One of the
advantages of treating rings as single units becomes clear when fragmenting steroid struc-
tures. Normally, MoFa considers more than 300 000 fragments per steroid, whereas the
ring extension generates only 93 fragments. Another advantage is that the ambiguity of
aromatic bond representations in rings, either single or double, is circumvented.
In gSpan (Graph-based Substructure Pattern), [ 8 ] a canonical graph representation is used,
which is constructed from the concatenation of edge representations in the order in which
they are visited. To generate unique representations, the algorithm dictates a strict, depth-
first traversal of the subgraph lattice, hence the name 'depth-first search' code (dfs-code).
Since the string of concatenated edge/vector representations resembles the sequence of
letters in a word, graph representations are compared in the same way, that is, lexicograph-
ically. The elements of the string are sequentially compared until a mismatch is found or
if one string ends. Lower edge/vector labels precede higher ones; if all labels match, the
shorter string precedes the longer. The dfs-code of a fragment determines which nodes can
be extended, thereby restricting the number of refinements for that fragment. Only those
refinements are generated that have the smallest dfs-code. Appearance lists are used instead
of embeddings; hence subgraph isomorphism tests are still necessary for the graphs in these
appearance lists.
FFSM (Fast Frequent Subgraph Mining) [ 9 ] uses a canonical code, the Canonical Adja-
cency Matrix (CAM) code, to identify isomorphic graphs and to restrict refinement
generation. It is based on a matrix representation of the graph. By concatenating all entries
of the matrix, a string is formed that is used for lexicographic ordering of the graphs. FFSM
stores embeddings for the nodes only. In this way, embeddings are rapidly created for new
fragments made from joining or extension.
Gaston (GrAph/Sequence/Tree extractiON) [ 10 ] exploits the fact that various types of sub-
structures are contained in each other and that for the simple types more efficient algorithms
exist. First, only paths are considered in a substructure search. After that, paths are trans-
formed to trees and trees are searched. Finally, trees are transformed to general graphs with
cycles. This type of graph requires the most advanced and time consuming algorithms.
As stated before, finding subgraph isomorphisms is a laborious task compared with other
search problems and therefore time consuming. Therefore, they are only used, when they are
really needed. Gaston stores all embeddings, in order to restrict the generation of fragment
refinements to those that actually appear in the database and for isomorphism testing.
The tools have recently been compared and evaluated in the context of molecule min-
ing. [ 11 ] Wörlein et al . reimplemented all four methods (same code base, programming
expertise and optimization effort). Benchmarks were carried out on a comprehensive set
of graph databases, including molecular databases. The molecular databases used were
the IC93 (1283 compounds), [ 12 ] the HIV assays 1999 (42 689 compounds) [ 13 ] and the NCI
(237 771 compounds). [ 14 ]
The IC93 database served to investigate how the algorithms behaved when the number of
found fragments and the fragments themselves become large. For example, a support value
of 4% resulted in 37 727 fragments, of which the largest had 22 bonds. The HIV database
served to measure performance, whereas the NCI was used to test how the algorithms scale
with increasing database size. For this, molecules were randomly divided into sets of various
Search WWH ::




Custom Search