Computational Approaches to Fragment and Substructure Discovery and Evaluation - Fragment-Based Drug Discovery

Chemistry Reference

In-Depth Information

same or newer nodes. Nonetheless, many duplicates are generated, with time-consuming

isomorphism tests as a consequence. Two extensions exist for MoFa; the first treats rings

as single units and the other treats chains of arbitrary length as a single unit. One of the

advantages of treating rings as single units becomes clear when fragmenting steroid struc-

tures. Normally, MoFa considers more than 300 000 fragments per steroid, whereas the

ring extension generates only 93 fragments. Another advantage is that the ambiguity of

aromatic bond representations in rings, either single or double, is circumvented.

In gSpan (Graph-based Substructure Pattern), [ 8 ] a canonical graph representation is used,

which is constructed from the concatenation of edge representations in the order in which

they are visited. To generate unique representations, the algorithm dictates a strict, depth-

first traversal of the subgraph lattice, hence the name 'depth-first search' code (dfs-code).

Since the string of concatenated edge/vector representations resembles the sequence of

letters in a word, graph representations are compared in the same way, that is, lexicograph-

ically. The elements of the string are sequentially compared until a mismatch is found or

if one string ends. Lower edge/vector labels precede higher ones; if all labels match, the

shorter string precedes the longer. The dfs-code of a fragment determines which nodes can

be extended, thereby restricting the number of refinements for that fragment. Only those

refinements are generated that have the smallest dfs-code. Appearance lists are used instead

of embeddings; hence subgraph isomorphism tests are still necessary for the graphs in these

appearance lists.

FFSM (Fast Frequent Subgraph Mining) [ 9 ] uses a canonical code, the Canonical Adja-

cency Matrix (CAM) code, to identify isomorphic graphs and to restrict refinement

generation. It is based on a matrix representation of the graph. By concatenating all entries

of the matrix, a string is formed that is used for lexicographic ordering of the graphs. FFSM

stores embeddings for the nodes only. In this way, embeddings are rapidly created for new

fragments made from joining or extension.

Gaston (GrAph/Sequence/Tree extractiON) [ 10 ] exploits the fact that various types of sub-

structures are contained in each other and that for the simple types more efficient algorithms

exist. First, only paths are considered in a substructure search. After that, paths are trans-

formed to trees and trees are searched. Finally, trees are transformed to general graphs with

cycles. This type of graph requires the most advanced and time consuming algorithms.

As stated before, finding subgraph isomorphisms is a laborious task compared with other

search problems and therefore time consuming. Therefore, they are only used, when they are

really needed. Gaston stores all embeddings, in order to restrict the generation of fragment

refinements to those that actually appear in the database and for isomorphism testing.

The tools have recently been compared and evaluated in the context of molecule min-

ing. [ 11 ] Wörlein et al . reimplemented all four methods (same code base, programming

expertise and optimization effort). Benchmarks were carried out on a comprehensive set

of graph databases, including molecular databases. The molecular databases used were

the IC93 (1283 compounds), [ 12 ] the HIV assays 1999 (42 689 compounds) [ 13 ] and the NCI

(237 771 compounds). [ 14 ]

The IC93 database served to investigate how the algorithms behaved when the number of

found fragments and the fragments themselves become large. For example, a support value

of 4% resulted in 37 727 fragments, of which the largest had 22 bonds. The HIV database

served to measure performance, whereas the NCI was used to test how the algorithms scale

with increasing database size. For this, molecules were randomly divided into sets of various

Fragment-Based Drug Discovery

Search WWH ::

Custom Search

Home