Computational Approaches to Fragment and Substructure Discovery and Evaluation - Fragment-Based Drug Discovery

Chemistry Reference

In-Depth Information

Table 8.1 Comparison of

four

frequent

substructure-mining algorithms

in terms of

performance. a

Algorithm

Runtime (minimum support value)

Memory usage (GB)

(minimum support value)

20%

Total

Per fragment

Total

Per fragment

(min)

found (s)

(min)

found (s)

Gaston [10]

7.4

0.1

2.5

0.4

1.3

0.9

gSpan [8]

0.3

4.5

0.8

0.3

0.2

FFSM [9]

0.3

8.3

1.5

1.2

0.9

MoFa [7]

1.1

11.8

2.2

0.6

a Performance was measured by applying each algorithm to the NCI HIV database (42 689 compounds). Runtime and

memory usage are provided for two support thresholds: 5 and 20%. The runtime per fragment found is also provided

to correct for the runtime overhead due to the higher number of fragments at lower support values. Data taken from

performance charts from Wörlein et al. [11]

sizes. Sample measurements are provided for illustrating the quantitative comparison of the

algorithms. Table 8.1 lists the performance measurements for the algorithms applied to the

HIV data. The runtime of the algorithms increases with lower support values. Gaston was

the fastest and MoFa the slowest algorithm. However, Gaston used the largest amount of

memory, whereas MoFa needed less. gSpan had the lowest memory requirements. Note that

these figures may differ for other data sets. Size and contents of the database, the minimum

support value, and also implementation details and even the underlying hardware architec-

ture may influence the performance of the algorithm. The data in Table 8.1 are indicative

of the overall outcome of the quantitative comparison. For all algorithms, lower support

values resulted in an exponential increase in runtime. This is probably due to the runtime

overhead caused by the exponential rise in found fragments at lower support values. The

benchmark results permitted a ranking of methods. gSpan needed the least memory, since it

does not use embedding lists. MoFa, which stores only one subgraph embedding per node

in the search tree, was also memory efficient. FFSM required more memory than gSpan and

MoFa, probably because it stores the main subgraphs together in a node in the search tree.

Gaston needed most memory, since with this method embedding lists for new fragments

are based on those of 'parent' fragments. Extensions to the parent's list are stored with the

'children'. The size of embedding lists also depends on the number of children per fragment.

In terms of runtime, Gaston was always the fastest algorithm, except at lower support

values on the complete NCI. The gSpan algorithm was faster than FFSM for the large

datasets, although FFSM was faster than gSpan for the IC93 dataset. Embedding lists are

not used in gSpan, which, in fact, speed up testing, especially for larger fragments. MoFa

was always the slowest algorithm. The authors suggested that the slowdown of Gaston at

lower support values on the complete NCI was due to the large amount of bookkeeping

related to the vast number of embeddings. This results in a slowdown due to memory

operations. However, the authors found that this effect varies for different systems. Some

memory architectures penalize thememory-intensive operations of Gaston.AlthoughMoFa

was the slowest in all tests, it offers more functionality for molecular databases, e.g. there is

an extension for treating rings as single entities, as mentioned above. [ 15 ] Another extension

Fragment-Based Drug Discovery

Search WWH ::

Custom Search

Home