Chemistry Reference
In-Depth Information
Table 8.1 Comparison of
four
frequent
substructure-mining algorithms
in terms of
performance. a
Algorithm
Runtime (minimum support value)
Memory usage (GB)
(minimum support value)
5%
20%
5%
20%
Total
Per fragment
Total
Per fragment
(min)
found (s)
(min)
found (s)
Gaston [10]
7.4
0.1
2.5
0.4
1.3
0.9
gSpan [8]
19
0.3
4.5
0.8
0.3
0.2
FFSM [9]
19
0.3
8.3
1.5
1.2
0.9
MoFa [7]
80
1.1
11.8
2.2
0.6
0.6
a Performance was measured by applying each algorithm to the NCI HIV database (42 689 compounds). Runtime and
memory usage are provided for two support thresholds: 5 and 20%. The runtime per fragment found is also provided
to correct for the runtime overhead due to the higher number of fragments at lower support values. Data taken from
performance charts from Wörlein et al. [11]
sizes. Sample measurements are provided for illustrating the quantitative comparison of the
algorithms. Table 8.1 lists the performance measurements for the algorithms applied to the
HIV data. The runtime of the algorithms increases with lower support values. Gaston was
the fastest and MoFa the slowest algorithm. However, Gaston used the largest amount of
memory, whereas MoFa needed less. gSpan had the lowest memory requirements. Note that
these figures may differ for other data sets. Size and contents of the database, the minimum
support value, and also implementation details and even the underlying hardware architec-
ture may influence the performance of the algorithm. The data in Table 8.1 are indicative
of the overall outcome of the quantitative comparison. For all algorithms, lower support
values resulted in an exponential increase in runtime. This is probably due to the runtime
overhead caused by the exponential rise in found fragments at lower support values. The
benchmark results permitted a ranking of methods. gSpan needed the least memory, since it
does not use embedding lists. MoFa, which stores only one subgraph embedding per node
in the search tree, was also memory efficient. FFSM required more memory than gSpan and
MoFa, probably because it stores the main subgraphs together in a node in the search tree.
Gaston needed most memory, since with this method embedding lists for new fragments
are based on those of 'parent' fragments. Extensions to the parent's list are stored with the
'children'. The size of embedding lists also depends on the number of children per fragment.
In terms of runtime, Gaston was always the fastest algorithm, except at lower support
values on the complete NCI. The gSpan algorithm was faster than FFSM for the large
datasets, although FFSM was faster than gSpan for the IC93 dataset. Embedding lists are
not used in gSpan, which, in fact, speed up testing, especially for larger fragments. MoFa
was always the slowest algorithm. The authors suggested that the slowdown of Gaston at
lower support values on the complete NCI was due to the large amount of bookkeeping
related to the vast number of embeddings. This results in a slowdown due to memory
operations. However, the authors found that this effect varies for different systems. Some
memory architectures penalize thememory-intensive operations of Gaston.AlthoughMoFa
was the slowest in all tests, it offers more functionality for molecular databases, e.g. there is
an extension for treating rings as single entities, as mentioned above. [ 15 ] Another extension
 
Search WWH ::




Custom Search