TABLE 8.1
SAR performance of different descriptors

Datasets   GF     ECFP   fp     MK     FS
NCI1       0.33   0.32   0.30   0.29   0.27
NCI109     0.32   0.32   0.27   0.24   0.26
NCI123     0.27   0.27   0.25   0.24   0.23
NCI145     0.37   0.35   0.30   0.28   0.30
NCI167     0.07   0.06   0.06   0.04   0.06
NCI220     0.29   0.28   0.33   0.26   0.21
NCI33      0.33   0.31   0.26   0.26   0.25
NCI330     0.36   0.36   0.34   0.31   0.24
NCI41      0.36   0.36   0.25   0.28   0.30
NCI47      0.31   0.31   0.26   0.26   0.24
NCI81      0.28   0.28   0.27   0.25   0.24
NCI83      0.31   0.31   0.26   0.26   0.25

The numbers correspond to the ROC50 values of SVM-based SAR models for twelve
screening assays obtained from the National Cancer Institute (NCI). The ROC50
value is the area under the receiver operating characteristic (ROC) curve up to
the first 50 false positives. These values were computed using five-fold
cross-validation. The descriptors being evaluated are graph fragments (GF) [11],
extended connectivity fingerprints (ECFP) [2], ChemAxon's fingerprints (fp)
(ChemAxon Inc.) [9], MACCS keys (MK) (MDL Information Systems Inc.) [10], and
frequent subgraphs (FS) [18].
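
The ROC50 measure described in the caption can be illustrated with a minimal sketch. The function name `roc_n`, the normalization to the range 0 to 1, and the handling of tied scores (input order) are assumptions made for illustration; the source states only that the value is the area under the ROC curve up to the first 50 false positives.

```python
def roc_n(scores, labels, n=50):
    """Normalized area under the ROC curve up to the first n false positives.

    scores: classifier outputs, higher means "more likely active".
    labels: 1 for active (positive) compounds, 0 for inactive ones.
    Assumption: the area is divided by (n_eff * number of positives), where
    n_eff = min(n, number of negatives), so a perfect ranking scores 1.0.
    """
    # Rank compounds from highest to lowest score (ties keep input order).
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total_pos = sum(1 for _, y in ranked if y == 1)
    total_neg = len(ranked) - total_pos
    if total_pos == 0 or total_neg == 0:
        raise ValueError("need at least one positive and one negative example")

    n_eff = min(n, total_neg)
    true_pos = 0
    false_pos = 0
    area = 0
    for _, y in ranked:
        if y == 1:
            true_pos += 1
        else:
            # Each false positive adds the number of true positives ranked above it.
            false_pos += 1
            area += true_pos
            if false_pos == n_eff:
                break
    return area / (n_eff * total_pos)


if __name__ == "__main__":
    # Toy ranking: active, inactive, active, inactive, active.
    print(roc_n([0.9, 0.8, 0.7, 0.6, 0.5], [1, 0, 1, 0, 1], n=2))
```

In a five-fold cross-validation setting such as the one in the caption, a function like this would be applied to the held-out predictions; how the per-fold values were combined is not stated in the source.
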
We observed that descriptors that are determined dynamically from the
dataset and that use fragments with simple and complex topologies lead to
precise representations. They also have a high degree of coverage and can be
expected to perform better in chemical compound classification and retrieval,
because they allow for a better representation of the underlying compounds [11].
The descriptor space that satisfies all the desirable characteristics is
GF [11]. ECFP satisfies virtually all of the characteristics except precise
representation, since collisions are possible, although in practice their
likelihood is quite low. The quantitative and statistical results on the
performance of each of these descriptors on 28 datasets were consistent with
our qualitative analysis [11]. Table 8.1 shows a subset of our results for the
NCI datasets obtained from the PubChem Project [20]. These results show that
the GF descriptor space achieves performance that is better than or comparable
to that of currently used descriptors, indicating that GF descriptors can
effectively capture the structural characteristics of the compounds.
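
The "better or comparable" reading of Table 8.1 can be checked row by row with a short sketch. The values are copied from the table; the helper names `gf_strict_wins` and `gf_shortfalls` are hypothetical and introduced only for this illustration.

```python
# ROC50 values copied from Table 8.1, ordered as (GF, ECFP, fp, MK, FS).
TABLE_8_1 = {
    "NCI1":   (0.33, 0.32, 0.30, 0.29, 0.27),
    "NCI109": (0.32, 0.32, 0.27, 0.24, 0.26),
    "NCI123": (0.27, 0.27, 0.25, 0.24, 0.23),
    "NCI145": (0.37, 0.35, 0.30, 0.28, 0.30),
    "NCI167": (0.07, 0.06, 0.06, 0.04, 0.06),
    "NCI220": (0.29, 0.28, 0.33, 0.26, 0.21),
    "NCI33":  (0.33, 0.31, 0.26, 0.26, 0.25),
    "NCI330": (0.36, 0.36, 0.34, 0.31, 0.24),
    "NCI41":  (0.36, 0.36, 0.25, 0.28, 0.30),
    "NCI47":  (0.31, 0.31, 0.26, 0.26, 0.24),
    "NCI81":  (0.28, 0.28, 0.27, 0.25, 0.24),
    "NCI83":  (0.31, 0.31, 0.26, 0.26, 0.25),
}


def gf_strict_wins(table):
    """Datasets where GF scores strictly higher than every other descriptor."""
    return [name for name, (gf, *others) in table.items() if gf > max(others)]


def gf_shortfalls(table):
    """Datasets where at least one other descriptor scores strictly higher than GF."""
    return [name for name, (gf, *others) in table.items() if max(others) > gf]


if __name__ == "__main__":
    print("GF strictly best:", gf_strict_wins(TABLE_8_1))
    print("GF exceeded:", gf_shortfalls(TABLE_8_1))
```

On these twelve rows, GF is strictly best on NCI1, NCI145, NCI167, and NCI33, ties the best value on seven datasets, and is exceeded only on NCI220, where fp reaches 0.33 against 0.29 for GF. The comparison uses the two-decimal values as printed and says nothing about statistical significance.
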
8.3.3 Indirect Similarity Measures for Similarity Searching and Scaffold Hopping
The task of searching a library to find compounds that are similar to a query
is performed extensively in cheminformatics. These compounds are termed
hit compounds or hits. In order to identify these hits, the methods employed