Database Reference
In-Depth Information
subgraph, 15 or using weighted substructure matching between two graphs. 14
These similarity measures are then used as kernels, and the SAR models are
built using kernel-based learning algorithms that include kernel PCA, 16 kernel
Fisher discriminant analysis, 16 and support vector machines. 2 The advantage
of these techniques is that they directly compute similarity values for all pairs
of compounds and eliminate the step of generating descriptors for chemical
compounds. However, such methods are inherently less descriptive as it is not
possible to analyze the different features of the molecular graph that might
be important for activity.
8.3.2 Structural Descriptors for Chemical Compounds
Descriptor-based representations of chemical compounds are used extensively
in cheminformatics, as they represent a convenient and computationally e-
cient way to capture key characteristics of the structure of the compounds.
Such representations have extensive applications to similarity search and var-
ious structure-driven prediction problems (e.g., activity, toxicity, absorption,
distribution, metabolism, and excretion). As part of our research, we have
developed novel structural descriptors corresponding to the set of frequent
topological subgraphs 17 and all bounded-size subgraphs 11 that are present
in a compound library. The key advantage of these descriptors is that un-
like structural descriptors that use path-based (e.g., Daylight 6 ) or extended
connectivity fingerprints, 2
they impose no limit on the complexity of the de-
scriptor's structure. 11 , 18
8.3.2.1
Descriptors Based on Bounded-Size Subgraphs
We have developed a new descriptor space that consists of all connected sub-
graphs up to a given length l that exist in a compound library. 11 We refer to
this descriptor space as graph fragments (GF). This descriptor space is deter-
mined dynamically from the library (i.e., a database of chemical compounds)
and consists of features that have complex topologies. We have also devel-
oped an algorithm to generate the GF descriptors of a library by eciently
enumerating all bounded-size subgraphs using a recursive technique initially
developed for generating all the spanning trees in a graph. 11 Moreover, the
algorithm uses the ecient canonical labeling algorithm that we developed 17
to identify isomorphic fragments in the same graph. In addition, if desired, the
fragment's canonical labeling is also used to count the number of occurrences
(embeddings) of a fragment in a molecular graph.
We have also identified the key characteristics that make certain two-
dimensional descriptors better than others. In order to establish these char-
acteristics, we compared the GF descriptors and three of its subsets (AF,
TF, and PF) 11 to five previously developed descriptors (Chemaxon's finger-
prints (fp), 9 extended connectivity fingerprints (ECFP), 2 Maccs keys (MK), 10
Cycles and Trees (CT), 19
and frequent subgraph-based descriptors (FS). 17 , 18
Search WWH ::




Custom Search