Scientific Data Analysis - Scientific Data Management

Database Reference

In-Depth Information

subgraph, 15 or using weighted substructure matching between two graphs. 14

These similarity measures are then used as kernels, and the SAR models are

built using kernel-based learning algorithms that include kernel PCA, 16 kernel

Fisher discriminant analysis, 16 and support vector machines. 2 The advantage

of these techniques is that they directly compute similarity values for all pairs

of compounds and eliminate the step of generating descriptors for chemical

compounds. However, such methods are inherently less descriptive as it is not

possible to analyze the different features of the molecular graph that might

be important for activity.

8.3.2 Structural Descriptors for Chemical Compounds

Descriptor-based representations of chemical compounds are used extensively

in cheminformatics, as they represent a convenient and computationally e-

cient way to capture key characteristics of the structure of the compounds.

Such representations have extensive applications to similarity search and var-

ious structure-driven prediction problems (e.g., activity, toxicity, absorption,

distribution, metabolism, and excretion). As part of our research, we have

developed novel structural descriptors corresponding to the set of frequent

topological subgraphs 17 and all bounded-size subgraphs 11 that are present

in a compound library. The key advantage of these descriptors is that un-

like structural descriptors that use path-based (e.g., Daylight 6 ) or extended

connectivity fingerprints, 2

they impose no limit on the complexity of the de-

scriptor's structure. 11 , 18

8.3.2.1

Descriptors Based on Bounded-Size Subgraphs

We have developed a new descriptor space that consists of all connected sub-

graphs up to a given length l that exist in a compound library. 11 We refer to

this descriptor space as graph fragments (GF). This descriptor space is deter-

mined dynamically from the library (i.e., a database of chemical compounds)

and consists of features that have complex topologies. We have also devel-

oped an algorithm to generate the GF descriptors of a library by eciently

enumerating all bounded-size subgraphs using a recursive technique initially

developed for generating all the spanning trees in a graph. 11 Moreover, the

algorithm uses the ecient canonical labeling algorithm that we developed 17

to identify isomorphic fragments in the same graph. In addition, if desired, the

fragment's canonical labeling is also used to count the number of occurrences

(embeddings) of a fragment in a molecular graph.

We have also identified the key characteristics that make certain two-

dimensional descriptors better than others. In order to establish these char-

acteristics, we compared the GF descriptors and three of its subsets (AF,

TF, and PF) 11 to five previously developed descriptors (Chemaxon's finger-

prints (fp), 9 extended connectivity fingerprints (ECFP), 2 Maccs keys (MK), 10

Cycles and Trees (CT), 19

and frequent subgraph-based descriptors (FS). 17 , 18

Scientific Data Management

Search WWH ::

Custom Search

Home