Scientific Data Analysis - Scientific Data Management


SciTegic's Pipeline Pilot, 7 Tripos' SYBYL, 8 and Chemaxon's SCREEN 9 that

provide a wide range of capabilities including database management and

searching, compound filtering, physical-chemical property calculations, SAR

modeling, and visualization. Visualization techniques are covered in detail in

Chapter 9.

In the rest of this section we first review some of the current trends in chem-

informatics and then highlight some of the techniques that we developed for

representing chemical compounds, determining their similarity, and building

classification models. Finally, we outline some of the future research directions

in cheminformatics.

8.3.1 Trends in Cheminformatics Data Mining and Modeling

Calculating the similarity between chemical compounds is fundamental to

analyzing cheminformatics data. Such analysis includes, but is not

limited to, retrieving and mining the data and building SAR models on it. To

perform this analysis effectively and efficiently, many algorithms first convert the

2D/3D structure of each compound into a descriptor-space representation 2 , 4 and then

apply various information retrieval, data-mining, statistical, and machine-

learning approaches to the transformed data. The descriptors employed range

from physicochemical property descriptors, 2 , 4 to topological descriptors derived

from the compound's molecular graph, 2 , 6 , 10 , 11 to 2D and 3D pharmacophore

descriptors that capture interactions important to protein-ligand binding. 2 , 4

Among them, hashed 2D descriptors corresponding to subgraphs of various

sizes and types (e.g., paths, trees, rings) are the most common and include

the widely used Daylight fingerprints, 6 Chemaxon fingerprints, 9 and the

extended connectivity fingerprints, 2 , 7 which were recently implemented in

SciTegic's Pipeline Pilot. 7
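
To make the idea of a hashed 2D descriptor concrete, the following is a minimal Python sketch of a path-based fingerprint. It is an illustration only and not the algorithm used by Daylight, Chemaxon, or Pipeline Pilot. The function names, the toy graph encoding (element symbols plus index pairs, with bond orders ignored), the CRC32 hash, the 256-bit length, and the use of the Tanimoto coefficient to compare two fingerprints are all assumptions made for this example.

```python
import zlib


def path_fingerprint(atoms, bonds, max_len=4, n_bits=256):
    """Return the set of 'on' bit positions for a toy molecular graph.

    atoms: list of element symbols, e.g. ["C", "C", "O"]
    bonds: list of (i, j) atom-index pairs (bond orders are ignored here)
    max_len: maximum number of atoms in an enumerated path
    """
    neighbors = {i: [] for i in range(len(atoms))}
    for i, j in bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)

    bits = set()

    def extend(path):
        labels = [atoms[k] for k in path]
        # A path and its reverse are the same subgraph: keep one canonical string.
        canonical = min("-".join(labels), "-".join(reversed(labels)))
        # Hash the path into a fixed-length bit space (collisions are possible).
        bits.add(zlib.crc32(canonical.encode()) % n_bits)
        if len(path) < max_len:
            for nxt in neighbors[path[-1]]:
                if nxt not in path:  # simple paths only, no revisited atoms
                    extend(path + [nxt])

    for start in range(len(atoms)):
        extend([start])
    return bits


def tanimoto(a, b):
    """Tanimoto coefficient between two sets of 'on' bits."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hydrogen-suppressed toy graphs for ethanol (C-C-O) and 1-propanol (C-C-C-O).
ethanol = path_fingerprint(["C", "C", "O"], [(0, 1), (1, 2)])
propanol = path_fingerprint(["C", "C", "C", "O"], [(0, 1), (1, 2), (2, 3)])
similarity = tanimoto(ethanol, propanol)
```

Every labeled path in the ethanol graph also occurs in the 1-propanol graph, so the ethanol bits form a subset of the 1-propanol bits and the two fingerprints overlap substantially. Real fingerprinting schemes also encode bond types, ring membership, and atom environments, which this sketch leaves out.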

Over the years, the approaches that have been employed to learn SAR mod-

els have evolved from the initial regression-based techniques used by Hansch

et al., to approaches that utilize more complex statistical model-estimation

procedures. These procedures include partial least squares (PLS), linear dis-

criminant analysis, Bayesian models, and approaches that employ various

machine-learning/pattern recognition methods such as recursive partitioning,

neural networks, and support vector machines. 2 , 4 Another class of methods

for building SAR models operates directly on the structure of the chemical

compound. These methods employ inductive logic programming (ILP), such

as the WARMR system, 12 or heuristics, such as the MultiCASE system, 2 to

automatically identify a small number of chemical substructures that relate to

the compounds' biological activity. These substructures are then used as descriptors. Finally,

in recent years, a new class of machine-learning techniques has been developed

that builds SAR models that measure the similarity between two compounds

by operating directly on their molecular graphs. 13 , 14 These techniques measure

the similarity by using powers of adjacency matrices, 13 calculating Markov

random walks on the underlying graphs, 13 finding the maximum common
