SciTegic's Pipeline Pilot, 7 Tripos' SYBYL, 8 and Chemaxon's SCREEN 9 that
provide a wide range of capabilities including database management and
searching, compound filtering, physical-chemical property calculations, SAR
modeling, and visualization. Visualization techniques are covered in detail in
Chapter 9.
In the rest of this section we first review some of the current trends in
cheminformatics and then highlight some of the techniques that we developed for
representing chemical compounds, determining their similarity, and building
classification models. Finally, we outline some of the future research directions
in cheminformatics.
8.3.1 Trends in Cheminformatics Data Mining and Modeling
Calculating the similarity between chemical compounds is fundamental to
analyzing cheminformatics data. Such analysis includes, but is not limited to,
retrieving, mining, and building SAR models on the data. To perform this
analysis effectively and efficiently, many algorithms first convert the
2D/3D structure into a descriptor space or descriptor representation 2,4 and
then apply various information retrieval, data-mining, statistical, and
machine-learning approaches to the transformed data. The descriptors employed range
from physicochemical property descriptors, 2,4 to topological descriptors derived
from the compound's molecular graph, 2,6,10,11 to 2D and 3D pharmacophore
descriptors that capture interactions important to protein-ligand binding. 2,4
Among them, hashed 2D descriptors corresponding to subgraphs of various
sizes and types (e.g., paths, trees, rings) are the most common and include
the extensively used Daylight fingerprints, 6 Chemaxon fingerprints, 9 and the
extended connectivity fingerprints 2,7 that have recently been implemented in
SciTegic's Pipeline Pilot. 7
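To make the idea of a hashed 2D descriptor concrete, the following is a minimal sketch of a path-based fingerprint together with a Tanimoto comparison. It is not the Daylight, Chemaxon, or extended connectivity algorithm: the function names (path_fingerprint, tanimoto), the 64-bit fingerprint width, the three-atom path limit, and the use of CRC32 as the hash are all assumptions chosen for illustration, and bond orders and hydrogens are ignored.

```python
import zlib


def path_fingerprint(atoms, bonds, max_len=3, n_bits=64):
    """Return the set of 'on' bit positions of a hashed path fingerprint.

    atoms: list of element symbols, e.g. ["C", "C", "O"]
    bonds: list of (i, j) atom-index pairs (bond orders are ignored here)
    max_len: maximum number of atoms in an enumerated path (assumed value)
    n_bits: fingerprint width (assumed value)
    """
    # Build an adjacency list for the molecular graph.
    adj = {i: [] for i in range(len(atoms))}
    for i, j in bonds:
        adj[i].append(j)
        adj[j].append(i)

    bits = set()

    def extend(path):
        labels = [atoms[k] for k in path]
        # A path and its reverse describe the same substructure,
        # so keep only the lexicographically smaller label sequence.
        key = "-".join(min(labels, labels[::-1]))
        # crc32 is deterministic across runs, unlike the built-in hash().
        bits.add(zlib.crc32(key.encode()) % n_bits)
        if len(path) < max_len:
            for nxt in adj[path[-1]]:
                if nxt not in path:  # simple paths only, no revisits
                    extend(path + [nxt])

    for start in adj:
        extend([start])
    return bits


def tanimoto(a, b):
    """Tanimoto (Jaccard) coefficient between two sets of 'on' bits."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Heavy-atom graphs (hydrogens omitted) for three small molecules.
ethanol = path_fingerprint(["C", "C", "O"], [(0, 1), (1, 2)])
propanol = path_fingerprint(["C", "C", "C", "O"], [(0, 1), (1, 2), (2, 3)])
ethylamine = path_fingerprint(["C", "C", "N"], [(0, 1), (1, 2)])

print("ethanol vs propanol:  ", tanimoto(ethanol, propanol))
print("ethanol vs ethylamine:", tanimoto(ethanol, ethylamine))
```

Because distinct paths can hash to the same bit, a set bit does not guarantee that a particular substructure is present. Production fingerprints typically use far wider bit vectors to keep such collisions rare.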
Over the years, the approaches that have been employed to learn SAR models
have evolved from the initial regression-based techniques used by Hansch
et al., to approaches that utilize more complex statistical model-estimation
procedures. These procedures include partial least squares (PLS), linear
discriminant analysis, Bayesian models, and approaches that employ various
machine-learning/pattern recognition methods such as recursive partitioning,
neural networks, and support vector machines. 2,4 Another class of methods
for building SAR models operates directly on the structure of the chemical
compound. These methods employ inductive logic programming (ILP) (such
as the WARMR system 12) or heuristics (such as the MultiCASE system 2) to
automatically identify a small number of chemical substructures that relate to
the compounds' biological activity. These substructures are used as descriptors. Finally,
in recent years, a new class of machine-learning techniques has been developed
that builds SAR models by measuring the similarity between two compounds
directly on their molecular graphs. 13,14 These techniques measure
the similarity by using powers of adjacency matrices, 13 calculating Markov
random walks on the underlying graphs, 13 finding the maximum common