Chapter 8
Molecular Fragments and Fingerprints
8.1 Introduction
Simplified Molecular Input Line Entry System (SMILES) is a simple yet complete
description of molecular structure that considers the atoms and bonds in a
molecule. Using unique canonical SMILES, an indexed table lookup of a structure
can be done quickly. For example, the SQL to look up phenol is:
Select cansmi From atable Where cansmi=cansmiles('c1ccccc1O');
When the table contains unique canonical SMILES in an indexed column cansmi,
and the cansmiles function provides the proper canonical SMILES for phenol,
this lookup is extremely fast.
It is often necessary to find all structures that contain a given
substructure. Consider the substructure search to find all structures that
contain the phenol group. Using the matches function described in a previous
chapter, the SQL to carry out such a substructure search is:
Select cansmi From atable Where matches(cansmi,'c1ccccc1O');
This query cannot make use of the index on the column cansmi. Every row of
the table must be examined to see if the matches function succeeds. This
is a time-consuming process compared to a direct, indexed lookup.
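The difference between the two queries can be seen by asking a database for its query plan. The following is a minimal sketch using Python's built-in sqlite3 module rather than the database assumed in this chapter. The table contents, the index name atable_cansmi, and the matches placeholder are assumptions made for illustration. The placeholder is a naive substring test registered only so the planner accepts the query; it is not a real substructure match. The canonical SMILES is passed as a bound parameter instead of calling cansmiles.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE atable (cansmi TEXT)")
conn.execute("CREATE UNIQUE INDEX atable_cansmi ON atable (cansmi)")
conn.executemany(
    "INSERT INTO atable VALUES (?)",
    [("Oc1ccccc1",), ("c1ccccc1",), ("CCO",)],  # illustrative rows only
)

# Placeholder (assumption): stands in for the chapter's matches function so
# that the query can be planned. A substring test is NOT substructure matching.
conn.create_function("matches", 2, lambda smi, query: int(query in smi))


def plan(sql, params=()):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[-1] for row in rows]


# Equality on the indexed column: the planner can seek directly in the index.
lookup_plan = plan("SELECT cansmi FROM atable WHERE cansmi = ?", ("Oc1ccccc1",))

# Function applied to the column: the planner has to visit every row.
search_plan = plan(
    "SELECT cansmi FROM atable WHERE matches(cansmi, ?)", ("c1ccccc1O",)
)

print(lookup_plan)
print(search_plan)
```

In SQLite the first plan is reported as a SEARCH that uses the index, while the second is reported as a SCAN. The exact wording of the plan varies between SQLite versions, and other database systems report the same distinction in their own terms.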
8.2 Fragments
One way to speed up a substructure search is to use a reduced representation
of molecular structure and a corresponding alternative to the matches
function. If this reduced representation of molecular structure is
sufficiently simple and if the alternative matches function is sufficiently
fast, they can be used as a filter to quickly decide which rows need more
careful examination using the full matches function. Other rows for which the
reduced representation does not match can be quickly passed over.
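The filtering idea can be sketched independently of any chemistry. In the toy example below, all names are hypothetical: rows are plain strings, the "full match" is a substring test, and the "cheap screen" checks only that every character of the query occurs somewhere in the row. The point it is meant to show is that the screen may let through rows that later fail the full match, but it must never reject a row that the full match would accept.

```python
def screened_search(rows, query, cheap_screen, full_match):
    """Apply full_match only to rows that pass cheap_screen.

    Returns the hits and the number of times full_match was called.
    """
    hits = []
    full_calls = 0
    for row in rows:
        if not cheap_screen(row, query):
            continue  # passed over without the expensive test
        full_calls += 1
        if full_match(row, query):
            hits.append(row)
    return hits, full_calls


# Toy stand-ins (assumptions), not chemistry:
def toy_screen(row, query):
    # Necessary condition: a row containing the query holds all its characters.
    return set(query) <= set(row)


def toy_full_match(row, query):
    return query in row


rows = ["abcde", "edcba", "xyz", "abc"]
hits, full_calls = screened_search(rows, "abc", toy_screen, toy_full_match)
print(hits, full_calls)
```

Here "xyz" is passed over by the screen, while "edcba" passes the screen and is only rejected by the full test. The saving depends on how many rows the screen can eliminate and on how much cheaper it is than the full match.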
One might use molecular formula as a simpler representation of molecular
structure. Ignoring H atoms, the molecular formula for phenol is C6O.
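As a rough illustration of how such a formula could serve as a filter, the sketch below counts heavy atoms directly from a SMILES string and tests whether a candidate has at least as many atoms of each element as the query. The function names and the small regular-expression tokenizer are assumptions made for this example. The tokenizer handles only the common organic-subset atoms and simple bracket atoms, and it is not a full SMILES parser or a replacement for the matches function.

```python
import re
from collections import Counter

# Bracket atoms such as [nH], [NH4+] or [13CH4] are matched whole; outside
# brackets, two-letter symbols are tried before one-letter ones.
_ATOM = re.compile(r"\[\d*([A-Za-z][a-z]?)[^\]]*\]|(Cl|Br|[BCNOPSFI]|[bcnops])")


def heavy_atom_counts(smiles):
    """Count non-hydrogen atoms per element in a SMILES string."""
    counts = Counter()
    for m in _ATOM.finditer(smiles):
        symbol = (m.group(1) or m.group(2)).capitalize()  # aromatic c -> C
        if symbol != "H":
            counts[symbol] += 1
    return counts


def heavy_atom_formula(counts):
    """Format counts as a formula, carbon first, then alphabetical order."""
    order = sorted(counts, key=lambda el: (el != "C", el))
    return "".join(el + (str(counts[el]) if counts[el] > 1 else "") for el in order)


def formula_may_contain(candidate_smiles, query_smiles):
    """True if the candidate has enough atoms of every element in the query."""
    have = heavy_atom_counts(candidate_smiles)
    need = heavy_atom_counts(query_smiles)
    return all(have[el] >= n for el, n in need.items())


phenol = "c1ccccc1O"
print(heavy_atom_formula(heavy_atom_counts(phenol)))
print(formula_may_contain("CC(=O)Oc1ccccc1C(=O)O", phenol))  # aspirin
print(formula_may_contain("c1ccccc1", phenol))               # benzene
print(formula_may_contain("CCCCCCO", phenol))                # 1-hexanol
```

Benzene is rejected because it has no oxygen. 1-Hexanol passes the formula test even though it does not contain the phenol group, so rows that pass would still need the full matches function.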