Every structure containing phenol as a substructure must have a molecu-
lar formula with 6 or more C atoms and 1 or more O atoms. Structures with
fewer C or O atoms can be immediately ruled out as possible matches for
phenol. Of the remaining structures, there will be some that satisfy the
molecular formula comparison yet do not match phenol. The more time-
consuming matches function will be used only for the final determination.
Overall, the process of finding substructure matches will be faster. Exactly
how much faster depends on the number of rows that can be quickly ruled
out using the faster molecular formula comparison. It also depends, of
course, on how fast the molecular formula comparison can be done.
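
The two-stage idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the atom counts are written out by hand, and `slow_match` is a hypothetical stand-in (a hand-made answer table) for the real, time-consuming matches function. The names `ROWS`, `formula_ok`, and `search` are likewise invented for this sketch.

```python
from typing import Callable, Dict, List, Tuple

# (SMILES, atom counts) pairs; the counts are written out by hand here.
Row = Tuple[str, Dict[str, int]]

ROWS: List[Row] = [
    ("c1ccccc1O", {"C": 6, "O": 1}),   # phenol
    ("CCO", {"C": 2, "O": 1}),         # ethanol
    ("c1ccccc1", {"C": 6, "O": 0}),    # benzene
    ("OC1CCCCC1", {"C": 6, "O": 1}),   # cyclohexanol
    ("Cc1ccccc1O", {"C": 7, "O": 1}),  # o-cresol
    ("CC(=O)O", {"C": 2, "O": 2}),     # acetic acid
]

# ASSUMPTION: stand-in for the real matches function. A real system would
# run a substructure match; this table only lists the known answers.
KNOWN_PHENOLS = {"c1ccccc1O", "Cc1ccccc1O"}


def slow_match(smiles: str) -> bool:
    """Hypothetical placeholder for the expensive substructure match."""
    return smiles in KNOWN_PHENOLS


def formula_ok(counts: Dict[str, int], required: Dict[str, int]) -> bool:
    """Cheap test: the row has at least the required count of each atom type."""
    return all(counts.get(element, 0) >= n for element, n in required.items())


def search(
    rows: List[Row],
    required: Dict[str, int],
    full_match: Callable[[str], bool],
) -> Tuple[List[str], int]:
    """Return the hits and the number of times the expensive match was called."""
    hits: List[str] = []
    full_calls = 0
    for smiles, counts in rows:
        if not formula_ok(counts, required):
            continue  # ruled out by the fast molecular formula comparison
        full_calls += 1
        if full_match(smiles):
            hits.append(smiles)
    return hits, full_calls


hits, full_calls = search(ROWS, {"C": 6, "O": 1}, slow_match)
```

On this toy data only three of the six rows reach the slow matcher. Cyclohexanol is one of them: it satisfies the molecular formula comparison yet does not match phenol, which is why the final determination still needs the full match.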
One way to do a quick molecular formula comparison is to store the
molecular formula not as a string, such as C6O, but as a set of integer
columns, one per atom type. Each row in a table of molecular structures
would contain SMILES, but the table would also have additional columns
containing the count of each atom type. These columns could be indexed to speed up
the molecular formula comparison. The SQL used to search for structures
containing phenol becomes as follows:
Select cansmi From atable Where C_count>=6 and O_count>=1
And matches(cansmi, 'c1ccccc1O');
The columns C_count and O_count would have been precomputed
when the row for each molecular structure was added to the table.
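
As a rough sketch of how such columns might be precomputed, the following Python uses the standard `sqlite3` module as a stand-in for PostgreSQL. Several things here are assumptions and not part of the text: `atom_counts` is a hypothetical helper with a deliberately simplified SMILES tokenizer (organic-subset atoms and bracket atoms only, hydrogens inside brackets not counted), the input SMILES are stored as given without canonicalization, and the matches step is left out because SQLite has no such function.

```python
import re
import sqlite3
from collections import Counter
from typing import List

# One atom token: a bracket atom such as [NH4+] or [13C] (group 1 captures
# the element symbol), or an unbracketed organic-subset atom (group 2).
# ASSUMPTION: simplified tokenizer, not a full SMILES parser.
_ATOM = re.compile(
    r"\[\d*([A-Z][a-z]?|[a-z]{1,2})[^\]]*\]"
    r"|(Cl|Br|[BCNOPSFI]|[bcnops])"
)


def atom_counts(smiles: str) -> Counter:
    """Count atoms by element; aromatic (lower case) symbols are folded in."""
    counts: Counter = Counter()
    for token in _ATOM.finditer(smiles):
        symbol = token.group(1) or token.group(2)
        counts[symbol.capitalize()] += 1
    return counts


def build_table(smiles_list: List[str]) -> sqlite3.Connection:
    """Create atable with precomputed, indexed C_count and O_count columns."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE atable (cansmi TEXT, C_count INTEGER, O_count INTEGER)"
    )
    db.execute("CREATE INDEX atable_c_count ON atable (C_count)")
    db.execute("CREATE INDEX atable_o_count ON atable (O_count)")
    for smiles in smiles_list:
        counts = atom_counts(smiles)  # computed once, when the row is added
        db.execute(
            "INSERT INTO atable VALUES (?, ?, ?)",
            (smiles, counts["C"], counts["O"]),
        )
    return db


def prefilter(db: sqlite3.Connection, c_min: int, o_min: int) -> List[str]:
    """Run only the fast atom-count part of the search."""
    cursor = db.execute(
        "SELECT cansmi FROM atable WHERE C_count >= ? AND O_count >= ?",
        (c_min, o_min),
    )
    return [row[0] for row in cursor]


db = build_table(["c1ccccc1O", "CCO", "c1ccccc1", "OC1CCCCC1", "ClCCl"])
candidates = prefilter(db, 6, 1)
```

The rows returned by `prefilter` are only candidates. In the query shown in the text, each of them would still be passed to the matches function for the final determination.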
Because every molecular structure is composed of atoms, the atom
counts corresponding to molecular formula form a complete set of molec-
ular fragments. However, the atom counts are not a very discriminating
filter. Another approach is to construct a set of molecular fragments that
are complex enough to discriminate various structures from one another
yet simple enough to be used for fast filtering before using the full matches
function.
Constructing a useful set of molecular fragments requires knowledge
of the types of structures that will appear in the database. This will be
discussed in a later section of this chapter. First, consider how such a set of
fragments can be used to filter structures during a substructure search.
8.2.1 Fragment Keys
Suppose a representative set of N fragments has been defined. A bit
string* containing N bits can be used to represent the presence or absence
of each fragment in any molecular structure. This alternative representation
of molecular structure is called a fragment key. It can be used as a filter
* The bit or bit varying data type in standard SQL will be used in the examples in this and fol-
lowing chapters. Oracle does not support this data type. PostgreSQL syntax will be used.
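
To make the bit-string idea concrete, here is a minimal Python sketch that packs fragment presence flags into an integer used as an N-bit string. The fragment list, the function names, and the hand-assigned fragment sets are all hypothetical. Deciding which fragments a real structure contains would itself require substructure matching, which is not shown. The screening rule in `may_contain` (every bit set in the query key must also be set in the candidate key) is one common way to use such keys as a filter and is an assumption here, not something taken from the text.

```python
from typing import Iterable, List

# ASSUMPTION: a hypothetical set of N = 4 fragments. Bit i of a key records
# the presence (1) or absence (0) of FRAGMENTS[i].
FRAGMENTS: List[str] = ["aromatic ring", "hydroxyl", "carbonyl", "amine"]


def fragment_key(present: Iterable[str]) -> int:
    """Pack the fragments present in a structure into an integer bit string."""
    key = 0
    for name in present:
        key |= 1 << FRAGMENTS.index(name)
    return key


def key_as_bits(key: int) -> str:
    """Render a key as text, with character i showing the bit for FRAGMENTS[i]."""
    width = len(FRAGMENTS)
    return format(key, "0{}b".format(width))[::-1]


def may_contain(candidate_key: int, query_key: int) -> bool:
    """Screen: False means the candidate lacks a fragment the query has."""
    return (candidate_key & query_key) == query_key


# Fragment sets assigned by hand for illustration only.
phenol_key = fragment_key(["aromatic ring", "hydroxyl"])
benzene_key = fragment_key(["aromatic ring"])
hydroxybenzaldehyde_key = fragment_key(["aromatic ring", "hydroxyl", "carbonyl"])

# Benzene lacks the hydroxyl bit, so it is ruled out without a full match.
benzene_survives = may_contain(benzene_key, phenol_key)
# This candidate has every fragment bit of phenol, so it goes on to the full match.
aldehyde_survives = may_contain(hydroxybenzaldehyde_key, phenol_key)
```

A candidate that passes this screen is not guaranteed to contain the query structure. As with the atom counts, the key only rules rows out, and the full match makes the final determination.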