8.2.2 MACCS Keys and Other Fragment Keys
One popular set of fragments has been published by MDL. 1 It is commonly
known as the MACCS public 166 keys. The Appendix shows a table of 164
rows, analogous to the fragments table defined and used in the example in
the previous section. Using this table and a slight modification to the key
function defined above, a public166keys function can be easily defined.
That function is also contained in the Appendix.
Any other set of useful fragments can be created and used as a fragment
key to prescreen rows in a table during a substructure match. The
public166keys table contains entries for every element in the periodic
table, although the bulk of the table is designed to distinguish various
organic compounds from one another. In a database containing a majority
of other types of compounds, a different set of fragment keys is
appropriate. The point here is not to provide the best set of fragment keys or
even to recommend one set over another, but rather to illustrate a general
method for computing fragment keys using simple SQL and a relational
database table to define the fragments. The advantage of this approach is
that the algorithm and code that produce the fragment key are not hidden
in some external program. They are an integral part of the database, with the
fragment table clearly exposed for verification, modification, and use in
other ways.
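The general method can be sketched in a few lines using Python's built-in sqlite3 module. This is only an illustration under stated assumptions, not the public166keys function from the Appendix: the fragments table holds six hypothetical element-presence fragments, and the has_atom function is a toy stand-in for a real substructure matching function (it handles only simple SMILES without bracket atoms). The names fragments, structure, fkey, fragment_key, and prescreen are all invented for this example.

```python
import re
import sqlite3

# Toy stand-in for a real substructure matcher (an assumption of this sketch):
# reports whether an element occurs among the atoms of a simple SMILES string.
ATOM_RE = re.compile(r"Cl|Br|[BCNOPSFI]|[bcnops]")

def has_atom(smiles, symbol):
    atoms = {a.capitalize() for a in ATOM_RE.findall(smiles)}
    return 1 if symbol in atoms else 0

conn = sqlite3.connect(":memory:")
conn.create_function("has_atom", 2, has_atom)
conn.executescript("""
CREATE TABLE fragments (bit INTEGER PRIMARY KEY, symbol TEXT);
INSERT INTO fragments VALUES
    (0, 'C'), (1, 'N'), (2, 'O'), (3, 'S'), (4, 'Cl'), (5, 'Br');
CREATE TABLE structure (name TEXT, smiles TEXT, fkey INTEGER);
""")

def fragment_key(smiles):
    """Set one bit for every row of the fragments table that matches."""
    row = conn.execute(
        "SELECT COALESCE(SUM(1 << bit), 0) FROM fragments "
        "WHERE has_atom(?, symbol)",
        (smiles,),
    ).fetchone()
    return row[0]

# Store each structure together with its precomputed fragment key.
for name, smiles in [
    ("ethanol", "CCO"),
    ("chlorobenzene", "c1ccccc1Cl"),
    ("glycine", "NCC(=O)O"),
]:
    conn.execute(
        "INSERT INTO structure VALUES (?, ?, ?)",
        (name, smiles, fragment_key(smiles)),
    )

def prescreen(query_smiles):
    """Return names of rows whose key contains every bit of the query key."""
    qkey = fragment_key(query_smiles)
    rows = conn.execute(
        "SELECT name FROM structure WHERE (fkey & ?) = ? ORDER BY name",
        (qkey, qkey),
    )
    return [r[0] for r in rows]
```

Calling prescreen("C=O") keeps only the rows that contain both carbon and oxygen. A row that passes is merely a candidate: ethanol passes this prescreen even though it has no carbonyl group, so the full substructure match must still be run on the surviving rows.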
8.3 Fingerprints
Another approach for generating bit string keys does not use a table for
fragments at all. Instead, it uses an algorithm to fragment each structure
and record each fragment as a bit pattern. Rather than assigning each
fragment to a particular bit number, as is done in the fragment key tables
above, some method of encoding each fragment is used. One approach is
to use the SMILES string that represents the fragment and apply a hash
function 2 to produce a fingerprint.
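A minimal sketch of the hashing idea follows. The text does not say which hash function or fingerprint length is used, so the choice of SHA-1 and 1024 bits here is an assumption, as are the function names fragment_bit and fingerprint. The fragment SMILES strings are taken as given; producing them is a separate step.

```python
import hashlib

def fragment_bit(fragment_smiles, n_bits=1024):
    """Map a fragment's SMILES string to a bit position via a hash."""
    digest = hashlib.sha1(fragment_smiles.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_bits

def fingerprint(fragments, n_bits=1024):
    """OR together the bits of all fragments into one integer bit string."""
    fp = 0
    for frag in fragments:
        fp |= 1 << fragment_bit(frag, n_bits)
    return fp

# Fragments of a query are a subset of the fragments of any structure
# that contains it, so the query bits must all be set in that structure.
query = fingerprint(["C", "O", "CO"])
target = fingerprint(["C", "O", "CO", "CC", "CCO"])
is_candidate = (target & query) == query
```

Because different fragments can hash to the same bit, a fingerprint of this kind can pass structures that do not actually match, but it never rejects one that does. That is what makes it usable as a prescreen.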
One method for producing these fragments first considers each atom
as a fragment of size one, similar to the molecular formula approach
described above. Considering the atoms bonded to each atom then produces
two-atom fragments, and extending along further bonds produces larger
multiple-atom fragments. Using
this approach exhaustively and following every bond of every atom would
produce every possible fragment of every possible size for each structure.
This would be a large number of fragments, even for reasonably sized
structures. The number of bits required to store this information would
be correspondingly large. At some point, the size and complexity of the
bit string representation would make the prescreening process too slow
to be useful. To avoid that possibility, an upper limit on the size of each
fragment is imposed.
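The effect of the size limit can be illustrated with a small sketch that enumerates linear paths through a molecular graph. It is a simplification under stated assumptions: only unbranched paths are generated, bond orders are ignored, and the molecule is supplied as a hypothetical list of atom symbols plus a list of bonded index pairs rather than parsed from SMILES. The function name enumerate_paths and the parameter max_atoms are invented for this example.

```python
def enumerate_paths(atoms, bonds, max_atoms):
    """Return the distinct linear fragments with at most max_atoms atoms.

    atoms: list of element symbols, e.g. ["C", "C", "O"]
    bonds: list of (i, j) index pairs for bonded atoms
    """
    neighbors = {i: [] for i in range(len(atoms))}
    for i, j in bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)

    fragments = set()

    def extend(path):
        symbols = tuple(atoms[i] for i in path)
        # A path read forward or backward is the same fragment,
        # so keep only one canonical direction.
        fragments.add(min(symbols, symbols[::-1]))
        if len(path) == max_atoms:
            return  # upper limit on fragment size reached
        for nxt in neighbors[path[-1]]:
            if nxt not in path:  # never revisit an atom
                extend(path + [nxt])

    for start in range(len(atoms)):
        extend([start])
    return fragments

# Ethanol: C-C-O
ethanol_atoms = ["C", "C", "O"]
ethanol_bonds = [(0, 1), (1, 2)]
small = enumerate_paths(ethanol_atoms, ethanol_bonds, max_atoms=2)
full = enumerate_paths(ethanol_atoms, ethanol_bonds, max_atoms=3)
```

For ethanol, a limit of two atoms yields the fragments C, O, CC, and CO, and raising the limit to three adds only CCO. For larger, branched, or ring-containing structures the number of paths grows much faster with the limit, which is why a cap is needed.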