Molecular Fragments and Fingerprints - Design and Use of Relational Databases in Chemistry


8.2.2 MACCS Keys and Other Fragment Keys

One popular set of fragments has been published by MDL [1]. It is commonly
known as the MACCS public 166 keys. The Appendix shows a table of 164
rows, analogous to the fragments table defined and used in the example in
the previous section. Using this table and a slight modification to the key
function defined above, a public166keys function can be easily defined.
That function is also contained in the Appendix.

Any other set of useful fragments can be created and used as a fragment
key to prescreen rows in a table during a substructure match. The
public166keys table contains entries for every element in the periodic
table, although the bulk of the table is designed to distinguish various
organic compounds from one another. In a database that contains mostly
other types of compounds, a different set of fragment keys would be more
appropriate. The point here is not to provide the best set of fragment
keys or even to recommend one set over another, but rather to illustrate a
general method for computing fragment keys using simple SQL and a
relational database table to define the fragments. The advantage of this
approach is that the algorithm and code that produce the fragment key are
not hidden in some external program. They are an integral part of the
database, with the fragment table clearly exposed for verification,
modification, and use in other ways.
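
The prescreening step mentioned above can be sketched in a few lines. This is
only an illustration, not the function from the Appendix: the name
may_contain, the compound names, and the 4-bit keys are all hypothetical, and
the keys are held as Python integers rather than database bit strings. The
idea is that a row can contain the query substructure only if every bit set
in the query key is also set in the row's key.

```python
# Minimal sketch of fragment-key prescreening (all names and keys are
# hypothetical; real keys would be computed from the fragments table).

def may_contain(candidate_key: int, query_key: int) -> bool:
    """Return True if every bit set in query_key is also set in candidate_key."""
    return (candidate_key & query_key) == query_key

# Illustrative 4-bit keys; bit i is set when fragment i occurs in the structure.
rows = {
    "ethanol": 0b0111,
    "ethane": 0b0001,
    "acetic acid": 0b1111,
}
query_key = 0b0110

# Only the survivors need to go on to the full substructure match.
survivors = [name for name, key in rows.items() if may_contain(key, query_key)]
print(survivors)
```

Rows that fail this test cannot match and are skipped. Rows that pass are
only candidates and still require the full substructure match.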

8.3 Fingerprints

Another approach for generating bit string keys does not use a table of
fragments at all. Instead, it uses an algorithm to fragment each structure
and record each fragment as a bit pattern. Rather than assigning each
fragment to a particular bit number, as is done in the fragment key tables
above, some method of encoding each fragment is used. One approach is
to take the SMILES string that represents the fragment and apply a hash
function [2] to produce a fingerprint.
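
As a rough sketch of the hashing idea, the following maps each fragment
SMILES to one bit of a fixed-length fingerprint. The choice of SHA-1, the
1024-bit length, and the function names fragment_bit and fingerprint are
assumptions made for illustration; the text does not specify which hash
function or fingerprint size is used.

```python
import hashlib

def fragment_bit(fragment_smiles: str, nbits: int = 1024) -> int:
    """Map a fragment SMILES to a bit position (assumed scheme: SHA-1 mod nbits)."""
    digest = hashlib.sha1(fragment_smiles.encode("ascii")).digest()
    return int.from_bytes(digest[:4], "big") % nbits

def fingerprint(fragments, nbits: int = 1024) -> int:
    """Set one bit per fragment; the fingerprint is held as a Python int."""
    bits = 0
    for frag in fragments:
        bits |= 1 << fragment_bit(frag, nbits)
    return bits

# Fragments of ethanol (CCO), listed by hand for this example.
fp = fingerprint(["C", "O", "CC", "CO", "CCO"])
print(bin(fp).count("1"), "bits set")
```

Because different fragments can hash to the same bit, a fingerprint of this
kind can only rule structures out during prescreening; it cannot confirm a
match on its own.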

One method for producing these fragments first considers each atom
as a fragment of size one, similar to the molecular formula approach
described above. Considering the atoms bonded to each atom then produces
two-atom fragments, and extending these in the same way produces larger
multiple-atom fragments. Applying this approach exhaustively, following
every bond of every atom, would produce every possible fragment of every
possible size for each structure. This would be a large number of
fragments, even for reasonably sized structures, and the number of bits
required to store this information would be correspondingly large. At some
point, the size and complexity of the bit string representation would make
the prescreening process too slow to be useful. To avoid that possibility,
an upper limit on the size of each fragment is imposed.
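
The growth procedure and its size limit can be sketched as follows. This is
a simplified illustration under several assumptions: the function name
linear_fragments is hypothetical, only unbranched paths are enumerated, bond
orders are ignored, and each fragment is labeled by its atom symbols read in
the lexicographically smaller direction rather than by a true canonical
SMILES.

```python
def linear_fragments(atoms, bonds, max_atoms):
    """Collect labels of all unbranched paths with at most max_atoms atoms.

    atoms is a list of element symbols; bonds is a list of (i, j) index pairs.
    """
    neighbors = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)

    found = set()

    def extend(path):
        symbols = [atoms[i] for i in path]
        # A path and its reverse are the same fragment; keep one label.
        found.add(min("".join(symbols), "".join(reversed(symbols))))
        if len(path) == max_atoms:
            return  # upper limit on fragment size reached
        for nxt in neighbors[path[-1]]:
            if nxt not in path:
                extend(path + [nxt])

    for start in range(len(atoms)):
        extend([start])  # every atom is a fragment of size one
    return found

# Heavy atoms of ethanol: C-C-O
atoms = ["C", "C", "O"]
bonds = [(0, 1), (1, 2)]
for limit in (1, 2, 3):
    print(limit, sorted(linear_fragments(atoms, bonds, limit)))
```

Raising the limit adds the larger fragments on top of the smaller ones, which
is why the fragment count, and with it the number of bits needed, grows
quickly for real structures.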
