a substructure search and not just a direct structure lookup. For example,
the simplest molecule containing the cyano group is hydrogen cyanide.
Although the canonical SMILES for hydrogen cyanide is C#N and for
acetonitrile is CC#N, it is not possible to find all structures that contain a
cyano group simply using the SQL clause where smiles like '%C#N%'.
Within a given SMILES, the cyano group is not always spelled C#N. For example,
the canonical SMILES for thiocyanic acid is C(#N)S and not SC#N,
even though both C(#N)S and SC#N represent thiocyanic acid.
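The failure of the text-based search can be reproduced with a small sketch. It uses Python's built-in sqlite3 module purely for illustration; the table name structure and its columns are assumptions, not part of the original text.

```python
import sqlite3

# Hypothetical in-memory table holding the three SMILES discussed above.
conn = sqlite3.connect(":memory:")
conn.execute("create table structure (name text, smiles text)")
conn.executemany(
    "insert into structure values (?, ?)",
    [
        ("hydrogen cyanide", "C#N"),
        ("acetonitrile", "CC#N"),
        ("thiocyanic acid", "C(#N)S"),  # cyano group is present, but not spelled C#N
    ],
)

# Plain text matching: only rows whose SMILES literally contains "C#N" are found.
hits = [
    row[0]
    for row in conn.execute(
        "select name from structure where smiles like '%C#N%' order by name"
    )
]
print(hits)  # ['acetonitrile', 'hydrogen cyanide']
```

Thiocyanic acid is missed even though it contains a cyano group, because like compares characters rather than atoms and bonds.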
Another SQL extension is needed that can understand the molecular
structural nature of the SMILES string and treat it like more than just a
text string. Suppose there is a function matches(A,B) that returns true
when structure A contains structure B. Both these structures could be rep-
resented as SMILES and the matches function itself would understand
the molecular nature properly. Then matches('C(#N)S', 'C#N') would
be true, as would matches('SC#N', 'N#C'), as intended. The matches
function can be used to find all cyano-containing structures in a table
using an SQL clause such as where matches(cansmi, 'C#N').
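To make the idea of a structure-aware matches function concrete, here is a toy sketch, not the implementation the text has in mind. It assumes a very small SMILES subset (single-letter organic atoms plus Cl and Br, the bond symbols -, = and #, and parenthesized branches; no rings, brackets, charges, or aromatic atoms), ignores hydrogens, and registers the function with Python's sqlite3 module. The helper parse, the constant names, and the table structure are all hypothetical.

```python
import sqlite3

ELEMENTS = ("Cl", "Br", "B", "C", "N", "O", "P", "S", "F", "I")  # two-letter symbols first
BOND_ORDER = {"-": 1, "=": 2, "#": 3}

def parse(smiles):
    """Parse a tiny SMILES subset into (atom symbols, {atom pair: bond order})."""
    atoms, bonds, stack = [], {}, []
    prev, order, i = None, 1, 0
    while i < len(smiles):
        ch = smiles[i]
        if ch == "(":                  # open a branch at the current atom
            stack.append(prev)
            i += 1
        elif ch == ")":                # close it and return to the branch point
            prev = stack.pop()
            i += 1
        elif ch in BOND_ORDER:         # explicit bond symbol for the next atom
            order = BOND_ORDER[ch]
            i += 1
        else:
            sym = next((e for e in ELEMENTS if smiles.startswith(e, i)), None)
            if sym is None:
                raise ValueError(f"unsupported SMILES token at {i}: {smiles!r}")
            atoms.append(sym)
            cur = len(atoms) - 1
            if prev is not None:
                bonds[frozenset((prev, cur))] = order
            prev, order = cur, 1       # an unmarked bond defaults to single
            i += len(sym)
    return atoms, bonds

def matches(target, query):
    """Return 1 if the query graph is a subgraph of the target graph, else 0."""
    t_atoms, t_bonds = parse(target)
    q_atoms, q_bonds = parse(query)

    def extend(mapping):               # mapping[q] = target atom chosen for query atom q
        q = len(mapping)
        if q == len(q_atoms):
            return True
        for t in range(len(t_atoms)):
            if t in mapping or t_atoms[t] != q_atoms[q]:
                continue
            # Every query bond back to an already-mapped atom must exist in the
            # target with the same bond order.
            ok = all(
                t_bonds.get(frozenset((mapping[p], t))) == q_bonds[frozenset((p, q))]
                for p in range(q)
                if frozenset((p, q)) in q_bonds
            )
            if ok and extend(mapping + [t]):
                return True
        return False

    return int(extend([]))

conn = sqlite3.connect(":memory:")
conn.create_function("matches", 2, matches)   # expose the Python function to SQL
conn.execute("create table structure (name text, cansmi text)")
conn.executemany(
    "insert into structure values (?, ?)",
    [
        ("hydrogen cyanide", "C#N"),
        ("acetonitrile", "CC#N"),
        ("thiocyanic acid", "C(#N)S"),
        ("ethanol", "CCO"),
    ],
)
hits = [
    row[0]
    for row in conn.execute(
        "select name from structure where matches(cansmi, 'C#N') order by name"
    )
]
print(hits)  # ['acetonitrile', 'hydrogen cyanide', 'thiocyanic acid']
```

Because the comparison is done on atoms and bonds rather than characters, the spelling of the cyano group no longer matters. A production system would rely on a full cheminformatics toolkit instead of this backtracking sketch.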
Sometimes the desired substructure is not as simple as a cyano group.
For example, to search for di-halogen-substituted carbons, one could use
an SQL clause where matches(cansmi, 'FCF') or matches(cansmi,
'FCBr') or …. The clause would have to continue through all possible combinations of
the halogens, which is tedious. Weininger 10 proposed yet another language,
SMiles ARbitrary Target Specification (SMARTS), to succinctly
specify substructural searches. A SMARTS for di-halogen-substituted
carbon is [F,Cl,Br,I]C[F,Cl,Br,I]. The comma-separated atomic symbols
within brackets allow any one of the atoms in the list. So the SQL clause
where matches(cansmi, '[F,Cl,Br,I]C[F,Cl,Br,I]') will accomplish
the search for di-halogen substituted carbons. There are many other oper-
ators and symbols defined for SMARTS. These allow specification of the
hydrogen atom count, heavy atom count, charge, bond types, and other
aspects of atoms and bonds in substructure searches.
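The difference between the explicit clause and the SMARTS clause can be illustrated by generating both as strings. This is only a sketch: the function names are hypothetical, and it assumes, as the matches('SC#N', 'N#C') example suggests, that the direction in which a pattern is written does not matter, so FCBr and BrCF count as one pattern.

```python
from itertools import combinations_with_replacement

HALOGENS = ["F", "Cl", "Br", "I"]

def dihalo_smiles_patterns():
    """One SMILES pattern per unordered pair of halogens on a single carbon."""
    return [f"{a}C{b}" for a, b in combinations_with_replacement(HALOGENS, 2)]

def explicit_clause(column="cansmi"):
    """The tedious form: one matches() term per halogen pair, joined with 'or'."""
    terms = [f"matches({column}, '{p}')" for p in dihalo_smiles_patterns()]
    return "where " + " or ".join(terms)

def smarts_clause(column="cansmi"):
    """The SMARTS form: a single pattern using comma-separated atom lists."""
    atom_list = "[" + ",".join(HALOGENS) + "]"
    return f"where matches({column}, '{atom_list}C{atom_list}')"

print(len(dihalo_smiles_patterns()))  # 10
print(explicit_clause())
print(smarts_clause())  # where matches(cansmi, '[F,Cl,Br,I]C[F,Cl,Br,I]')
```

Four halogens already require ten separate terms in the explicit form, while the SMARTS form stays a single pattern no matter how long each atom list grows.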
The matches(A, B) function is properly defined with A representing a
structure as SMILES and B representing a substructure as SMARTS. Of
course, B may also be a SMILES.* In this case, matches will be true when
B is a substructure of A. All structures in a table for which CC(O)C is a substructure
can be found by using the SQL clause where matches(cansmi,
'CC(O)C'). All these structures are properly called superstructures, yet the
search itself is commonly called a substructure search, because it is a search
by substructure. Notice what happens if the arguments are reversed, as in
matches('CC(O)C', cansmi). All rows having cansmi as a substructure
of CC(O)C will be found. These are called fragments of CC(O)C, although
they could properly be called substructures of CC(O)C.
* Convince yourself that any valid SMILES is also a valid SMARTS, but not vice versa.