The matches(A,B) function returns true when SMARTS B matches
SMILES A. It is sometimes useful to know how many times B matches A.
For this, a new function is defined: count_matches(A,B). It returns an
integer, possibly 0. For example, count_matches('CC(O)C', 'C') returns
3. The SQL clause where count_matches(cansmi, '[F,Cl,Br,I]') > 2 will find all
structures having more than 2 halogen atoms. In later chapters, examples
will show how this function can be used to compute molecular properties
and screen structures that conform to Lipinski's Rule of 5 [11].
Another useful SQL extension function is list_matches(A,B).
This returns an array of integers telling which atoms in SMILES A were
matched by SMARTS B. For example, list_matches('CC(O)C', 'C')
returns the array {1,2,4}. This list can be used for additional processing of
the matched SMILES, for example, to color the matched atoms in a drawing
or viewing application.
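To make the behavior of these two functions concrete, the following Python sketch mimics them for the simplest possible case: a pattern that tests a single atom by element symbol, such as C or [F,Cl,Br,I]. It is not a SMARTS implementation and is not how the SQL functions are built; the names toy_list_matches and toy_count_matches, and the tokenizer itself, are assumptions made only for illustration. Atoms are numbered from 1 in the order they are written in the SMILES, which agrees with the {1,2,4} result above.

```python
import re

# Atom tokens in a SMILES string: bracket atoms first, then two-letter
# organic-subset symbols, then one-letter symbols (lowercase = aromatic).
ATOM_RE = re.compile(r"\[[^\]]+\]|Cl|Br|[BCNOPSFI]|[bcnops]")

# Element symbol at the start of a bracket atom, after an optional isotope.
BRACKET_SYMBOL_RE = re.compile(r"\[\d*([A-Z][a-z]?|[a-z])")


def atom_symbols(smiles):
    """Return the atom symbols of a SMILES in the order they are written."""
    symbols = []
    for token in ATOM_RE.findall(smiles):
        if token.startswith("["):
            token = BRACKET_SYMBOL_RE.match(token).group(1)
        symbols.append(token)
    return symbols


def pattern_symbols(pattern):
    """Parse a single-atom pattern such as 'C' or '[F,Cl,Br,I]' into a set."""
    return set(pattern.strip("[]").split(","))


def toy_list_matches(smiles, pattern):
    """1-based positions of atoms whose symbol is listed in the pattern."""
    wanted = pattern_symbols(pattern)
    return [
        position
        for position, symbol in enumerate(atom_symbols(smiles), start=1)
        if symbol in wanted
    ]


def toy_count_matches(smiles, pattern):
    """Number of atoms matched by the single-atom pattern (possibly 0)."""
    return len(toy_list_matches(smiles, pattern))


print(toy_count_matches("CC(O)C", "C"))   # 3
print(toy_list_matches("CC(O)C", "C"))    # [1, 2, 4]
# Hypothetical test structure with three halogen atoms.
print(toy_count_matches("FC(Cl)(Br)C", "[F,Cl,Br,I]") > 2)  # True
```

Patterns that involve bonds, rings, or more than one atom need a real cheminformatics toolkit behind the database function; this sketch only shows what the returned count and atom list mean.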
7.5 SMILES and SMARTS Quirks
SMILES may be a friend, but like all friends it has quirks that one comes
to accept. The following quirks should be carefully considered before
creating a large database of structures. A simple decision made early in
the design can prevent troublesome changes that might otherwise be required.
7.5.1 Hydrogen Atoms
One important issue is how SMILES and SMARTS process hydrogen atoms.
SMILES is almost always written without explicitly showing the hydrogen
atoms. This is possible because the atoms in almost all organic structures
follow predictable valence and bonding patterns.
For example, propane is CCC. It is possible to write it as
C([H])([H])([H])C([H])([H])C([H])([H])([H]), or even [CH3][CH2][CH3], but
this is almost never done because it is lengthy, requires more computer
processing, and does not provide any more real information than just CCC.
The situation for SMARTS is not that simple.
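The way hydrogen atoms are implied for unbracketed atoms can be sketched in a few lines of Python. The valence table below follows the usual convention for the SMILES organic subset (as documented, for example, in the OpenSMILES specification); it is not taken from this chapter, and the function name implicit_h is hypothetical. The sketch takes the sum of explicit bond orders for each atom as input rather than parsing a SMILES.

```python
# Normal valences assumed for atoms written without brackets
# (the SMILES "organic subset"); an assumption, not from this chapter.
NORMAL_VALENCES = {
    "B": (3,), "C": (4,), "N": (3, 5), "O": (2,),
    "P": (3, 5), "S": (2, 4, 6),
    "F": (1,), "Cl": (1,), "Br": (1,), "I": (1,),
}


def implicit_h(symbol, bond_order_sum):
    """Hydrogens implied for an unbracketed atom with the given explicit bonds.

    The atom is filled up to the lowest normal valence that is at least
    as large as the sum of its explicit bond orders.
    """
    for valence in NORMAL_VALENCES[symbol]:
        if valence >= bond_order_sum:
            return valence - bond_order_sum
    return 0  # explicit bonds already exceed every normal valence


# Propane, CCC: each end carbon has one bond, the middle carbon has two.
propane_bond_sums = [1, 2, 1]
h_counts = [implicit_h("C", total) for total in propane_bond_sums]
bracket_form = "".join(f"[CH{h}]" for h in h_counts)

print(h_counts)      # [3, 2, 3]
print(bracket_form)  # [CH3][CH2][CH3]
```

Because the hydrogen counts follow mechanically from the bonds, writing them out adds length but no information, which is why CCC is the preferred form.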
When CC is used as a SMILES, it means exactly ethane, exactly [CH3][CH3].
When CC is used as a SMARTS, it will of course match ethane,
but will also match any structure having a C-C single bond, regardless
of how many H atoms are also bonded to each C. This may be exactly
what was intended, but SMARTS can be more exact in what is meant.
For example, the SMARTS [CH][CH0] will only match structures having a
C-C single bond where one C has exactly one H atom and the other C has
none. When brackets are used for a C atom in SMILES, the assumptions
normally made about the valence and hydrogen count of the atom are not
used. The SMILES [CH][CH0] is a strange molecule indeed and is likely
an error if it is encountered.
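The difference between the two SMARTS patterns can be illustrated with a small Python sketch. It does not parse SMILES or SMARTS. Each C-C single bond is represented by hand as a pair of hydrogen counts, and each query as a pair of required counts, where None means "any number of H atoms". The function names and the two example molecules (ethane and 2,2,3-trimethylbutane, CC(C)C(C)(C)C) are chosen here only for illustration.

```python
def h_ok(required, actual):
    """A query atom with no H constraint (None) accepts any hydrogen count."""
    return required is None or required == actual


def count_bond_matches(query, bonds):
    """Count C-C single bonds whose two hydrogen counts satisfy the query.

    The query is tried in both orientations, because a pattern such as
    [CH][CH0] may match a bond starting from either end.
    """
    q1, q2 = query
    total = 0
    for h1, h2 in bonds:
        forward = h_ok(q1, h1) and h_ok(q2, h2)
        backward = h_ok(q1, h2) and h_ok(q2, h1)
        if forward or backward:
            total += 1
    return total


# Hydrogen counts on the two carbons of each C-C single bond (hand-assigned).
ETHANE = [(3, 3)]                                   # CC
TRIMETHYLBUTANE = [(3, 1), (1, 3), (1, 0),          # CC(C)C(C)(C)C
                   (0, 3), (0, 3), (0, 3)]

QUERY_CC = (None, None)   # SMARTS CC: no constraint on hydrogens
QUERY_CH_CH0 = (1, 0)     # SMARTS [CH][CH0]: exactly one H, exactly none

print(count_bond_matches(QUERY_CC, ETHANE))               # 1
print(count_bond_matches(QUERY_CH_CH0, ETHANE))           # 0
print(count_bond_matches(QUERY_CC, TRIMETHYLBUTANE))      # 6
print(count_bond_matches(QUERY_CH_CH0, TRIMETHYLBUTANE))  # 1
```

The unconstrained query accepts every C-C single bond, while the bracketed query accepts only the bond between the CH carbon and the carbon that carries no hydrogens.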