7.2 SMILES Representation of Molecular Structure
SMILES (Simplified Molecular Input Line Entry System) was invented by
Weininger [5] to facilitate the representation and manipulation of molecular
structures using computers. It uses standard atomic symbols to represent
atoms and the symbols - for a single bond, = for a double bond, and # for a
triple bond. Hydrogen atoms can be represented explicitly but are almost
always left implicit, following the normal conventions of valence bond
theory. Single bonds need not be written explicitly. For example, propane
is C-C-C or simply CCC. Methylamine is CN, and C#N is hydrogen
cyanide. Propene is C=CC. For more complex structures with branches,
parentheses are used. For example, CC(C)O is isopropyl alcohol,
whereas CCCO is 1-propanol.
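
The notation described above can be illustrated with a small parser. What follows is a minimal sketch, not a full SMILES implementation: it assumes ring-free structures with no bracket atoms or aromatic symbols, and it supports only the elements listed in NORMAL_VALENCE. The names parse_simple_smiles, implicit_hydrogens, and NORMAL_VALENCE are hypothetical and chosen for this illustration.

```python
# Normal valences used to infer implicit hydrogens (illustrative subset only).
NORMAL_VALENCE = {"C": 4, "N": 3, "O": 2, "F": 1, "Cl": 1, "Br": 1, "I": 1}
BOND_ORDER = {"-": 1, "=": 2, "#": 3}


def parse_simple_smiles(smiles):
    """Parse a ring-free SMILES into (atoms, bonds).

    atoms is a list of element symbols; bonds is a list of
    (atom_index, atom_index, bond_order) tuples.
    """
    atoms, bonds = [], []
    branch_stack = []   # atoms to return to when a ')' closes a branch
    previous = None     # index of the atom the next atom will bond to
    pending_order = 1   # single bond unless a bond symbol was just read
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if ch in BOND_ORDER:
            pending_order = BOND_ORDER[ch]
            i += 1
        elif ch == "(":
            branch_stack.append(previous)
            i += 1
        elif ch == ")":
            previous = branch_stack.pop()
            i += 1
        else:
            two = smiles[i:i + 2]
            symbol = two if two in NORMAL_VALENCE else ch
            if symbol not in NORMAL_VALENCE:
                raise ValueError(f"unsupported SMILES character: {ch!r}")
            atoms.append(symbol)
            if previous is not None:
                bonds.append((previous, len(atoms) - 1, pending_order))
            previous = len(atoms) - 1
            pending_order = 1
            i += len(symbol)
    return atoms, bonds


def implicit_hydrogens(atoms, bonds):
    """Return the implied hydrogen count for each atom."""
    used = [0] * len(atoms)
    for a, b, order in bonds:
        used[a] += order
        used[b] += order
    return [NORMAL_VALENCE[s] - u for s, u in zip(atoms, used)]


atoms, bonds = parse_simple_smiles("CC(C)O")   # isopropyl alcohol
print(atoms, bonds, implicit_hydrogens(atoms, bonds))
```

In this sketch, C-C-C and CCC parse to the same atoms and bonds, and the parentheses in CC(C)O attach both the third carbon and the oxygen to the second carbon, whereas CCCO forms a straight chain.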
Notice that there are several ways to write SMILES for the same
structure, even for the simplest ones. For example, hydrogen cyanide
can be written as C#N or N#C, and propene as either C=CC or
CC=C. More complex structures can have many more SMILES
that represent the same structure. If there were one standard way to write
SMILES, then standard SQL text comparisons could be used to locate any
particular structure. SMILES would become a uniquely spelled “name”
for each unique structure. Canonical SMILES does just that. Using rules
about which atoms should come before other atoms in the spelling of each
SMILES, a unique name for each molecular structure can be provided [6].
Once there is a unique, canonical SMILES available, this can be stored
in a text column and a direct lookup for a specific structure can be done
using the SQL = operator. If canonical SMILES is stored in a text column
named cansmi, one can locate isopropyl alcohol using the SQL clause
WHERE cansmi = 'CC(C)O'. Because text columns can be indexed in SQL,
this lookup is very fast, even on large tables. In addition, SQL uniqueness
constraints can be used to enforce data integrity when using canonical SMILES.
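
The lookup and the uniqueness constraint can be tried out with a short script. This is a minimal sketch that assumes SQLite, reached through Python's standard sqlite3 module, as a stand-in for whatever database is actually used. The table name structure and the helper names find_name and insert_is_rejected are hypothetical. The stored strings are assumed to be canonical already, since the sketch does not include a canonicalization program.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE structure (
        id     INTEGER PRIMARY KEY,
        name   TEXT NOT NULL,
        cansmi TEXT NOT NULL UNIQUE  -- UNIQUE is enforced through an index
    )
    """
)

# Assumed to be canonical SMILES; a real system would generate these.
conn.executemany(
    "INSERT INTO structure (name, cansmi) VALUES (?, ?)",
    [
        ("isopropyl alcohol", "CC(C)O"),
        ("1-propanol", "CCCO"),
        ("hydrogen cyanide", "C#N"),
    ],
)


def find_name(connection, cansmi):
    """Return the name stored for an exact canonical SMILES, or None."""
    row = connection.execute(
        "SELECT name FROM structure WHERE cansmi = ?", (cansmi,)
    ).fetchone()
    return row[0] if row else None


def insert_is_rejected(connection, name, cansmi):
    """Return True if the UNIQUE constraint refuses a duplicate structure."""
    try:
        connection.execute(
            "INSERT INTO structure (name, cansmi) VALUES (?, ?)", (name, cansmi)
        )
    except sqlite3.IntegrityError:
        return True
    return False


print(find_name(conn, "CC(C)O"))   # isopropyl alcohol
print(find_name(conn, "N#C"))      # None: same molecule as C#N, other spelling
print(insert_is_rejected(conn, "2-propanol", "CC(C)O"))   # True
```

The second query shows why the canonical form matters: N#C describes hydrogen cyanide just as C#N does, but a plain text comparison finds only the spelling that was stored.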
The rules for canonical SMILES are complex and are not discussed further
here. Many computer programs and structure-drawing applications
recognize and produce SMILES and canonical SMILES, and many
programs can interconvert molecular structure files and SMILES.
To make full use of canonical SMILES in relational tables, however, it
is not sufficient to process SMILES with external programs such as these.
There needs to be a way to integrate SMILES processing into the database
and into SQL itself. This can be accomplished using SQL extensions.
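
To give a feel for what an SQL extension looks like, the sketch below registers a user-defined function that SQL statements can then call by name. It is an illustration under stated assumptions, not the mechanism used in this text: it relies on SQLite's user-defined functions through Python's standard sqlite3 module. The function smiles_atom_count is a hypothetical toy that only counts atom symbols in ring-free SMILES without bracket atoms. A real extension would wrap a full SMILES library.

```python
import re
import sqlite3

# Toy tokenizer: matches atom symbols of a small, bracket-free subset.
ATOM_PATTERN = re.compile(r"Cl|Br|[CNOSPFI]")


def smiles_atom_count(smiles):
    """Count the non-hydrogen atoms written in a simple SMILES string."""
    return len(ATOM_PATTERN.findall(smiles))


conn = sqlite3.connect(":memory:")
# Register the Python function so that SQL statements can call it by name.
conn.create_function("smiles_atom_count", 1, smiles_atom_count)

conn.execute("CREATE TABLE structure (name TEXT, smiles TEXT)")
conn.executemany(
    "INSERT INTO structure VALUES (?, ?)",
    [
        ("propane", "CCC"),
        ("isopropyl alcohol", "CC(C)O"),
        ("hydrogen cyanide", "C#N"),
    ],
)

# The chemistry-aware function is now usable inside an ordinary WHERE clause.
rows = conn.execute(
    "SELECT name FROM structure WHERE smiles_atom_count(smiles) = 4"
).fetchall()
print(rows)   # [('isopropyl alcohol',)]
```

Once such a function is registered, the SMILES processing runs inside the query itself, with no need to export the data to an external program and re-import the results.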
7.3 Extensions to SQL for Chemical Structures
Standard SQL data types, such as integer, float, and text, are useful
for storing scientific data, such as counts, measurements, and names.