complexity and complicate (or even break) existing programs that need to
read and write such files. For example, PDB files can be difficult to read and
write, since there are many flavors of this “standard” file format with vari-
ous additions to satisfy the needs of various computer programs. There
are dozens of other molecular file formats, each with its own conventions.
One solution to the multitude of file formats for molecular structures
is to provide a common program to read and write each file type [1]. A good
example is Babel or OpenBabel [2]. A common data structure, internal to the
program, serves as a hub for storing and processing the molecular struc-
ture. Components can be added to allow new file formats to be read and
written. This approach shares some features with the RDBMS approach.
Each molecular file format corresponds to an external representation of
the molecular structure and the internal data structure corresponds to
the internal representation. In the RDBMS approach, the various file for-
mats are also the external representation of molecular structure, but the
common data structure is a schema with tables holding the molecular
structure information. The purpose of this chapter is to propose ways to
move away from file formats entirely, preserving only the ability to read
file formats for legacy data. A later section of this chapter will show how
molecule tables in an RDBMS can effectively be used instead of molecular
structure files by client programs.
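The hub arrangement described above, with one internal data structure and pluggable readers and writers, can be sketched in a few lines of Python. This is only an illustration and not how Babel or OpenBabel is implemented: the Molecule class, the READERS and WRITERS registries, and both the "toy" and "listing" formats are hypothetical names invented for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Molecule:
    """Hypothetical internal (hub) representation shared by every format."""
    atoms: List[Tuple[str, int]] = field(default_factory=list)        # (symbol, formal charge)
    bonds: List[Tuple[int, int, int]] = field(default_factory=list)   # (atom index, atom index, order)


# Registries: supporting a new format means adding one entry here.
READERS: Dict[str, Callable[[str], Molecule]] = {}
WRITERS: Dict[str, Callable[[Molecule], str]] = {}


def read_toy(text: str) -> Molecule:
    """Read a made-up one-line format such as 'C:0 O:-1 | 0-1:1'."""
    atom_part, _, bond_part = text.partition("|")
    mol = Molecule()
    for token in atom_part.split():
        symbol, charge = token.split(":")
        mol.atoms.append((symbol, int(charge)))
    for token in bond_part.split():
        pair, order = token.split(":")
        first, second = pair.split("-")
        mol.bonds.append((int(first), int(second), int(order)))
    return mol


def write_listing(mol: Molecule) -> str:
    """Write a made-up line-oriented format, one atom or bond per line."""
    lines = [f"ATOM {i} {sym} {chg:+d}" for i, (sym, chg) in enumerate(mol.atoms)]
    lines += [f"BOND {a} {b} {order}" for a, b, order in mol.bonds]
    return "\n".join(lines)


READERS["toy"] = read_toy
WRITERS["listing"] = write_listing


def convert(text: str, source: str, target: str) -> str:
    """Convert between any two registered formats by way of the hub."""
    return WRITERS[target](READERS[source](text))


print(convert("C:0 O:-1 | 0-1:1", "toy", "listing"))
```

Because every format is translated to and from the hub, N formats need N readers and N writers rather than a separate converter for each pair of formats.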
11.3 Molfile and Other Common File Formats
The molfile or sdf file format is a very common way to store molecular
structures. This can be considered as an external representation of a
molecular structure data type. There are many other common file formats
in use and only the essential features common to all of them will be con-
sidered here. The essential aspects of molecular structure contained in
these files are atomic number or atomic symbol, formal atomic charge,
bonded atom pairs, and bond orders. These are the minimum attributes
necessary to define an unambiguous valence bond molecular structure.
Other atom properties, such as atom types, might also occur in these files,
but these are specific to particular modeling programs and will not be dis-
cussed here. Sometimes molecular properties are also stored in these files.
A way to store these properties in relational tables is discussed below.
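As a minimal sketch of how the essential attributes (atomic symbol, formal charge, bonded atom pairs, and bond orders) can be pulled out of a file, the following Python function reads the atom and bond blocks of a V2000 molfile using the fixed column positions of that format. The function name parse_molfile and the sample methoxide record are assumptions made for illustration. The sketch ignores property lines such as "M  CHG", stereochemistry, and the data items of an sdf file.

```python
from typing import List, Tuple

# V2000 atom-block charge codes mapped to formal charges.
# Code 4 marks a doublet radical, treated here as charge 0.
CHARGE_CODES = {0: 0, 1: 3, 2: 2, 3: 1, 4: 0, 5: -1, 6: -2, 7: -3}


def parse_molfile(text: str) -> Tuple[List[Tuple[str, int]], List[Tuple[int, int, int]]]:
    """Return (atoms, bonds) from a V2000 molfile.

    atoms: list of (symbol, formal charge)
    bonds: list of (first atom, second atom, bond order), 1-based as in the file
    """
    lines = text.splitlines()
    counts = lines[3]                      # counts line follows the 3 header lines
    n_atoms = int(counts[0:3])
    n_bonds = int(counts[3:6])

    atoms = []
    for line in lines[4:4 + n_atoms]:
        symbol = line[31:34].strip()       # atom symbol field
        code = int(line[36:39].strip() or "0")  # charge code field
        atoms.append((symbol, CHARGE_CODES[code]))

    bonds = []
    first_bond = 4 + n_atoms
    for line in lines[first_bond:first_bond + n_bonds]:
        bonds.append((int(line[0:3]), int(line[3:6]), int(line[6:9])))
    return atoms, bonds


# Hypothetical sample record: methoxide, hydrogens implicit.
SAMPLE = "\n".join([
    "methoxide",
    "  sketch",
    "",
    "  2  1  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0",
    "    1.2000    0.0000    0.0000 O   0  5  0  0  0",
    "  1  2  1  0",
    "M  END",
])

atoms, bonds = parse_molfile(SAMPLE)
print(atoms)
print(bonds)
```

The two lists returned here are the minimum needed to define a valence bond structure. Everything else in the file, such as coordinates and program-specific atom types, is left out on purpose.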
It would be possible to create tables using columns to store the atomic
symbols and bond information found in molecular structure files, reflect-
ing the column style format of the file itself. Instead, a SMILES representa-
tion of this valence bond information is preferred. SMILES is a compact text
string containing the same information as the columns of atom symbols and
bonds. It can also be used directly in the search functions described in ear-
lier chapters. It is desirable to parse the molecular properties in molecular
structure files in order to store them in data columns for possible searching