Biomedical Engineering Reference
In-Depth Information
result, it could reference the native C++ OpenBabel library in the same way
as the main ChemSpider code - via a wrapper managed C++ assembly
(OBNET), which only exposed functionality required for the Digester and
ChemSpider. The real advantage of OpenBabel being open source is that
source code can be adjusted and the assembly recompiled, allowing
adjustments required to deal with the real ChemDraw files from authors.
These adjustments primarily involved adding new functionality, such as
a new 'splitter' function to split ChemDraw files that contain multiple
'fragment' objects (molecules) into separate ChemDraw files so that each
could be processed separately. Another issue is that the ChemDraw format
supports more features than MOL, so some information is lost in this
conversion. As a result, the ChemDraw reader had to be adjusted to read
in this information and store it in the associated data fields in the SDF file
generated - for example, special bond types are represented by the PubChem
notation. The other, more important example of data lost from the original
ChemDraw file is that of text labels associated with molecules. The difficulty
in this case is to define how to match up a structure with its label. As a first
step, OpenBabel was adjusted to recognize text labels that had been
specifically grouped with a particular structure by the author. However, it
became clear that in practice authors rarely used this grouping feature for
this purpose, so that the vast majority of labels in the figures would be lost.
The ChemDraw Digester incorporates a review step where the digested
information can be reviewed in an editable web page as shown in
Figure 3.4. If a label is wrong or absent, it can be amended, but this is a
time-consuming process, and the ultimate aim for the ChemDraw
Digester is that it could be run as a fully automated process that does not
require human intervention.
As an alternative to manual correction, the OpenBabel source code
was modified to return labels for structures based on proximity, as well
as grouping. A function was added that is called whenever a 'fragment'
object (molecule or atom) is found. The function calculates the distance
between the fragment object and all of the 'text' objects in the file (based
on their 2D coordinates), so that the closest label to it can be identified.
If the distance between the fragment and its closest text was less than the
distance between that same text and any other fragment, then the value
of the 'text' property of the text object (the text in the label) was associated
with the structure and returned in the SDF file produced. Certain checks
were also built in to ignore labels that do not contain any alphanumeric
characters (e.g. '+').
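The proximity rule described above can be illustrated with a short Python sketch. This is not the modified OpenBabel code (which is C++); it is a minimal illustration under stated assumptions: each fragment and text object is reduced to a single 2D point, labels without alphanumeric characters are discarded before the nearest label is sought, and the names Fragment, TextObject and assign_labels are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Fragment:
    name: str  # hypothetical identifier, used only in this sketch
    x: float
    y: float


@dataclass
class TextObject:
    text: str  # the 'text' property of the label
    x: float
    y: float


def distance(ax, ay, bx, by):
    """Euclidean distance between two 2D points."""
    return math.hypot(ax - bx, ay - by)


def has_alphanumeric(text):
    """True if the label contains at least one letter or digit."""
    return any(ch.isalnum() for ch in text)


def assign_labels(fragments, texts):
    """Map each fragment name to its label text, or None if no label qualifies."""
    # Assumption: labels such as '+' are dropped before the search.
    candidates = [t for t in texts if has_alphanumeric(t.text)]
    labels = {}
    for frag in fragments:
        labels[frag.name] = None
        if not candidates:
            continue
        # Closest label to this fragment.
        closest = min(
            candidates, key=lambda t: distance(frag.x, frag.y, t.x, t.y)
        )
        own = distance(frag.x, frag.y, closest.x, closest.y)
        # Accept the label only if every other fragment is farther from it.
        others = [f for f in fragments if f is not frag]
        if all(own < distance(f.x, f.y, closest.x, closest.y) for f in others):
            labels[frag.name] = closest.text
    return labels


fragments = [
    Fragment("mol_a", 0.0, 0.0),
    Fragment("mol_b", 10.0, 0.0),
    Fragment("mol_c", 20.0, 0.0),
]
texts = [
    TextObject("1a", 0.0, -2.0),
    TextObject("+", 5.0, 0.0),
    TextObject("1b", 10.0, -2.0),
]
result = assign_labels(fragments, texts)
```

In this example mol_a and mol_b each receive the label directly beneath them, the '+' sign is ignored, and mol_c receives no label because its nearest text ('1b') lies closer to mol_b.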
The ChemDraw Digester is presently in its final stages of development
and testing; all of the processed structures in the SDF file will be reviewed