Utilizing open source software to facilitate communication of chemistry at RSC - Open Source Software in Life Science Research


result, it could reference the native C++ OpenBabel library in the same way as the main ChemSpider code, via a managed C++ wrapper assembly (OBNET) that exposed only the functionality required for the Digester and ChemSpider. The real advantage of OpenBabel being open source is that the source code can be adjusted and the assembly recompiled, allowing the adjustments required to deal with the real ChemDraw files from authors.

These adjustments primarily involved adding new functionality, such as a new 'splitter' function to split ChemDraw files that contain multiple 'fragment' objects (molecules) into separate ChemDraw files so that each could be processed separately. Another issue is that the ChemDraw format supports more features than MOL, so some information is lost in this conversion. As a result, the ChemDraw reader had to be adjusted to read in this information and store it in the associated data fields in the SDF file generated; for example, special bond types are represented by the PubChem notation. The other, more important, example of data lost from the original ChemDraw file is that of text labels associated with molecules. The difficulty in this case is to define how to match up a structure with its label. As a first step, OpenBabel was adjusted to recognize text labels that had been specifically grouped with a particular structure by the author. However, it became clear that in practice authors rarely used this grouping feature for this purpose, so that the vast majority of labels in the figures would be lost.
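The idea of carrying extra information in SDF data fields can be illustrated with a minimal sketch. It assumes the standard SDF layout, in which data items of the form `> <NAME>` follow the molfile block and the record ends with `$$$$`. The function name `append_sdf_data` and the field name `LABEL` are hypothetical and are not taken from the Digester or from OpenBabel.

```python
def append_sdf_data(mol_block, fields):
    """Build one SDF record from a molfile block plus named data fields.

    Assumption: `mol_block` already ends with the 'M  END' line, and no
    field value contains a blank line (a blank line ends a data item).
    """
    lines = [mol_block.rstrip("\n")]
    for name, value in fields.items():
        lines.append(f"> <{name}>")   # data header line
        lines.append(str(value))      # data value
        lines.append("")              # blank line terminates the data item
    lines.append("$$$$")              # record delimiter
    return "\n".join(lines) + "\n"


# Placeholder molfile block; a real one would hold atom and bond tables.
record = append_sdf_data("(molfile lines)\nM  END\n", {"LABEL": "1a"})
```

Keeping the label in a data field rather than in the molfile itself means any SDF-aware tool can read the structure unchanged and simply ignore fields it does not recognize.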

The ChemDraw Digester incorporates a review step where the digested information can be reviewed in an editable web page, as shown in Figure 3.4. If a label is wrong or absent it can be amended, but this is a time-consuming process, and the ultimate aim for the ChemDraw Digester is that it could be run as a fully automated process that does not require human intervention.

As an alternative to manual correction, the OpenBabel source code was modified to return labels for structures based on proximity, as well as grouping. A function was added which was called when a 'fragment' object (molecule or atom) was found. The function calculates the distance between the fragment object and all of the 'text' objects in the file (based on their 2D coordinates), so that the closest label to it could be identified. If the distance between the fragment and its closest text was less than the distance between that same text and any other fragment, then the value of the 'text' property of the text object (the text in the label) was associated with the structure and returned in the SDF file produced. Certain checks were also built in to ignore labels that do not contain any alphanumeric characters (e.g. '+').
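The proximity rule described above can be sketched roughly as follows. This is an illustration only, not the modified OpenBabel code: the `Fragment` and `TextLabel` classes, the `assign_labels` function, and the use of a single (x, y) point per object are all assumptions, as is the choice to discard non-alphanumeric labels before searching for the nearest one.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    name: str   # hypothetical identifier for a structure
    x: float
    y: float


@dataclass(frozen=True)
class TextLabel:
    text: str
    x: float
    y: float


def distance(a, b):
    """Euclidean distance between two objects with 2D coordinates."""
    return math.hypot(a.x - b.x, a.y - b.y)


def has_alphanumeric(text):
    """True if the label holds at least one letter or digit ('+' does not)."""
    return any(ch.isalnum() for ch in text)


def assign_labels(fragments, labels):
    """Map fragment names to label text using the mutual-proximity rule."""
    candidates = [lab for lab in labels if has_alphanumeric(lab.text)]
    assigned = {}
    if not candidates:
        return assigned
    for frag in fragments:
        nearest = min(candidates, key=lambda lab: distance(frag, lab))
        d = distance(frag, nearest)
        # Accept only if every other fragment is strictly farther from
        # this label than the current fragment is.
        others = (f for f in fragments if f is not frag)
        if all(distance(other, nearest) > d for other in others):
            assigned[frag.name] = nearest.text
    return assigned


fragments = [Fragment("mol_a", 0.0, 0.0), Fragment("mol_b", 10.0, 0.0)]
labels = [
    TextLabel("1a", 0.0, -2.0),
    TextLabel("+", 5.0, 0.0),     # ignored: no alphanumeric characters
    TextLabel("1b", 10.0, -2.0),
]
result = assign_labels(fragments, labels)
```

In this example each structure is paired with the label drawn directly beneath it, and the '+' between the two structures is skipped. A fragment whose nearest label sits closer to some other fragment is left unlabelled, which mirrors the condition in the text.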

The ChemDraw Digester is presently in its final stages of development and testing; all of the processed structures in the SDF file will be reviewed
