Utilizing open source software to facilitate communication of chemistry at RSC - Open Source Software in Life Science Research

1. a means of extracting chemical names from text and converting them to electronic structure formats;

2. a means of displaying the resulting electronic structure diagrams in an interface for the users;

3. a means of storing those chemicals separately from the article XML;

4. a means of finding non-structural chemical and biomedical terms in the text.

Chemical names commonly contain punctuation, for example

[2-({4-[(4-fluorobenzyl)oxy]phenyl}sulfonyl)-1,2,3,4-tetrahydroisoquinolin-3-yl](oxo)acetic acid,

or spaces, like diethyl methyl bismuth, or both, and

hence cause significant problems for natural language processing code

that has been written to handle newswire text or biomedical articles. For

this reason, the Sciborg project [6] required code that would identify

chemical names so that they would not interfere with further downstream

processing of text. Fortunately, a method for extracting chemical

structures out of text was already available. The OSCAR software

provided a collection of open source code components to meet the

explicitly chemical requirements of the Sciborg project. It delivered

components that determined whether text was chemical or not, RESTful

web services for the Chemistry Development Kit (CDK) [10], routines for

training language models, and, importantly, the OPSIN parser [11],

which lexes candidate strings of text and generates the corresponding

chemical structures. The original version of OPSIN produced in 2006

had numerous gaps but was still powerful enough to identify many

chemicals. We also used the ChEBI database [12] as the basis of a chemical

dictionary.
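
The interference problem described above can be illustrated with a small sketch. It assumes a tiny hand-written dictionary of names (the text used ChEBI as the basis of the real dictionary) and a deliberately naive tokenizer. The function names and the placeholder scheme are hypothetical, and plain dictionary lookup stands in for OSCAR's trained models; this is not OSCAR's or OPSIN's actual algorithm.

```python
import re

# Hypothetical mini-dictionary standing in for a ChEBI-derived one.
CHEMICAL_NAMES = [
    "[2-({4-[(4-fluorobenzyl)oxy]phenyl}sulfonyl)"
    "-1,2,3,4-tetrahydroisoquinolin-3-yl](oxo)acetic acid",
    "diethyl methyl bismuth",
]


def naive_tokenize(text):
    """Split on whitespace and punctuation, as a newswire-style tokenizer might."""
    return re.findall(r"[A-Za-z0-9]+", text)


def protect_chemical_names(text, names=CHEMICAL_NAMES):
    """Replace each known chemical name with an opaque placeholder token."""
    mapping = {}
    # Longest names first, so a short name never clobbers part of a longer one.
    for i, name in enumerate(sorted(names, key=len, reverse=True)):
        placeholder = f"CHEM{i}"
        if name in text:
            text = text.replace(name, placeholder)
            mapping[placeholder] = name
    return text, mapping


def tokenize_with_protection(text, names=CHEMICAL_NAMES):
    """Tokenize after masking, then restore each chemical name as one token."""
    masked, mapping = protect_chemical_names(text, names)
    return [mapping.get(token, token) for token in naive_tokenize(masked)]


if __name__ == "__main__":
    sentence = "Treatment with diethyl methyl bismuth gave the product."
    print(naive_tokenize(sentence))
    print(tokenize_with_protection(sentence))
```

With the naive tokenizer alone, the name with spaces is split into three unrelated words and the bracketed name is shredded into many fragments; with masking, each name reaches downstream processing as a single token. A real system would also need to recognise names absent from the dictionary, which is where a name parser such as OPSIN comes in.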

In order to display extracted chemical structures, the CDK was used

via OSCAR. Although the relevant routines were not entirely reliable

and, specifically, did not handle stereochemistry, they were good enough

to demonstrate the principle. Following the introduction of the

International Chemical Identifier (InChI) [13], and clear interest by

various members of the publishing industry and software vendors in

supporting the new standard, it was decided to store the connection

tables as InChIs. The InChI code is controlled open source: the source is

open, but it is presently developed only as a single trunk of code by one development

team. The structures were stored as InChIs both in the article XML and

in a SQL Server database. For the non-structural chemical and biomedical

terms contained within the text, resources that were accessible to the

casual reader were identified as being most appropriate. This application

was launched with the IUPAC Gold Book [14], which had recently been
