Biomedical Engineering Reference
In-Depth Information
1. a means of extracting chemical names from text and converting them
to electronic structure formats;
2. a means of displaying the resulting electronic structure diagrams in
an interface for the users;
3. a means of storing those chemicals separately from the article XML;
4. a means of finding non-structural chemical and biomedical terms in
the text.
Chemical names commonly contain punctuation, for example [2-({4-
[(4-fluorobenzyl)oxy]phenyl}sulfonyl)-1,2,3,4-tetrahydroisoquinolin-
3-yl](oxo)acetic acid, or spaces, like diethyl methyl bismuth, or both, and
hence cause significant problems for natural language processing code
that has been written to handle newswire text or biomedical articles. For
this reason, the Sciborg project [6] required code that would identify
chemical names so that they would not interfere with further downstream
processing of text. Fortunately, a method for extracting chemical
structures out of text was already available. The OSCAR software
provided a collection of open source code components to meet the
explicitly chemical requirements of the Sciborg project. It delivered
components that determined whether text was chemical or not, RESTful
web services for the Chemistry Development Kit (CDK) [10], routines for
training language models, and, importantly, the OPSIN parser [11],
which lexes candidate strings of text and generates the corresponding
chemical structures. The original version of OPSIN produced in 2006
had numerous gaps but was still powerful enough to identify many
chemicals. We also used the ChEBI database [12] as the basis of a chemical
dictionary.
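
The tokenisation problem described above can be illustrated with a
minimal sketch. It is not the OSCAR or OPSIN implementation: the
functions naive_tokenize and mask_chemicals, the CHEM placeholder scheme,
and the two-entry name list are all hypothetical and stand in for a real
chemical name recogniser and dictionary. The sketch only shows why a
tokeniser written for newswire text fragments chemical names, and how
replacing recognised names with placeholders keeps them intact for
downstream processing.

```python
import re

# Hypothetical dictionary of known chemical names (illustrative only;
# a real system would use a recogniser such as OSCAR and a resource
# such as ChEBI rather than a hard-coded list).
KNOWN_NAMES = [
    "[2-({4-[(4-fluorobenzyl)oxy]phenyl}sulfonyl)-1,2,3,4-"
    "tetrahydroisoquinolin-3-yl](oxo)acetic acid",
    "diethyl methyl bismuth",
]


def naive_tokenize(text):
    """Split on whitespace and punctuation, as a simple newswire
    tokeniser might."""
    return re.findall(r"[A-Za-z0-9]+", text)


def mask_chemicals(text, names=KNOWN_NAMES):
    """Replace each known chemical name with a placeholder token.

    Returns the masked text and a mapping from placeholder to name,
    so the names can be restored after downstream processing.
    """
    mapping = {}
    # Try the longest names first so a short name never clips a longer one.
    for i, name in enumerate(sorted(names, key=len, reverse=True)):
        placeholder = f"CHEM{i}"
        if name in text:
            text = text.replace(name, placeholder)
            mapping[placeholder] = name
    return text, mapping


sentence = "Treatment with diethyl methyl bismuth gave the product."

# Without masking, the chemical name is split into three unrelated tokens.
print(naive_tokenize(sentence))

# With masking, the name survives as a single placeholder token.
masked, mapping = mask_chemicals(sentence)
print(naive_tokenize(masked))
print(mapping)
```

Exact string matching is used here only to keep the sketch short; it
would miss any name absent from the list, which is the gap that a parser
able to interpret unseen systematic names is meant to fill.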
In order to display extracted chemical structures, the CDK was used
via OSCAR. Although the relevant routines were not entirely reliable
and, specifically, did not handle stereochemistry, they were good enough
to demonstrate the principle. Following the introduction of the
International Chemical Identifier (InChI) [13], and clear interest by
various members of the publishing industry and software vendors in
supporting the new standard, it was decided to store the connection
tables as InChIs. The InChI code is controlled open source: the source
is open, but it is presently developed only as a single trunk of code by
one development team. The structures were stored as InChIs both in the
article XML and
in a SQL Server database. For the non-structural chemical and biomedical
terms contained within the text, resources that were accessible to the
casual reader were identified as being most appropriate. This application
was launched with the IUPAC Gold Book [14], which had recently been