time to the sifting of journals, the prospect of automated literature mining is alluring,
even if the process only generates a list of candidate data which must be further
refined manually.
For reasons of computational feasibility, text mining is often performed only
on the abstracts of journal articles, although the continued increase in computational
power, and particularly the availability of eScience approaches such as Cloud
computing (discussed in more detail below), is rapidly making it more feasible to search
large corpora of entire articles. Abstracts also have the advantage of usually being
freely available and relatively easy to download, which makes the generation of text
mining datasets relatively straightforward. Moreover, running an exhaustive analysis
once and storing the results in a database reduces the demand for compute
resources: the text mining algorithms then need only be re-run on
new articles as they become available. This approach has been applied to the creation
of knowledge bases on topics such as the identification of bacterial enteropathogens
(Zaremba et al., 2009), surveillance of the literature in the service of infectious
disease control (Sintchenko et al., 2009), toxin-antitoxin loci in bacteria and archaea
(Shao et al., 2011), information about integrative and conjugative elements found
in bacteria (Bi et al., 2012), and many others.
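The run-once-and-store strategy described above can be sketched as follows. This is a minimal illustration only, assuming an SQLite table as the results store. The function names (mine_abstract, update_store), the table layout and the toy "analysis" (a token count) are all invented for the example and do not correspond to any of the cited knowledge bases.

```python
import sqlite3


def mine_abstract(text):
    """Placeholder analysis (assumption): count whitespace-separated tokens."""
    return len(text.split())


def update_store(conn, articles):
    """Mine only the articles not yet in the store; return the newly processed IDs."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS results ("
        "article_id TEXT PRIMARY KEY, token_count INTEGER)"
    )
    seen = {row[0] for row in conn.execute("SELECT article_id FROM results")}
    new_ids = []
    for article_id, abstract in articles.items():
        if article_id in seen:
            continue  # already analysed in an earlier run, so skip the expensive step
        conn.execute(
            "INSERT INTO results VALUES (?, ?)",
            (article_id, mine_abstract(abstract)),
        )
        new_ids.append(article_id)
    conn.commit()
    return new_ids


# Hypothetical article IDs and abstracts, used only to exercise the sketch.
conn = sqlite3.connect(":memory:")
first = update_store(conn, {"a1": "toxin antitoxin loci",
                            "a2": "conjugative elements in bacteria"})
second = update_store(conn, {"a1": "toxin antitoxin loci",
                             "a3": "enteropathogen identification"})
```

On the second call only the unseen article is analysed, so the cost of each update scales with the number of new articles rather than with the size of the whole corpus.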
The range of algorithms applied to text mining in biology, and the details of their
use, are more than extensive enough to warrant a review of their own, and there have
been several useful contributions (Zweigenbaum et al., 2007; Evans and Rzhetsky,
2011; Ceci et al., 2012).
The simplest approach to text mining is the keyword search. Although this
method is trivially easy to implement, it generally performs poorly on molecular
biology articles because of the complexity and redundancy of the terminology used in
the biomedical literature. The same concepts may be referred to in multiple ways,
while multiple concepts may have the same name. This problem is particularly acute
when it comes to gene identifiers; every database has its own form of identifier, and a
paper may use any of the many “standards”. Gene names may change over time, and
older papers often use different aliases from newer ones.
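The alias problem can be made concrete with a small sketch. The gene names, the alias table and the two abstracts below are invented for illustration. The sketch assumes that a curated mapping from a canonical name to its aliases is available, which is itself a non-trivial requirement in practice.

```python
import re

# Hypothetical alias table: canonical gene name -> names seen in the literature.
ALIASES = {
    "geneA": {"geneA", "abcX", "orf12"},
}


def keyword_hits(term, abstracts):
    """Naive keyword search: IDs of abstracts containing the term as a whole word."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b")
    return [aid for aid, text in abstracts.items() if pattern.search(text)]


def alias_hits(canonical, abstracts):
    """Search for the canonical name or any of its known aliases."""
    names = sorted(ALIASES.get(canonical, {canonical}))
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(n) for n in names) + r")\b")
    return [aid for aid, text in abstracts.items() if pattern.search(text)]


# Invented abstracts: the older paper uses an alias, the newer one the current name.
abstracts = {
    "old_paper": "Deletion of abcX abolished toxin production.",
    "new_paper": "Expression of geneA was induced under stress.",
}

plain = keyword_hits("geneA", abstracts)
expanded = alias_hits("geneA", abstracts)
```

The plain keyword search finds only the newer paper, whereas the alias-expanded search finds both. Alias expansion addresses only synonyms; it does nothing for the converse problem of one name referring to several concepts.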
One approach to dealing with the retrieval of information from sources that may
use different terminology for the same concepts is the use of MeSH (Medical Subject
Headings).[8] MeSH is a controlled vocabulary used for indexing articles in PubMed.
The terms are organised hierarchically, so that a user can search at different levels of
specificity (Figure 2.15).
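Searching at different levels of a hierarchy can be sketched with dot-separated tree numbers, in which a term's number begins with the number of its parent. The tree numbers and article annotations below are invented for illustration and are not real MeSH values. The sketch shows only the general idea and is not how PubMed implements its search.

```python
# Toy index (illustrative values only): article ID -> tree numbers of its indexing terms.
ARTICLE_TERMS = {
    "a1": ["X01.100.200"],  # most specific term
    "a2": ["X01.100"],      # its parent term
    "a3": ["X02.300"],      # a term in an unrelated branch
}


def is_descendant_or_self(tree_number, query):
    """True if tree_number equals query or lies below it in the hierarchy."""
    return tree_number == query or tree_number.startswith(query + ".")


def search(query):
    """Return the sorted IDs of articles indexed at or below the queried node."""
    return sorted(
        aid
        for aid, numbers in ARTICLE_TERMS.items()
        if any(is_descendant_or_self(n, query) for n in numbers)
    )


broad = search("X01")            # broad query: matches the whole branch
narrow = search("X01.100.200")   # narrow query: matches only the specific term
```

Appending "." before the prefix test prevents a query such as "X01.1" from matching "X01.100" by accident.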
More complex approaches use semantics to try to parse papers into linguistically
meaningful units. The most widely used approach to literature mining is Natural
Language Processing (NLP), which has been an active area of research for several
decades (Manning and Schutze, 1999). NLP has three parts: information retrieval,
assignment of semantics and information extraction.
[8] http://www.ncbi.nlm.nih.gov/mesh/