Data mining for microbiologists - Methods in Microbiology

time to the sifting of journals, the prospect of automated literature mining is alluring,

even if the process only generates a list of candidate data which must be further

refined manually.

For the purpose of computational feasibility, text mining is often performed only

on the abstracts of journal articles, although the continued increase in computational

power, and particularly the availability of eScience approaches such as cloud

computing (discussed in more detail below), is rapidly making it more feasible to search

large corpora of entire articles. Abstracts also have the advantage of usually being

freely available and relatively easy to download, making the generation of text

mining datasets relatively straightforward. Moreover, running an exhaustive analysis

once and storing the results in a database can also reduce the demand for compute

resources for text mining. The text mining algorithms then need only be re-run on

new articles as they become available. This approach has been applied to the creation

of knowledge bases on topics such as the identification of bacterial enteropathogens

(Zaremba et al., 2009), surveillance of the literature in the service of infectious

disease control (Sintchenko et al., 2009), toxin-antitoxin loci in bacteria and archaea

(Shao et al., 2011), information about integrative and conjugative elements found

in bacteria (Bi et al., 2012), and many others.
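
The run-once, store, and update strategy described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the table names, the `mine_abstract` placeholder and its watch list of terms are all hypothetical, and a real pipeline would substitute a proper text mining algorithm and a persistent database file.

```python
import sqlite3

def mine_abstract(text):
    """Placeholder miner (hypothetical): return watch-list terms found in the text."""
    terms = ("toxin-antitoxin", "conjugative element")  # illustrative watch list
    lowered = text.lower()
    return [t for t in terms if t in lowered]

def update_knowledge_base(conn, abstracts):
    """Mine only abstracts whose IDs are not yet stored; return how many were new."""
    conn.execute("CREATE TABLE IF NOT EXISTS processed (article_id TEXT PRIMARY KEY)")
    conn.execute("CREATE TABLE IF NOT EXISTS hits (article_id TEXT, term TEXT)")
    new_count = 0
    for article_id, text in abstracts.items():
        seen = conn.execute(
            "SELECT 1 FROM processed WHERE article_id = ?", (article_id,)
        ).fetchone()
        if seen:
            continue  # already mined on an earlier run, so skip it
        for term in mine_abstract(text):
            conn.execute("INSERT INTO hits VALUES (?, ?)", (article_id, term))
        conn.execute("INSERT INTO processed VALUES (?)", (article_id,))
        new_count += 1
    conn.commit()
    return new_count

# An in-memory database keeps the example self-contained.
conn = sqlite3.connect(":memory:")
first_batch = {
    "a1": "A toxin-antitoxin locus was identified in the plasmid.",
    "a2": "Growth curves were measured at 37 degrees.",
}
first_run = update_knowledge_base(conn, first_batch)

# Later, one new article appears; only that article is mined.
second_batch = dict(first_batch, a3="An integrative and conjugative element was found.")
second_run = update_knowledge_base(conn, second_batch)
print(first_run, second_run)
```

The second call does work only for the one unseen article, which is the saving in compute resources that the stored-results approach is meant to provide.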

The range of algorithms applied to text mining in biology, and the details of their

use, are more than extensive enough to warrant a review of their own, and there have

been several useful contributions (Zweigenbaum et al., 2007; Evans and Rzhetsky,

2011; Ceci et al., 2012).

The simplest approach to text mining is the keyword search. Although this

method is trivially easy to implement, it generally performs poorly on molecular

biology articles, due to the complexity and redundancy in the terminology used in

the biomedical literature. The same concepts may be referred to in multiple ways,

while multiple concepts may have the same name. This problem is particularly acute

when it comes to gene identifiers; every database has its own form of identifier, and a

paper may use any of the many “standards”. Gene names may change over time, and

older papers often use different aliases from newer ones.
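
The weakness of a plain keyword search when one gene has several names can be shown with a small Python sketch. The alias table and the three abstracts are illustrative assumptions, not data from any database; the point is only that a search on one name misses a paper that uses another, while a search over a set of known aliases recovers it.

```python
import re

# Illustrative alias table (an assumption for this sketch, not a database export).
ALIASES = {
    "rpoS": {"rpos", "katf", "sigma-38"},
}

def keyword_search(abstracts, keyword):
    """Naive search: IDs of abstracts containing the keyword as a whole word."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    return [aid for aid, text in abstracts.items() if pattern.search(text)]

def alias_search(abstracts, canonical):
    """Search for any known alias of the canonical gene name."""
    names = ALIASES.get(canonical, {canonical.lower()})
    alternatives = "|".join(re.escape(n) for n in sorted(names))
    pattern = re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)
    return [aid for aid, text in abstracts.items() if pattern.search(text)]

abstracts = {
    "p1": "Expression of rpoS increased in stationary phase.",
    "p2": "The katF mutant was sensitive to peroxide.",
    "p3": "No sigma factor was examined.",
}
naive_hits = keyword_search(abstracts, "rpoS")
alias_hits = alias_search(abstracts, "rpoS")
print(naive_hits, alias_hits)
```

An alias table of this kind addresses only the case of one concept with many names; the reverse problem, where one name refers to several concepts, needs contextual information that a dictionary lookup cannot supply.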

One approach to dealing with the retrieval of information from sources that may

use different terminology for the same concepts is the use of MeSH (Medical Subject

Headings). 8 MeSH is a controlled vocabulary, used for indexing articles in PubMed.

The terms are organised hierarchically, so that a user can search at different levels of

specificity (Figure 2.15).
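
The effect of a hierarchical vocabulary on searching can be sketched as follows. This is a toy model only: the dotted codes are invented for illustration and are not real MeSH tree numbers, and the article annotations are hypothetical. It assumes each article is indexed with one or more codes, so that a search on a broad heading also returns articles indexed under any of its descendants.

```python
# Toy hierarchy: the codes are invented and are NOT real MeSH tree numbers.
TREE = {
    "X01": "Bacteria",
    "X01.100": "Gram-Negative Bacteria",
    "X01.100.200": "Escherichia coli",
    "X01.300": "Gram-Positive Bacteria",
    "X01.300.400": "Staphylococcus aureus",
}

# Hypothetical index: article ID -> set of codes assigned to that article.
ARTICLES = {
    "p1": {"X01.100.200"},
    "p2": {"X01.300.400"},
    "p3": {"X01.100"},
}

def code_for(heading):
    """Look up the code of a heading by its name."""
    for code, name in TREE.items():
        if name == heading:
            return code
    raise KeyError(heading)

def search(heading):
    """Return articles indexed under the heading or any of its descendants."""
    root = code_for(heading)
    return sorted(
        aid
        for aid, codes in ARTICLES.items()
        if any(c == root or c.startswith(root + ".") for c in codes)
    )

broad = search("Bacteria")
middle = search("Gram-Negative Bacteria")
narrow = search("Escherichia coli")
print(broad, middle, narrow)
```

Moving up the hierarchy widens the result set and moving down narrows it, which is the behaviour a user relies on when searching at different levels of specificity.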

More complex approaches use semantics to try to parse papers into linguistically

meaningful units. The most widely used approach to literature mining is Natural

Language Processing (NLP), which has been an active area of research for several

decades (Manning and Schütze, 1999). NLP has three parts: information retrieval,

assignment of semantics and information extraction.
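
The three parts named above can be illustrated with a deliberately simplistic Python sketch. It is not how production NLP systems work, since those rely on statistical or grammatical models rather than a fixed word list; the lexicon, the example sentences and the function names are all assumptions made for illustration.

```python
import re

# Hypothetical lexicon mapping lower-cased tokens to semantic classes.
LEXICON = {
    "hipa": "GENE",
    "hipb": "GENE",
    "inhibits": "RELATION",
    "activates": "RELATION",
}

def retrieve(documents, query):
    """Information retrieval: keep documents that mention the query term."""
    q = query.lower()
    return [d for d in documents if q in d.lower()]

def assign_semantics(sentence):
    """Assignment of semantics: tag each token with a class, or 'O' for other."""
    tokens = re.findall(r"[A-Za-z0-9]+", sentence)
    return [(tok, LEXICON.get(tok.lower(), "O")) for tok in tokens]

def extract(tagged):
    """Information extraction: collect GENE RELATION GENE token triples."""
    triples = []
    for i in range(len(tagged) - 2):
        (a, tag_a), (r, tag_r), (b, tag_b) = tagged[i:i + 3]
        if (tag_a, tag_r, tag_b) == ("GENE", "RELATION", "GENE"):
            triples.append((a, r, b))
    return triples

documents = [
    "HipB inhibits HipA in Escherichia coli.",
    "Growth was measured in rich medium.",
]
relevant = retrieve(documents, "hipa")
tagged = assign_semantics(relevant[0])
triples = extract(tagged)
print(triples)
```

Each stage narrows the material handed to the next: retrieval selects documents, semantic assignment labels their tokens, and extraction turns labelled tokens into structured facts that could be stored in a knowledge base.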

8 http://www.ncbi.nlm.nih.gov/mesh/
