CHAPTER 2
J.S. Hallinan 1
School of Computing Science and Centre for Bacterial Cell Biology, Newcastle University,
Newcastle upon Tyne, United Kingdom
1 Corresponding author: e-mail address: j.s.hallinan@ncl.ac.uk
Data mining for microbiologists
1 INTRODUCTION
Microbiologists are drowning in data. Gigabytes of genomic, transcriptomic, proteomic,
metabolic and interaction data are produced every day. Most of this data is eventually
deposited in freely accessible databases, of which there are many: the 2012 Nucleic
Acids Research Database issue reports on 1380 active online databases containing
information about everything from the composition of DNA sequences to the details of
protein complex formation (Galperin and Fernández-Suárez, 2012).
Much of this data does not make it into the peer-reviewed literature. High-throughput
experiments generate a considerable amount of data that is not of primary interest to
the research group carrying out the experiments. For example, a microarray experiment
can provide a snapshot of mRNA levels for every gene in a genome, but in general only
those genes identified as being significantly up- or down-regulated are of interest in
the research domain of the authors. The rest of the data is usually deposited in an
appropriate database, but will not be published in the peer-reviewed literature.
For those interested in the information captured in unpublished datasets, exploring
each database individually is prohibitively time-consuming. Even if each database
search takes only 5 min, finding all of the stored information about a single gene in
those 1380 databases would take around 115 h of work. And that is without even
considering the additional time required to read and assimilate the relevant
literature.
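The 115 h figure follows from simple arithmetic: 1380 databases at 5 min each is 6900 min, or 115 h. A minimal sketch of that calculation is shown here; the function name and parameters are illustrative assumptions, not part of the original text.

```python
def total_search_hours(n_databases: int, minutes_per_search: float) -> float:
    """Return the total time, in hours, needed to query every database once."""
    return n_databases * minutes_per_search / 60


# Figures taken from the text: 1380 databases, 5 min per search.
hours = total_search_hours(1380, 5)
print(f"{hours:.0f} h")  # 1380 * 5 / 60 = 115
```

Changing either argument shows how quickly the cost grows: the estimate scales linearly with both the number of databases and the time per search.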
Despite this embarrassment of information riches, a significant proportion of genes in
most species are unannotated, or annotated only on the basis of similarity to other
genes. Probably the best-studied microbe is the baker's yeast Saccharomyces
cerevisiae, the first eukaryote to be sequenced (Goffeau et al., 1996). S. cerevisiae
has around 6200 open reading frames (ORFs). As of May 2012 the Munich Information
Center for Protein Sequences (MIPS) database¹ lists 661 of these as
1 http://mips.helmholtz-muenchen.de/genre/proj/yeast/
 
 
 