Several groups have applied clustering to metagenomic data. When metagenomic
data are clustered, proteins can be grouped together either on the basis of
their domains (Corpet et al., 1998; Bateman et al., 2004) or their full sequences
(Haft et al., 2003; Yooseph et al., 2007). Filtering can then be applied to the clusters
to eliminate high-distance links within clusters and detect spurious ORFs (Yooseph
et al., 2008). Protein function is then predicted using a guilt-by-association
approach, in which an uncharacterized protein is assigned the function of the
annotated proteins it clusters with.
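The guilt-by-association step can be illustrated with a minimal sketch. It assumes the clusters have already been computed and filtered, and it transfers the majority annotation within each cluster to that cluster's unannotated members. The function name predict_functions, the min_support threshold, and the toy protein identifiers and annotations are all hypothetical, chosen for illustration only; they do not come from the cited studies.

```python
from collections import Counter


def predict_functions(clusters, annotations, min_support=0.5):
    """Transfer the majority annotation of each cluster to its unannotated members.

    clusters:    list of lists of protein identifiers (assumed precomputed)
    annotations: dict mapping a protein identifier to a known function
    min_support: hypothetical cutoff; the majority label must cover at least
                 this fraction of the annotated members to be transferred
    """
    predictions = {}
    for members in clusters:
        labels = [annotations[p] for p in members if p in annotations]
        if not labels:
            continue  # no annotated member, so there is nothing to transfer
        function, count = Counter(labels).most_common(1)[0]
        if count / len(labels) < min_support:
            continue  # annotations within the cluster disagree too much
        for p in members:
            if p not in annotations:
                predictions[p] = function
    return predictions


# Toy input: three clusters, four proteins with known functions.
clusters = [["p1", "p2", "p3", "p4"], ["p5", "p6"], ["p7", "p8"]]
annotations = {
    "p1": "ABC transporter",
    "p2": "ABC transporter",
    "p3": "kinase",
    "p5": "protease",
}
predicted = predict_functions(clusters, annotations)
print(predicted)
```

In this toy input, p4 and p6 inherit the dominant annotation of their clusters, while p7 and p8 remain unpredicted because their cluster contains no annotated protein. A real pipeline would probably weight votes by sequence similarity rather than counting them equally.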
8 DATA MINING WITH MICROBIAL DATA: PRACTICAL ISSUES
There is an enormous amount of microbiological data already in existence, with more
constantly generated by ever-improving and expanding technologies. Data mining
clearly has the potential to identify significant trends and patterns in these data.
However, biological data in general, and microbiological data in particular, pose
distinctive problems for data miners.
8.1 Noise
Biological data are inherently noisy. Biological noise arises from a wide range of
sources, not all of which are understood. It may be intrinsic, due to
stochasticity in processes such as transcription and translation, or extrinsic, arising
from fluctuations in the environment (Elowitz, 2002; Swain et al., 2002).
Intrinsic noise can arise in gene regulation, for example, when the chance of a TF
binding depends upon the number of TF molecules in the cell, meaning transcription
tends to occur in bursts, rather than as a consistent, predictable process (McAdams
and Arkin, 1997). More than 80% of the genes in the E. coli chromosome express
fewer than a hundred copies each of their protein products per cell (Guptasarma,
1995). The rates of transcription, translation, modification or degradation of RNAs
and proteins vary for different gene products (Newman et al., 2006). The number of
mRNA molecules in a cell is thus variable, even under the same environmental and
genetic conditions; and of course, genetically identical cells are exposed to subtly
different extrinsic factors: micro-fluctuations in temperature, pH, nutrient
availability and crowding, even under apparently identical experimental conditions.
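To make the idea of bursty transcription concrete, the following sketch simulates a population of genetically identical cells with a stochastic (Gillespie-style) simulation of a two-state promoter model. This model is a common simplification rather than the method of the studies cited above: the promoter switches randomly between an off and an on state (standing in for TF unbinding and binding), mRNA is made only while the promoter is on, and each mRNA decays at a constant rate. The function name simulate_cell and all rate constants are arbitrary assumptions chosen so that bursts are visible; they are not measured values.

```python
import random
import statistics


def simulate_cell(rng, k_on, k_off, k_tx, k_deg, t_end):
    """Simulate one cell up to t_end and return its final mRNA count."""
    t, promoter_on, mrna = 0.0, False, 0
    while True:
        rates = [
            k_off if promoter_on else k_on,  # promoter toggles its state
            k_tx if promoter_on else 0.0,    # transcription only while on
            k_deg * mrna,                    # first-order mRNA decay
        ]
        total = sum(rates)
        t += rng.expovariate(total)          # waiting time to the next event
        if t > t_end:
            return mrna
        r = rng.random() * total             # choose which event fires
        if r < rates[0]:
            promoter_on = not promoter_on
        elif r < rates[0] + rates[1]:
            mrna += 1
        elif mrna > 0:                       # guard against rounding at the boundary
            mrna -= 1


# Hypothetical rate constants (arbitrary time units), identical for every cell.
rng = random.Random(0)
counts = [
    simulate_cell(rng, k_on=0.05, k_off=0.5, k_tx=10.0, k_deg=0.1, t_end=100.0)
    for _ in range(500)
]
mean = statistics.mean(counts)
variance = statistics.pvariance(counts)
print(f"mean mRNA per cell: {mean:.1f}, variance: {variance:.1f}")
```

Every simulated cell has the same parameters, yet the final mRNA counts differ widely from cell to cell. In this model the variance exceeds the mean, which is the signature of bursting; a process that produced mRNA at a steady rate would give a variance roughly equal to the mean.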
Implicit in the analysis of most microarray data is the general assumption that
mRNA levels are directly correlated with protein levels; however, this is not always
the case. These factors, and many others, lead to variability in the numbers of the
specific biomolecules that are physically present in cells. In addition to biological
sources, measurement processes introduce variability into data. No laboratory
equipment, including the human eye and brain, performs with 100% accuracy.
Microbes deal with noise either by overcoming its effects (You et al., 2004;
Austin et al., 2006), or by incorporating it into day-to-day life (Ross et al., 1994;
Maheshri and O'Shea, 2007). Consequently, when a data miner considers a
biological dataset, it is never clear to what extent the noise is important. However, many