In this section, we discuss some approaches that have been proposed for
handling one of the above issues, namely preprocessing of biological datasets to
reduce their incompleteness and the noise contained in them. We also discuss
some further research issues that should be addressed in this direction.
8.4.1 Preprocessing of Biological Datasets to Enhance Function Prediction
Biological datasets usually contain significant amounts of noise, 51 which may
hamper their use for making accurate inferences about biological processes and
protein function. This noise may arise from inaccuracies in the experimental
methods used to generate the data, or from the subsequent analysis methods
used to convert the generated data into a more usable form. The problem of
noise is particularly pronounced for data generated by high-throughput
experimental methods, such as protein interaction and microarray data. For
instance, a recent survey of the yeast and human interactomes reported that
most of the available protein interaction datasets have an exceedingly high
fraction of false positive interactions, up to almost 80% for some
datasets. 52 A less widely acknowledged but equally important problem with
biological data is incompleteness. Even for the most well-studied organisms,
experimental data for several biological processes are generally
unavailable. For instance, Hart et al. have estimated that for the
commonly used model organism S. cerevisiae , only 50% of all viable protein-
protein interactions are known, while for the human genome, this number is
as small as 11%. 52 This incompleteness may delay the discovery of proteins
involved in the processes represented by the missing data. This illustrates
that the problem of noise and incompleteness in biological datasets needs to
be adequately addressed in order to ensure that accurate inferences about
protein function are drawn from them.
We illustrate the efforts in preprocessing of biological data through the work
done on noise quantification and elimination in protein interaction datasets.
Several methods have been proposed for estimating the quality of a dataset
consisting of direct protein-protein interactions, such as the EPR (expression
profile reliability) index. 53 This method estimates the reliability of an
input interaction dataset by comparing the distribution of expression
correlations between its interacting protein pairs with the corresponding
distribution for the DIPCore dataset, 53 a set of about 5,000 highly
reliable interactions in the DIP database 35 that is treated here as the
reference dataset. Although this estimate of the reliability of an entire
dataset is useful, some interactions in a dataset are often more reliable
than others. Hence, it is also valuable to estimate the reliabilities of
individual interactions. A popular tool, known as PVM (paralogous verification
Available online at http://dip.doe-mbi.ucla.edu/dip/Services.cgi?SM=1. Accessed July 12, 2008.
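As a rough illustration of the dataset-level comparison described above, the following Python sketch estimates what fraction of a dataset's expression-correlation histogram is explained by a reference histogram (such as one built from DIPCore) rather than by a background histogram. It is a simplified stand-in and not the published EPR procedure: the use of Pearson correlation, the fixed binning, the background distribution built from presumed non-interacting pairs, the least-squares fit, and all function names are assumptions made here for illustration.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a constant profile carries no correlation signal
    return cov / sqrt(vx * vy)


def pair_correlations(pairs, expression):
    """Correlation for each protein pair that has expression data.

    `expression` is a hypothetical dict mapping protein id -> list of values.
    Pairs with a missing protein are skipped.
    """
    return [
        pearson(expression[a], expression[b])
        for a, b in pairs
        if a in expression and b in expression
    ]


def histogram(values, bins=10):
    """Normalized histogram of correlations over the range [-1, 1]."""
    if not values:
        raise ValueError("no correlation values to bin")
    counts = [0] * bins
    for v in values:
        idx = int((v + 1.0) / 2.0 * bins)
        idx = max(0, min(idx, bins - 1))  # clamp v == 1 and rounding noise
        counts[idx] += 1
    return [c / len(values) for c in counts]


def mixture_fraction(test, reference, background):
    """Least-squares alpha in: test ~ alpha*reference + (1 - alpha)*background.

    All three arguments are normalized histograms over the same bins.
    The result is clipped to [0, 1] and read as the estimated fraction
    of reference-like (reliable) interactions in the test dataset.
    """
    num = sum((t - b) * (r - b) for t, r, b in zip(test, reference, background))
    den = sum((r - b) ** 2 for r, b in zip(reference, background))
    if den == 0:
        raise ValueError("reference and background histograms are identical")
    return max(0.0, min(1.0, num / den))
```

In use, one would bin the output of `pair_correlations` for the dataset under test, for the reference set, and for a background set of pairs, then pass the three histograms to `mixture_fraction`. A value near 1 suggests that the dataset resembles the reference set, while a value near 0 suggests that it is closer to the background.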