accomplishment of a protein's function. These procedures have generated a
wide variety of useful data, ranging from simple protein sequences to complex
high-throughput data such as gene expression datasets and protein interaction
networks. These data offer different types of insight into a protein's function
and related concepts. For instance, protein interaction data show which proteins
come together to perform a particular function, while the three-dimensional
structure of a protein determines the precise sites at which an interacting
protein binds. Because of their utility, these data have in recent years been
recorded in standardized, professionally maintained databases such as
SWISS-PROT, 33 MIPS, 34 DIP 35 and PDB. 36
However, the huge amount of data accumulated through these experiments over the
years has made biological discovery via manual analysis tedious and cumbersome,
and has led to the emergence of the field of bioinformatics. Indeed, an
increasingly accepted path for biological research is to first generate
hypotheses by running an appropriate bioinformatics algorithm, which narrows the
search space, and then to validate these hypotheses experimentally to reach the
final conclusion. 31, 37 Owing to this change in approach and the importance of
understanding protein function, numerous researchers have applied computational
and mathematical techniques to predict protein function and attempt to close the
sequence-function gap.
Early approaches used sequence similarity tools such as BLAST 38, 39 to transfer
functional annotation from the most similar proteins. Subsequently, several
other approaches have been proposed that use other types of biological data for
computational protein function prediction, such as gene expression data,
protein interaction networks, and phylogenetic profiles. These techniques make
it possible to predict the functions of proteins that cannot be reliably
annotated using sequence similarity-based techniques. 40, 41 Table 8.2 summarizes
the general ideas used by several of these approaches, categorized by
the type of biological data they utilize. Additional details on several hundred
such techniques published in this area in the last few years are available in
Pandey et al. 42 and other reviews on this topic. 43-46
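
The similarity-based annotation transfer mentioned above can be illustrated with
a minimal sketch. It assumes that similarity hits have already been computed by
a tool such as BLAST and loaded into a dictionary; the function name
transfer_annotations, the min_score threshold, the protein identifiers, and all
scores are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical similarity hits for each query protein: (hit protein, score).
# In practice these would come from a sequence similarity tool such as BLAST;
# the scores below are made-up numbers for illustration only.
Hits = Dict[str, List[Tuple[str, float]]]


def transfer_annotations(
    hits: Hits,
    known: Dict[str, Set[str]],
    min_score: float = 50.0,  # assumed cutoff, not taken from the source text
) -> Dict[str, Optional[Set[str]]]:
    """Copy the labels of the best-scoring annotated hit to each query.

    Returns None for a query when no annotated hit reaches min_score,
    i.e. when similarity alone cannot annotate it reliably.
    """
    predictions: Dict[str, Optional[Set[str]]] = {}
    for query, query_hits in hits.items():
        # Keep only hits that are annotated and pass the score threshold.
        usable = [(h, s) for h, s in query_hits if h in known and s >= min_score]
        if not usable:
            predictions[query] = None
            continue
        best_hit, _ = max(usable, key=lambda pair: pair[1])
        predictions[query] = set(known[best_hit])
    return predictions


known_functions = {
    "P1": {"kinase activity"},
    "P2": {"DNA binding", "transcription regulation"},
}
similarity_hits = {
    "Q1": [("P1", 210.0), ("P2", 65.0)],  # best annotated hit is P1
    "Q2": [("P2", 48.0)],                 # only hit is below the threshold
    "Q3": [("P9", 300.0)],                # strong hit, but P9 is unannotated
}
result = transfer_annotations(similarity_hits, known_functions)
print(result)
```

Queries such as Q2 and Q3, for which no sufficiently similar annotated protein
exists, are the cases that motivate the use of the other data types listed
above.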
As can be seen, many of the techniques listed in Table 8.2, such as
classification, clustering, and association analysis, 47 are drawn from the fields
of data mining and machine learning. Indeed, these techniques have produced some
of the best results for this problem, and a number of these results have even
been experimentally verified. 48, 49
However, despite this progress, the gap between the number of known proteins and
the number that have been functionally annotated remains astounding, as
illustrated by Figure 8.3. This gap is possibly due to several outstanding
issues that have yet to be convincingly addressed. Some of the general issues
are the possibility that a protein performs multiple functions, and thus has
multiple functional labels; the widely varying sizes of functional classes, with
most classes being very small; the hierarchical arrangement of functional
labels, such as in the Gene Ontology; 50 and the incompleteness of biological
data, together with the various types and extents of noise in it.
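
Three of these issues (multiple labels per protein, uneven class sizes, and
hierarchically arranged labels) can be made concrete with a small sketch. The
hierarchy, label names, and protein identifiers below are invented, and
expanding each label to include its ancestors is an assumption of this example
(a common convention for hierarchical vocabularies), not something stated in the
text.

```python
from collections import Counter
from typing import Dict, Set

# Toy functional hierarchy (child -> parent); all label names are invented.
PARENT: Dict[str, str] = {
    "protein kinase activity": "kinase activity",
    "kinase activity": "catalytic activity",
    "DNA binding": "binding",
}


def with_ancestors(labels: Set[str]) -> Set[str]:
    """Return the labels together with all of their ancestors in PARENT."""
    expanded = set(labels)
    for label in labels:
        # Walk up the hierarchy until a root label is reached.
        while label in PARENT:
            label = PARENT[label]
            expanded.add(label)
    return expanded


# Each protein may carry several labels (multi-label), or none at all.
annotations: Dict[str, Set[str]] = {
    "P1": {"protein kinase activity", "DNA binding"},
    "P2": {"kinase activity"},
    "P3": {"DNA binding"},
    "P4": set(),  # no known function: part of the annotation gap
}

expanded = {protein: with_ancestors(labels) for protein, labels in annotations.items()}

# Class size = number of proteins carrying each label after expansion.
class_sizes = Counter(label for labels in expanded.values() for label in labels)
for label, size in sorted(class_sizes.items()):
    print(f"{label}: {size}")
```

Even in this toy case the most specific class contains a single protein, which
hints at why very small functional classes are hard to learn from.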