powerful when applied to protein sequence analysis, as the observed
species sampling of sequences is usually much smaller than the total
random combinations of all amino acids, thereby enabling statistically
significant inference of common ancestry. With the aim of tracing
evolutionary relations among genes and inferring putative functions,
many techniques have been developed for sequence comparisons and
assessment of the statistical significance of the homology, ranging from
pairwise comparisons such as the renowned Smith-Waterman and
BLAST algorithms to profiles of multiple-sequence alignments including
position-specific scoring matrices and hidden Markov models. Shared
functions and ancestries are confined to groups of similar genes, and
homology assessment techniques can therefore be used to identify such
protein families. Importantly, the approach used to define such groups
depends on the objective, particularly on whether the analysis targets
functional domains or whole-length genes. For example, a particular protein
function could be associated with a specific stretch of amino acids, or a
domain, which can be shuffled among different genes through evolution.
Recognizing the sequence characteristics of this domain would meet the
objective of grouping all genes with the potential for this function. However,
these genes may not all share ancestry outside the given domain, so
an evolutionary objective may require approaches based on whole-length
gene comparisons.
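The difference between domain-level and whole-length comparison can be illustrated with a local alignment. The following is a minimal sketch of a Smith-Waterman-style local alignment, assuming a simple match/mismatch score and a linear gap penalty. These scoring choices are illustrative assumptions only; practical protein comparisons typically use a substitution matrix such as BLOSUM62 and affine gap penalties. The function name, parameter values, and toy sequences are hypothetical.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for one best local alignment.

    Scoring is a toy scheme (assumption): fixed match/mismatch values
    and a linear gap penalty, not a real substitution matrix.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1]
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The floor at 0 is what makes the alignment local
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero is reached
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if H[i][j] == diag:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


# Two toy "genes" that share one embedded stretch (a stand-in for a
# shuffled domain) but have unrelated flanking sequence.
gene_1 = "MKVLA" + "CWHGDPKE" + "TTSR"
gene_2 = "QQNI" + "CWHGDPKE" + "LLFY"
score, aligned_1, aligned_2 = smith_waterman(gene_1, gene_2)
print(score, aligned_1, aligned_2)  # 16 CWHGDPKE CWHGDPKE
```

In this toy case the local alignment recovers only the shared stretch and ignores the unrelated flanks, which is why local methods suit domain-oriented objectives, whereas an evolutionary objective would also ask whether the flanking regions align.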
The approaches to define protein families on the basis of domains
rely on comparative data, where the pattern of selection highlighting the
identity of key amino acids is captured from the multiple sequence
alignment. This is often achieved primarily through expert human curation, as
exemplified by the PROSITE [12], SMART [13], Pfam [14], SCOP [15], and CATH [16]
databases. Many such resources have joined efforts to coordinate domain
annotation through the umbrella InterPro project [17], and the unified
InterProScan [18] software has been used to compare protein families for a
number of genome projects. Without prior knowledge of protein
domains, definition of protein families may be achieved through
unsupervised clustering methods applied to all-against-all sequence comparisons.
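
As a rough illustration of unsupervised clustering over all-against-all comparisons, the sketch below links any two sequences whose pairwise similarity passes a threshold and reports the resulting groups (single-linkage clustering). The similarity measure here, Jaccard overlap of 3-mers, is only a stand-in assumption for a real alignment-based homology score; the function names, threshold, and toy sequences are likewise hypothetical.

```python
from itertools import combinations


def kmer_set(seq, k=3):
    """All overlapping k-mers of a sequence, as a set."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard overlap of two sets (toy stand-in for a homology score)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_all_vs_all(seqs, threshold=0.3, k=3):
    """Single-linkage clustering from all-against-all comparisons.

    Returns (clusters, number_of_comparisons), where clusters is a
    sorted list of sorted name lists.
    """
    names = list(seqs)
    parent = {n: n for n in names}  # union-find forest

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    profiles = {n: kmer_set(seqs[n], k) for n in names}
    comparisons = 0
    for x, y in combinations(names, 2):  # every unordered pair once
        comparisons += 1
        if jaccard(profiles[x], profiles[y]) >= threshold:
            parent[find(x)] = find(y)  # merge the two clusters

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return sorted(sorted(g) for g in groups.values()), comparisons


toy = {
    "geneA": "MKVLACWHGDPKETTSR",
    "geneB": "MKVLACWHGDPKETTSK",
    "geneC": "QQNIYFLLPGGSAANWT",
    "geneD": "QQNIYFLLPGGSAANWS",
}
clusters, n_comparisons = cluster_all_vs_all(toy)
print(clusters)
print(n_comparisons)
```

For n sequences the loop performs n(n-1)/2 comparisons, so the four toy sequences require six.
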
As the number of required comparisons grows quadratically with the size
of the dataset, tentative cluster representatives may be used to
reduce the number of comparisons. Homology significance scoring such