Homology has become a ubiquitous term in genome analysis
and computational biology, but the inference and implications of
homology—descent from a common ancestor—can be confusing.
Two protein or DNA sequences are either homologous or they are
not, but our ability to infer homology depends on context: the
particular sequences and programs used, the library selected, and
the statistical threshold chosen.
We infer that two sequences are “homologous” from excess
similarity. When two sequences share more similarity than would be
expected by chance, the most parsimonious explanation is that the
sequences diverged from a common ancestor. Thus, this simple and
widely accepted understanding of significance and homology has a
statistical foundation; we cannot infer homology without some
estimate of how often a similarity score might occur by chance.
The distribution of chance scores depends on the search context;
searches against large databases will produce higher scores on average,
simply because there are more opportunities to produce a high
score by chance. Thus, a similarity score that is clearly significant
and provides strong evidence for homology in a search of the
human protein set (about 40,000 sequences) might not be significant
in the context of a search of 20,000,000 sequences, the current
size of the largest protein databases. Context dependence is one of
several unsettling properties of homology inference; a statistically
significant similarity score can be used to infer homology, but a
nonsignificant score cannot be used to infer non-homology.
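
The dependence on database size can be illustrated with a short calculation. The sketch below assumes the Karlin-Altschul expectation in bit-score form, E = m * n * 2^(-S), where m is the query length and n is the total number of residues in the library; it ignores the effective-length corrections that real search programs apply. The function name expect_value, the average protein length of 400 residues, the bit score of 40, and the 0.01 threshold are illustrative assumptions, not values taken from the text.

```python
def expect_value(bit_score, query_len, db_seqs, avg_len=400):
    """Expected number of chance hits scoring >= bit_score.

    Uses E = m * n * 2**(-S), with m the query length and n the total
    library length, approximated here as db_seqs * avg_len (an assumption).
    """
    search_space = query_len * db_seqs * avg_len
    return search_space * 2.0 ** (-bit_score)


# Same alignment score, two search contexts (all numbers are illustrative).
bit_score = 40.0
query_len = 400
threshold = 0.01  # hypothetical significance cutoff

e_human = expect_value(bit_score, query_len, db_seqs=40_000)
e_large = expect_value(bit_score, query_len, db_seqs=20_000_000)

for label, e in (("40,000 sequences", e_human), ("20,000,000 sequences", e_large)):
    verdict = "significant" if e < threshold else "not significant"
    print(f"{label}: E = {e:.3g} ({verdict})")
```

Under these assumptions the same 40-bit score has an expectation of roughly 0.006 against the smaller library and roughly 2.9 against the larger one, a 500-fold difference that mirrors the ratio of library sizes.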
Likewise, our ability to infer homology from similarity searches
depends on the query sequence used for the search. The
significance/nonsignificance problem is frequently encountered in
diverse protein families, where many members of a family share
significant similarity to one member of the family, but others do
not. A sequence from a highly populated part of a protein family
tree, e.g., human protein kinases, can easily detect thousands of
homologs with very significant scores, but a protein kinase from
slime mold may find only a few dozen clear homologs. Strategies
like psiblast that build models of protein families can reduce
these differences by capturing a much larger fraction of the members
of a family, but most diverse protein families will have members
that are hard to identify from sequence alone.
A multiple sequence alignment can still make sense when not
every sequence shares significant similarity with every other
sequence in the alignment, as long as some chain of significant
similarities connects the members of the family.
In diverse families, a sequence A may share significant similarity
with family members B, C, and D, but not with E, F, and G. In this
case, if B, C, or D shares significant similarity with E, F, and G, then
one can infer that they all belong to the same family, and that a
multiple sequence alignment makes biological sense.
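
The transitive argument above amounts to finding connected components in a graph whose edges are significant pairwise similarities. The following is a minimal sketch of that idea using a union-find structure; the function name families and the particular list of significant pairs are hypothetical, chosen only to match the A through G example.

```python
def families(sequences, significant_pairs):
    """Group sequences linked by any chain of significant similarities."""
    parent = {s: s for s in sequences}

    def find(x):
        # Follow parent links to the group representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in significant_pairs:
        parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for s in sequences:
        groups.setdefault(find(s), set()).add(s)
    return list(groups.values())


# A is significantly similar to B, C, and D only; E, F, and G are
# reached through B, C, and D (hypothetical pairs for illustration).
pairs = [("A", "B"), ("A", "C"), ("A", "D"),
         ("B", "E"), ("C", "F"), ("D", "G")]

print(families("ABCDEFG", pairs))
```

With these pairs all seven sequences fall into a single group even though A has no direct significant similarity to E, F, or G. A sequence with no significant link to any member would be reported as its own group, which, as noted earlier, does not demonstrate non-homology.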