infer non-homology (unrelatedness); additional analyses must be
done to identify the highest scoring unrelated sequence. One
strategy is to perform a “reverse” search with the candidate
non-homolog, particularly if the query or library (target) sequences come
from large protein families. If there are no sequences with signifi-
cant alignment scores shared by the initial query sequence in the
first search and the candidate non-homolog library sequence in
the second search, it is much less likely that they are homologous.
Alternatively, if the query and candidate non-homolog do not share
any domains annotated by Pfam [16] or another domain database,
they are probably not homologous.
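
The reverse-search check described above can be sketched as a set intersection. This is a minimal illustration, not part of any search program: the function names, the E-value cutoff of 0.001, and the accessions and E-values in the toy data are all assumptions made up for the example.

```python
def significant(hits, evalue_threshold=1e-3):
    """Return the accessions whose E-value is at or below the threshold."""
    return {acc for acc, evalue in hits.items() if evalue <= evalue_threshold}


def shared_significant_hits(query_hits, candidate_hits, evalue_threshold=1e-3):
    """Accessions that are significant in both the forward and the reverse search."""
    return significant(query_hits, evalue_threshold) & significant(
        candidate_hits, evalue_threshold
    )


# Toy, made-up results (accession -> E-value); the names are hypothetical.
query_hits = {"famA_1": 1e-40, "famA_2": 1e-12, "candidate": 0.8}
candidate_hits = {"famB_1": 1e-30, "famB_2": 1e-9, "famA_2": 2.5}

shared = shared_significant_hits(query_hits, candidate_hits)
print(sorted(shared))  # an empty list argues against homology
```

An empty intersection does not prove non-homology; it only makes homology less likely, as the text notes.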
BLAST and FASTA (and SSEARCH) calculate local sequence
alignments; the boundaries of the alignments are calculated to
maximize the similarity score. If the alignment were longer or
shorter, the similarity score would be worse. In contrast, global
similarity scores require that the alignment extend to the ends of
the aligned sequences. Local similarity scores will always be posi-
tive; global scores, even for proteins that contain homologous
domains, can be positive or negative. Local alignment scores have
been universally adopted for similarity searching for several reasons:
(1) the statistical theory for local similarity scores is well under-
stood; (2) local similarity scores can identify locally homologous
domains in different protein contexts; (3) local scores work well for
partial sequences; and (4) local alignments can be used to identify
homologous exons in long stretches of chromosomal DNA.
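
The difference between local and global scores can be seen in a small sketch. It assumes a toy scoring scheme (match +2, mismatch -1, linear gap penalty -2) rather than a real protein matrix, and the two sequences are invented so that a shared segment sits in different contexts.

```python
def global_score(a, b, match=2, mismatch=-1, gap=-2):
    """Needleman-Wunsch score: the alignment must span both sequences end to end."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + sub, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def local_score(a, b, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman score: best-scoring pair of subsequences, never below zero."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(0, prev[j - 1] + sub, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


# Invented sequences: the segment WYWYWY is shared, the flanks are unrelated.
seq_a = "AAAAAAAAAA" + "WYWYWY"
seq_b = "WYWYWY" + "CCCCCCCCCC"

print("local: ", local_score(seq_a, seq_b))   # positive: the shared segment is found
print("global:", global_score(seq_a, seq_b))  # negative: the flanks must be aligned too
```

The local score rewards only the shared segment, whereas the global score is dragged below zero by the unrelated flanks that it is forced to align.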
While statistically significant local sequence similarity can be
used to reliably infer homology, the overall homology of two
aligned proteins or DNA sequences does not guarantee that every
aligned residue-pair reflects homology, particularly at the ends
of the alignment. For local sequence alignments, the boundaries
of the alignment, i.e., whether it stops at residue n or residue n + 5,
depend strongly on the scoring matrix. As discussed below
(Subheading 4.2), an evolutionarily “deep” or sensitive scoring
matrix (BLOSUM62 or BLOSUM50) will produce longer align-
ments than “shallow” scoring matrices (VT20, PAM30), even
between unrelated sequences. (Unrelated or random alignments
will not have statistically significant scores, but they will be longer
with “deep” matrices.) Because they depend on the scoring matrix,
alignment boundaries between two homologous domains flanked
by non-homologous regions do not always stop at the end of the
homology; the homologous alignment can be “over-extended”
into non-homologous sequence [17].
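
The dependence of alignment boundaries on the scoring scheme can be illustrated with a toy example. The sketch below finds the highest-scoring ungapped segment of two pre-aligned strings under two hypothetical schemes: a lenient one (match +4, mismatch -1) standing in for a “deep” matrix and a strict one (match +6, mismatch -8) standing in for a “shallow” matrix. These numbers are assumptions chosen for illustration, not values from BLOSUM62, VT20, or PAM30, and the sequences are invented: ten identical residues followed by a flank with two mismatches and two chance matches.

```python
def best_segment(a, b, match, mismatch):
    """Highest-scoring ungapped local segment of two pre-aligned, equal-length strings.

    Returns (score, start, end), with end exclusive.
    """
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    best = (0, 0, 0)
    running, start = 0, 0
    for i, (x, y) in enumerate(zip(a, b)):
        running += match if x == y else mismatch
        if running <= 0:
            # A non-positive running score can never help; restart after this column.
            running, start = 0, i + 1
        elif running > best[0]:
            best = (running, start, i + 1)
    return best


# Invented sequences: a shared 10-residue core, then a non-homologous flank.
core = "MKVLAAGIVG"
seq_a = core + "QDWT"
seq_b = core + "PSWT"

lenient = best_segment(seq_a, seq_b, match=4, mismatch=-1)
strict = best_segment(seq_a, seq_b, match=6, mismatch=-8)
print(lenient)  # (46, 0, 14): the segment runs on into the flank
print(strict)   # (60, 0, 10): the segment stops at the end of the core
```

Under the lenient scheme the two chance matches outweigh the two mismatches, so the reported segment is over-extended past the shared core; under the strict scheme the same flank lowers the score and the boundary stays at the core.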
Homologous over-extension was first recognized in genomic
DNA sequence alignment [18]; more recently it was shown to be
the major cause of psiblast Position Specific Scoring Matrix
contamination [17], which can dramatically reduce the selectivity
of psiblast searches [3, 17]. Versions of psiblast and
3.3 Establishing Homology Boundaries