BLAST and FASTA Similarity Searching for Multiple Sequence Alignment - Multiple Sequence Alignment Methods

infer non-homology (unrelatedness); additional analyses must be done to identify the highest scoring unrelated sequence. One strategy is to perform a “reverse” search with the candidate non-homolog, particularly if the query or library (target) sequences come from large protein families. If there are no sequences with significant alignment scores shared by the initial query sequence in the first search and the candidate non-homolog library sequence in the second search, it is much less likely that they are homologous. Alternatively, if the query and candidate non-homolog do not share any domains annotated by Pfam [16] or another domain database, they are probably not homologous.
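
The “reverse” search comparison can be sketched in a few lines of Python. This is only an illustration, not part of BLAST or FASTA: the function names, the sequence identifiers, the E-values, and the 0.001 significance threshold are all assumptions made for the example, and the hit tables would in practice be parsed from real search output.

```python
from typing import Dict, Set


def significant_hits(hits: Dict[str, float], threshold: float = 1e-3) -> Set[str]:
    """Return the identifiers whose E-value is at or below the threshold.

    The 1e-3 default is an assumed cutoff chosen for illustration.
    """
    return {seq_id for seq_id, evalue in hits.items() if evalue <= threshold}


def shared_significant_hits(
    forward: Dict[str, float],
    reverse: Dict[str, float],
    threshold: float = 1e-3,
) -> Set[str]:
    """Sequences that are significant in both the forward and the reverse search."""
    return significant_hits(forward, threshold) & significant_hits(reverse, threshold)


# Hypothetical E-values: the first search uses the original query, the
# second ("reverse") search uses the candidate non-homolog as the query.
forward_hits = {"famA_1": 1e-40, "famA_2": 1e-12, "candidate": 0.8}
reverse_hits = {"famB_1": 1e-35, "famB_2": 1e-9, "famA_2": 2.5}

shared = shared_significant_hits(forward_hits, reverse_hits)
if shared:
    print("Shared significant hits:", sorted(shared))
else:
    print("No shared significant hits; homology is much less likely.")
```

An empty intersection does not prove non-homology; it only removes one line of evidence for homology, which is why the text also suggests checking domain annotations.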

BLAST and FASTA (and SSEARCH) calculate local sequence alignments; the boundaries of the alignments are calculated to maximize the similarity score. If the alignment were longer or shorter, the similarity score would be worse. In contrast, global similarity scores require that the alignment extend to the ends of the aligned sequences. Local similarity scores will always be positive; global scores, even for proteins that contain homologous domains, can be positive or negative. Local alignment scores have been universally adopted for similarity searching for several reasons: (1) the statistical theory for local similarity scores is well understood; (2) local similarity scores can identify locally homologous domains in different protein contexts; (3) local scores work well for partial sequences; and (4) local alignments can be used to identify homologous exons in long stretches of chromosomal DNA.
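
The difference between local and global scores can be illustrated with a minimal score-only dynamic programming sketch. It is not the BLAST, FASTA, or SSEARCH implementation: the function name, the toy match/mismatch/gap values (+2, -3, -4, linear gaps), and the two made-up sequences are assumptions for the example, whereas real searches use substitution matrices and affine gap penalties.

```python
def alignment_score(a, b, match=2, mismatch=-3, gap=-4, local=True):
    """Score-only alignment with a linear gap penalty (no traceback).

    local=True  -> Smith-Waterman-style score: cells are floored at 0 and
                   the best cell anywhere in the matrix is returned.
    local=False -> Needleman-Wunsch-style score: both sequences must be
                   aligned end to end, so the last cell is returned.
    """
    prev = [0 if local else j * gap for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0 if local else i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, prev[j] + gap, curr[j - 1] + gap)
            if local:
                cell = max(cell, 0)
                best = max(best, cell)
            curr.append(cell)
        prev = curr
    return best if local else prev[-1]


# Made-up sequences sharing one "domain" (ACDEFGHIK) in different contexts.
seq_a = "PPPPPPPP" + "ACDEFGHIK"
seq_b = "ACDEFGHIK" + "WWWWWWWW"

local_score = alignment_score(seq_a, seq_b, local=True)
global_score = alignment_score(seq_a, seq_b, local=False)
print("local:", local_score)    # the shared domain alone: 9 matches x 2 = 18
print("global:", global_score)  # negative: the unrelated flanks must be aligned too
```

The local score picks out the shared domain and ignores the flanks, while the global score is dragged below zero by the sequence that has to be aligned or gapped at both ends.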

While statistically significant local sequence similarity can be used to reliably infer homology, the overall homology of two aligned proteins or DNA sequences does not guarantee that every aligned residue-pair reflects homology, particularly at the ends of the alignment. For local sequence alignments, the boundaries of the alignment, i.e., whether it stops at residue n or residue n + 5, depend strongly on the scoring matrix. As discussed below (Subheading 4.2), an evolutionarily “deep” or sensitive scoring matrix (BLOSUM62 or BLOSUM50) will produce longer alignments than “shallow” scoring matrices (VT20, PAM30), even between unrelated sequences. (Unrelated or random alignments will not have statistically significant scores, but they will be longer with “deep” matrices.) Because they depend on the scoring matrix, alignment boundaries between two homologous domains flanked by non-homologous regions do not always stop at the end of the homology; the homologous alignment can be “over-extended” into non-homologous sequence [17].
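
How the scoring scheme moves alignment boundaries can be seen in a toy example. The sketch below is a deliberate simplification, not a real substitution matrix: it collapses the matrix into a single match score and a single mismatch penalty and considers only gapless alignments along one diagonal. The function name, the sequences, and the numeric values are assumptions; the harsh mismatch penalty stands in for a “shallow” matrix and the mild one for a “deep” matrix.

```python
def best_segment(a, b, match, mismatch):
    """Highest-scoring gapless segment along the main diagonal.

    Returns (score, start, end) with a half-open [start, end) range,
    found with a Kadane-style running maximum.
    """
    best_score, best_start, best_end = 0, 0, 0
    run_score, run_start = 0, 0
    for i, (x, y) in enumerate(zip(a, b)):
        if run_score <= 0:
            # A run that has dropped to zero or below cannot help; restart here.
            run_score, run_start = 0, i
        run_score += match if x == y else mismatch
        if run_score > best_score:
            best_score, best_start, best_end = run_score, run_start, i + 1
    return best_score, best_start, best_end


# Made-up sequences: an identical 8-residue core (positions 6-13) flanked
# by unrelated residues that happen to match at a few positions.
seq_a = "MKTLAV" + "GHWCDEFY" + "PQRSNI"
seq_b = "MRSLEQ" + "GHWCDEFY" + "AKRTNV"

shallow = best_segment(seq_a, seq_b, match=5, mismatch=-4)  # harsh on mismatches
deep = best_segment(seq_a, seq_b, match=5, mismatch=-1)     # tolerant of mismatches

print("shallow-like:", shallow)  # (40, 6, 14): stops at the edges of the core
print("deep-like:", deep)        # (53, 0, 19): extends into both flanks
```

With the tolerant scheme, the chance matches in the flanks outweigh the intervening mismatches, so the highest-scoring segment grows well beyond the identical core. This is the same mechanism, in miniature, as the over-extension described above.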

Homologous over-extension was first recognized in genomic DNA sequence alignment [18]; more recently it was shown to be the major cause of psiblast Position-Specific Scoring Matrix contamination [17], which can dramatically reduce the selectivity of psiblast searches [3, 17]. Versions of psiblast and

3.3 Establishing Homology Boundaries
