Chapter 5
BLAST and FASTA Similarity Searching for Multiple
Sequence Alignment
William R. Pearson
Abstract
BLAST, FASTA, and other similarity searching programs seek to identify homologous proteins and DNA
sequences based on excess sequence similarity. If two sequences share much more similarity than expected
by chance, the simplest explanation for the excess similarity is common ancestry—homology. The most
effective similarity searches compare protein sequences, rather than DNA sequences, for sequences that
encode proteins, and use expectation values, rather than percent identity, to infer homology. The BLAST
and FASTA packages of sequence comparison programs provide programs for comparing protein and DNA
sequences to protein databases (the most sensitive searches). Protein and translated-DNA comparisons to
protein databases routinely allow evolutionary look back times from 1 to 2 billion years; DNA:DNA
searches are 5-10-fold less sensitive. BLAST and FASTA can be run on popular web sites, but can also be
downloaded and installed on local computers. With local installation, target databases can be customized
for the sequence data being characterized. With today's very large protein databases, search sensitivity can
also be improved by searching smaller comprehensive databases, for example, a complete protein set from
an evolutionarily neighboring model organism. By default, BLAST and FASTA use scoring strategies targeted
for distant evolutionary relationships; for comparisons involving short domains or queries, or searches that
seek relatively close homologs (e.g. mouse-human), shallower scoring matrices will be more effective. Both
BLAST and FASTA provide very accurate statistical estimates, which can be used to reliably identify protein
sequences that diverged more than 2 billion years ago.
Key words: BLAST, FASTA, Homology, Similarity, Expectation value, Scoring matrices
1 Introduction
Identification of homologous sequences is an essential first step
before Multiple Sequence Alignment. If multiply aligned sequences
are not homologous, their alignment has no biological meaning.
Unfortunately, Multiple Sequence Alignments do not provide the
measures of statistical significance that are required to infer homology.
The selection of a set of sequences for multiple alignment
presumes that they are homologous; in this chapter we will discuss
the inference of homology from sequence similarity searches using
the popular programs BLAST and FASTA.
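
The reliance on expectation values rather than percent identity, and the
gain in sensitivity from searching smaller databases, can be illustrated
with a short sketch. It assumes the standard Karlin-Altschul relation
between a normalized bit score S' and the expectation value,
E = m * n * 2^(-S'), where m is the query length and n is the total
database length in residues. The formula, the function names, and the
example lengths are illustrative assumptions, not taken from this chapter.
The sketch also ignores the effective-length corrections that real
programs apply, and FASTA estimates its statistics somewhat differently,
so the numbers are indicative only.

```python
import math


def evalue_from_bits(bit_score, query_len, db_residues):
    """Expected number of chance alignments scoring >= bit_score.

    Assumes E = m * n * 2^(-S'), with m = query_len and
    n = db_residues (no edge-effect or effective-length correction).
    """
    return query_len * db_residues * 2.0 ** (-bit_score)


def bits_for_evalue(evalue, query_len, db_residues):
    """Bit score needed to reach a given E-value in a given search space."""
    return math.log2(query_len * db_residues / evalue)


if __name__ == "__main__":
    query_len = 300          # hypothetical query length (residues)
    large_db = 1e9           # hypothetical comprehensive database size
    small_db = 1e7           # hypothetical single-proteome database size

    for name, n in (("large", large_db), ("small", small_db)):
        e = evalue_from_bits(50.0, query_len, n)
        need = bits_for_evalue(0.001, query_len, n)
        print(f"{name} db: E(50 bits) = {e:.2e}, "
              f"bits needed for E <= 0.001: {need:.1f}")
```

Under this relation, the same alignment score searched against a database
100-fold smaller receives an E-value 100-fold lower, which is equivalent
to gaining about 6.6 bits of score.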