BLAST and FASTA Similarity Searching for Multiple Sequence Alignment - Multiple Sequence Alignment Methods


Chapter 5

BLAST and FASTA Similarity Searching for Multiple Sequence Alignment

William R. Pearson

Abstract

BLAST, FASTA, and other similarity searching programs seek to identify homologous proteins and DNA sequences based on excess sequence similarity. If two sequences share much more similarity than expected by chance, the simplest explanation for the excess similarity is common ancestry (homology). For sequences that encode proteins, the most effective similarity searches compare protein sequences rather than DNA sequences, and use expectation values, rather than percent identity, to infer homology. The BLAST and FASTA packages provide programs for comparing protein and DNA sequences to protein databases (the most sensitive searches). Protein and translated-DNA comparisons to protein databases routinely allow evolutionary look-back times of 1 to 2 billion years; DNA:DNA searches are 5-10-fold less sensitive. BLAST and FASTA can be run on popular web sites, but can also be downloaded and installed on local computers. With local installation, target databases can be customized for the sequence data being characterized. With today's very large protein databases, search sensitivity can also be improved by searching smaller comprehensive databases, for example, a complete protein set from an evolutionarily neighboring model organism. By default, BLAST and FASTA use scoring strategies targeted at distant evolutionary relationships; for comparisons involving short domains or queries, or searches that seek relatively close homologs (e.g. mouse-human), shallower scoring matrices will be more effective. Both BLAST and FASTA provide very accurate statistical estimates, which can be used to reliably identify protein sequences that diverged more than 2 billion years ago.

Key words: BLAST, FASTA, Homology, Similarity, Expectation value, Scoring matrices

1 Introduction

Identification of homologous sequences is an essential first step before Multiple Sequence Alignment. If multiply aligned sequences are not homologous, their alignment has no biological meaning. Unfortunately, Multiple Sequence Alignments do not provide the measures of statistical significance that are required to infer homology. The selection of a set of sequences for multiple alignment presumes that they are homologous; in this chapter we will discuss the inference of homology from sequence similarity searches using the popular programs BLAST and FASTA.
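To make the role of expectation values concrete, the following minimal Python sketch converts an alignment bit score into an expectation value using the standard relation E = m * n * 2^(-S'), where S' is the bit score, m the query length, and n the total number of residues in the database. The function names, the example lengths, and the 0.001 significance threshold are assumptions chosen for illustration, not values taken from this chapter, and the sketch ignores the edge-effect length corrections that BLAST and FASTA apply in practice.

```python
def evalue_from_bits(bit_score: float, query_len: int, db_residues: int) -> float:
    """Expected number of chance alignments scoring at least bit_score.

    Uses E = m * n * 2**(-S'), with no edge-effect length correction.
    """
    return query_len * db_residues * 2.0 ** (-bit_score)


def is_significant(evalue: float, threshold: float = 1e-3) -> bool:
    """Illustrative cutoff (an assumption): treat E below the threshold as evidence of homology."""
    return evalue < threshold


if __name__ == "__main__":
    query_len = 200  # hypothetical query length in residues
    # Same 40-bit alignment searched against a large and a small database.
    for db_residues in (10**8, 10**6):
        e = evalue_from_bits(40.0, query_len, db_residues)
        print(f"db={db_residues:>11,d} residues  E={e:.2e}  significant={is_significant(e)}")
```

Under these assumptions, the same 40-bit alignment falls below the cutoff only in the smaller database, which illustrates why searching a smaller comprehensive database can improve sensitivity.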
