BLAST and FASTA Similarity Searching for Multiple Sequence Alignment - Multiple Sequence Alignment Methods


While both BLAST and FASTA provide options for searching a subset of a sequence database, FASTA can additionally search a subset of a database and then use the significant library “hits” to align the query against additional sequences, projecting the smaller database onto a larger, more comprehensive (and possibly redundant) database. The -e expand.sh option specifies a script that returns an additional set of sequences to be aligned, based on the sequences that were found in the initial search. For example, if an initial search of human UniProt proteins returns the sequence GSTM1_HUMAN, and a search of Swiss-Prot with human proteins has identified homologs from other vertebrates, then the scores and alignments are shown not only for the original GSTM1_HUMAN but also for the other sequences, which were not present in the initial search but were linked and returned by the -e expand.sh expansion script. This allows searches to be performed against small, representative datasets while returning results as if the additional sequences had been included. The strategy can also be used to align mRNAs (fastx -e expand.sh) against all known isoforms of a gene, after initially searching only the canonical form of the protein.
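As a rough illustration of what an expansion script does, the Python sketch below reads the hits from an initial search and returns the linked sequences in FASTA format. It assumes the script is handed the name of a file with one accession and E()-value per line and writes the additional library to standard output; the FASTA documentation should be consulted for the exact interface. The link table, the placeholder residues, the linked accession names, and the 0.001 threshold are hypothetical and serve only to show the idea.

```python
import sys

# Hypothetical link table: accession found in the small database ->
# accessions of related sequences present only in the larger database.
LINKS = {
    "GSTM1_HUMAN": ["GSTM1_MOUSE", "GSTM1_RAT"],
}

# Placeholder residues for illustration, not real protein sequences.
SEQUENCES = {
    "GSTM1_MOUSE": "ACDEFGHIKL",
    "GSTM1_RAT": "LKIHGFEDCA",
}


def expand_hits(hit_lines, links, sequences, e_cutoff=0.001):
    """Return FASTA text for sequences linked to significant hits."""
    records = []
    seen = set()
    for line in hit_lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        accession, e_value = fields[0], float(fields[1])
        if e_value > e_cutoff:
            continue  # only significant hits are expanded
        for linked in links.get(accession, []):
            if linked in sequences and linked not in seen:
                seen.add(linked)
                records.append(f">{linked}\n{sequences[linked]}\n")
    return "".join(records)


def main(argv):
    # Assumed calling convention: one argument, the name of the hits file.
    if len(argv) != 2:
        return
    with open(argv[1]) as handle:
        sys.stdout.write(expand_hits(handle, LINKS, SEQUENCES))


if __name__ == "__main__":
    main(sys.argv)
```

In practice the link table would be built from a database that records which sequences in the comprehensive library are related to each entry in the representative one, rather than being hard-coded.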

4.1.2 psiblast Works Best with Large Databases

Pairwise sequence similarity programs like blastp and fasta can become less sensitive as database size increases, because larger databases produce more high alignment scores by chance. Iterative programs, like psiblast, can take advantage of the diversity in large comprehensive database searches to dramatically improve search sensitivity. Thus, while smaller databases can make blastp/blastx and fasta/fastx/ssearch more effective, psiblast performs best when used against larger databases, like refseq_protein. If there are only a small number of very distant homologs to the query, then smaller databases will be more effective. But if there are many homologs that lack useful annotations, psiblast can sometimes build a sensitive PSSM that can find a well-annotated homolog (however, very distantly related sequences are less likely to share a function).
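The effect of database size on chance scores can be made concrete with the standard Karlin-Altschul relation between bit score and expected count, E = m * n * 2^(-S'), where m is the query length and n is the total database length. The following is a minimal sketch only: the function name, the query length, and the two database sizes are illustrative assumptions rather than values from the text, and the length corrections applied by real programs are omitted.

```python
def expected_chance_hits(bit_score, query_len, db_residues):
    """Expected number of chance alignments: E = m * n * 2**(-S')."""
    return query_len * db_residues * 2.0 ** (-bit_score)


small_db = 2.0e7   # hypothetical database size, in residues
large_db = 2.0e10  # hypothetical database size, 1000 times larger

for name, size in (("small", small_db), ("large", large_db)):
    e = expected_chance_hits(bit_score=40.0, query_len=300, db_residues=size)
    print(f"{name} database: E = {e:.2g}")
```

Because E grows linearly with database size, the same 40-bit alignment is expected by chance 1000 times more often in the larger database, so a score that is significant against the small database may not be against the large one.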

4.2 Changing Scoring Matrices and Gap Penalties

The BLAST and FASTA programs are optimized for identifying distantly related sequences with full-length protein and gene-length DNA sequences. Most investigators searching for homologs to build a Multiple Sequence Alignment will do best by using the default search parameters provided by blastp (BLOSUM62 scoring matrix, -11/-1 for gap-open and gap-extend penalties) or fasta/ssearch (BLOSUM50, -10/-2). blastp and fasta/ssearch search parameters have been extensively evaluated over a very wide range of evolutionary distances and query sets; changing the parameters almost always reduces sensitivity.
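To make the gap-open/gap-extend notation concrete, the sketch below computes the cost of a single gap under the blastp defaults (11/1) and the fasta/ssearch defaults (10/2). It assumes the convention that a gap of k residues costs open + k * extend, so that a one-residue gap costs 12 under either default. The dictionary and function names are illustrative only.

```python
# Default parameters named in the text; penalties stored as positive costs.
DEFAULTS = {
    "blastp": {"matrix": "BLOSUM62", "gap_open": 11, "gap_extend": 1},
    "fasta/ssearch": {"matrix": "BLOSUM50", "gap_open": 10, "gap_extend": 2},
}


def gap_cost(length, gap_open, gap_extend):
    """Affine cost of one gap, assuming cost = open + length * extend."""
    if length <= 0:
        return 0
    return gap_open + length * gap_extend


for program, params in DEFAULTS.items():
    costs = [gap_cost(k, params["gap_open"], params["gap_extend"])
             for k in (1, 5, 10)]
    print(program, params["matrix"], costs)
```

Because BLOSUM62 and BLOSUM50 are scaled in different units, the raw costs are not directly comparable between the two programs. The sketch only shows how the two numbers in each pair combine.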

The scoring matrix and gap penalties should be changed (1) for searches with partial-length (short) query sequences and (2) when
