While both BLAST and FASTA provide options for searching
a subset of a sequence database, FASTA can additionally take the
significant library “hits” from a search of a small database and use
them to align the query to additional sequences, in effect projecting
the smaller database onto a larger, more comprehensive (and possibly
redundant) database. The -e expand.sh option specifies a script
that returns an additional set of sequences to be aligned, based on
the sequences that were found in the initial search. For example, if
the initial search of human UniProt proteins returns the sequence
GSTM1_HUMAN, and a search of Swiss-Prot with human proteins
has identified homologs from other vertebrates, then the scores and
alignments are shown not only for the original GSTM1_HUMAN but
also for the other sequences, which were not present in the initial
search but were linked and returned by the -e expand.sh expansion
script. This allows searches to be performed against small,
representative datasets while returning results as if the additional
sequences had been included. The strategy can also be used to align
mRNAs (fastx -e expand.sh) against all known isoforms of a gene,
after initially searching only the canonical form of the protein.
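The expansion step described above can be sketched as a small script. This is a minimal illustration, not the script shipped with FASTA. It assumes the script is called with one argument, the name of a file that lists one accession and its E-value per line, and that it writes the additional sequences to standard output in FASTA format. The link table `LINKED`, the accession names, the placeholder sequences, and the 0.001 E-value cutoff are all hypothetical.

```python
#!/usr/bin/env python3
"""Sketch of an expansion script for `-e`; all names and data are placeholders."""
import sys

# Hypothetical link table: hit from the initial search -> extra sequences to align.
# A real script would look these up in a local database; the sequences are dummies.
LINKED = {
    "GSTM1_HUMAN": [("GSTM1_MOUSE", "ACDEFGHIK"), ("GSTM1_RAT", "ACDEFGHIK")],
}


def read_hits(lines, max_evalue=0.001):
    """Parse 'accession E-value' lines and keep accessions at or below max_evalue."""
    hits = []
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        accession, evalue = fields[0], float(fields[1])
        if evalue <= max_evalue:
            hits.append(accession)
    return hits


def expand(hits, linked):
    """Return FASTA text for sequences linked to the hits, each emitted once."""
    seen = set(hits)  # never re-emit a sequence already in the initial results
    records = []
    for accession in hits:
        for name, sequence in linked.get(accession, []):
            if name not in seen:
                seen.add(name)
                records.append(f">{name}\n{sequence}\n")
    return "".join(records)


if __name__ == "__main__" and len(sys.argv) == 2:
    with open(sys.argv[1]) as handle:
        sys.stdout.write(expand(read_hits(handle), LINKED))
```

Keeping the parsing and the lookup in separate functions makes it easy to replace the in-memory table with a query against whatever linked database is available locally.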
4.1.2 psiblast Works Best with Large Databases
Pairwise sequence similarity programs like blastp and fasta can
become less sensitive as database size increases, because larger
databases produce more high alignment scores by chance. Iterative
programs, like psiblast, can take advantage of the diversity in
large comprehensive database searches to dramatically improve
search sensitivity. Thus, while smaller databases can make
blastp/blastx and fasta/fastx/ssearch more effective,
psiblast performs best when used against larger databases, like
refseq_protein. If there are only a small number of very distant
homologs to the query, then smaller databases will be more effective.
But if there are many homologs that lack useful annotations,
psiblast can sometimes build a sensitive PSSM that can find a
well-annotated homolog (however, very distantly related sequences
are less likely to share a function).
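The loss of pairwise sensitivity with database size can be made concrete with the standard bit-score relation E = m · n · 2^(−S′), where m is the query length, n the total database length, and S′ the bit score. The sketch below is an illustration only: it ignores the length corrections that real programs apply, and the query length and database sizes are hypothetical round numbers.

```python
import math


def evalue(bit_score, query_len, db_residues):
    """Expected number of chance alignments scoring at least bit_score."""
    return query_len * db_residues * 2.0 ** (-bit_score)


def bits_needed(target_evalue, query_len, db_residues):
    """Bit score required to reach target_evalue in a database of this size."""
    return math.log2(query_len * db_residues / target_evalue)


if __name__ == "__main__":
    query_len = 200  # hypothetical query length (residues)
    databases = {"small": 1e7, "large": 1e10}  # hypothetical sizes (residues)
    for name, size in databases.items():
        print(
            f"{name}: E(40 bits) = {evalue(40, query_len, size):.3g}, "
            f"bits for E <= 0.001: {bits_needed(0.001, query_len, size):.1f}"
        )
```

Because the E-value grows linearly with database size, the same alignment is 1,000 times less significant in a database 1,000 times larger, and the score needed for a fixed E-value rises by log2(1000), roughly 10 bits.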
4.2 Changing Scoring Matrices and Gap Penalties
The BLAST and FASTA programs are optimized for identifying
distantly related sequences with full-length protein and gene-length
DNA sequences. Most investigators searching for homologs to
build a Multiple Sequence Alignment will do best by using the
default search parameters provided by blastp (BLOSUM62
scoring matrix, 11/1 for gap-open and gap-extend penalties) or
fasta/ssearch (BLOSUM50, 10/2). blastp and fasta/ssearch
search parameters have been extensively evaluated over a
very wide range of evolutionary distances and query sets; changing
the parameters almost always reduces sensitivity.
The scoring matrix and gap penalties should be changed (1) for
searches with partial-length (short) query sequences and (2) when