Although they share similar goals and strategies, BLAST and
FASTA differ in several respects: (1) BLAST and FASTA use
different strategies for estimating statistical significance (though
the resulting estimates are very similar); (2) FASTA supports
more database formats and output alignment options; (3) there
are cosmetic differences in how the query, database, scoring
parameters, and options are specified.
Statistical differences: The BLAST program introduced rapid
heuristic similarity searching based on statistical thresholds [6]. With
the development of "Karlin-Altschul" local similarity score
statistics [7], it became possible to set thresholds based on statistical
parameters; only sequence alignments that could produce
"significant" scores were examined, minimizing alignment computations
on unrelated sequences. For the original BLAST, which focused on
combining ungapped alignments (HSPs), the statistical parameters
could be calculated analytically, but the introduction of
gapped-BLAST [8] required that the parameters be estimated for standard
scoring matrices and gap penalties by simulating unrelated
sequences. As a result, the BLAST programs offer a fixed set of
scoring matrices and gap penalties.
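
As a rough illustration of how such statistical parameters turn a raw alignment score into a significance estimate, the sketch below applies the standard Karlin-Altschul relation E = K m n exp(-lambda S). The lambda and K values are placeholder assumptions chosen only for illustration (they are not quoted in this section), the function names are hypothetical, and the length corrections that BLAST applies in practice are omitted.

```python
import math


def karlin_altschul_evalue(score, query_len, db_len, lam, k):
    """Expected number of chance alignments with score >= `score`.

    Implements E = K * m * n * exp(-lambda * S), where m is the query
    length and n is the total database length (no length correction).
    """
    return k * query_len * db_len * math.exp(-lam * score)


def bit_score(score, lam, k):
    """Normalized score in bits: S' = (lambda * S - ln K) / ln 2."""
    return (lam * score - math.log(k)) / math.log(2)


# Assumed, illustrative parameters for one fixed matrix/gap-penalty pair.
# Real values depend on the scoring matrix and gap penalties in use.
ASSUMED_LAMBDA = 0.267
ASSUMED_K = 0.041

raw_score = 100          # hypothetical raw alignment score
query_len = 300          # hypothetical query length (residues)
db_len = 1_000_000       # hypothetical database size (residues)

e_value = karlin_altschul_evalue(raw_score, query_len, db_len,
                                 ASSUMED_LAMBDA, ASSUMED_K)
bits = bit_score(raw_score, ASSUMED_LAMBDA, ASSUMED_K)
print(f"E() = {e_value:.3g}, bit score = {bits:.1f}")
```

Because lambda and K enter the calculation directly, a program that relies on precomputed values can only report significance for the matrix and gap-penalty combinations for which those values were simulated.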
FASTA uses a different approach, which calculates an
approximate similarity score for every sequence in the database. FASTA
assumes that it has calculated thousands of unrelated similarity
scores in every database search and uses these scores to estimate
the required statistical parameters (if only a few sequences are
compared, unrelated sequences are produced by shuffling the
library sequences). As a result of this assumption, FASTA includes
an option to shuffle every sequence in the library if the library is not
"representative" (-z 11). Since the FASTA programs estimate
statistical parameters in every search, the programs provide much
more flexibility in scoring matrix and gap-penalty choice. About
a dozen scoring matrices are built into the FASTA programs; other
scoring matrices can be provided from files. The FASTA programs
also allow arbitrary gap penalties.
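
The idea of estimating statistical parameters from unrelated scores can be sketched as follows. This is a simplified illustration, not the procedure the FASTA programs actually implement: it shuffles one library sequence, scores each shuffle with a toy Smith-Waterman scheme (match/mismatch values and a linear gap penalty chosen arbitrarily), and fits an extreme value distribution by the method of moments. All function names and the example sequences are hypothetical.

```python
import math
import random
import statistics

EULER_GAMMA = 0.5772156649015329


def local_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local score under a toy scoring scheme."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            s = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            cur.append(s)
            best = max(best, s)
        prev = cur
    return best


def shuffled_scores(query, subject, n_shuffles, rng):
    """Scores of the query against shuffled copies of the subject."""
    residues = list(subject)
    scores = []
    for _ in range(n_shuffles):
        rng.shuffle(residues)  # same composition, order destroyed
        scores.append(local_score(query, "".join(residues)))
    return scores


def fit_extreme_value(scores):
    """Method-of-moments estimates (lambda, mu) for a Gumbel distribution."""
    sd = statistics.stdev(scores)
    if sd == 0:
        raise ValueError("scores show no variation; cannot fit")
    lam = math.pi / (sd * math.sqrt(6))
    mu = statistics.fmean(scores) - EULER_GAMMA / lam
    return lam, mu


def evalue(score, lam, mu, n_library):
    """Expected number of library sequences scoring >= `score` by chance."""
    p = -math.expm1(-math.exp(-lam * (score - mu)))  # P(S >= score)
    return n_library * p


rng = random.Random(0)  # fixed seed so the sketch is reproducible
query = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRV"           # hypothetical
subject = "MSDNGKLEERLNLIEVQAPILARVGDGTQDNLSGAEKAVQVKVKALPDAQ"  # hypothetical
lam, mu = fit_extreme_value(shuffled_scores(query, subject, 200, rng))
observed = local_score(query, subject)
print(f"lambda={lam:.3f} mu={mu:.2f} E()={evalue(observed, lam, mu, 10_000):.3g}")
```

Because the parameters are fitted to scores produced under whatever scoring scheme is in use, nothing in this approach ties the search to a fixed list of matrices or gap penalties.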
The most common cause of misleading statistical significance
estimates (E()-values, or expect values) is low-complexity regions in
proteins. Older versions of blastp used the seg program [9] to
identify and mask out low-complexity regions. The current version
of BLAST uses a more sophisticated strategy by default [10]. With
the -S option, the FASTA programs can search sequence databases
that are "soft-masked", in which low-complexity regions are
indicated by lowercase amino acids. The pseg program, available
from the National Center for Biotechnology Information (NCBI)
(ftp://ftp.ncbi.nlm.nih.gov/pub/seg/pseg), or the segmasker
program, part of the BLAST distribution, can be used to
soft-mask entire sequence databases, marking low-complexity
regions with lowercase characters.
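
To make the soft-masking convention concrete, the sketch below lowercases residues that fall in low-entropy windows. It is a deliberately simple stand-in and does not reproduce the seg, pseg, or segmasker algorithms; the window length, the entropy threshold, the function names, and the example sequence are all assumptions made for illustration.

```python
import math
from collections import Counter


def window_entropy(window):
    """Shannon entropy (bits) of the residue composition of `window`."""
    n = len(window)
    counts = Counter(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def soft_mask(seq, window=12, threshold=2.2):
    """Return `seq` with low-entropy windows converted to lowercase.

    Every residue covered by at least one window whose entropy is at or
    below `threshold` is lowercased; all other residues stay uppercase.
    Sequences shorter than `window` are returned unmasked.
    """
    seq = seq.upper()
    masked = [False] * len(seq)
    for start in range(len(seq) - window + 1):
        if window_entropy(seq[start:start + window]) <= threshold:
            for i in range(start, start + window):
                masked[i] = True
    return "".join(ch.lower() if m else ch for ch, m in zip(seq, masked))


# Hypothetical protein with a poly-glutamine (low-complexity) stretch.
example = "MKTAYIAKQRQISFVKSHFSRQ" + "Q" * 15 + "LEERLGLIEVQAPILSRVGDGT"
print(soft_mask(example))
```

Soft-masking leaves the residues themselves unchanged, so a program that recognizes the lowercase convention can still display the full sequence in its alignments.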
2.2 BLAST and FASTA Differences