accurate shuffled statistical estimates. All the FASTA programs
shuffle the second (library) sequence to produce statistical esti-
mates, and shuffles of protein sequences, which are produced by
fastx, more accurately reflect the distribution of unrelated
sequence scores. Randomly shuffled DNA sequences are a less
accurate model of unrelated DNA sequences.
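The shuffling idea can be illustrated with a minimal Python sketch. It is not how the FASTA programs are implemented: it assumes simple match/mismatch scores with a linear gap penalty instead of a scoring matrix, and it reports an empirical fraction of shuffled scores rather than fitting a score distribution. The function names (local_score, shuffled_scores, empirical_p) and the example sequences are hypothetical.

```python
import random


def local_score(a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -2) -> int:
    """Best Smith-Waterman local alignment score with a linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def shuffled_scores(query: str, library: str, n_shuffles: int = 200, seed: int = 0) -> list:
    """Score the query against shuffled copies of the library sequence.

    Shuffling keeps the length and residue composition of the library
    sequence but destroys its order, giving scores for "unrelated" sequences.
    """
    rng = random.Random(seed)
    residues = list(library)
    scores = []
    for _ in range(n_shuffles):
        rng.shuffle(residues)
        scores.append(local_score(query, "".join(residues)))
    return scores


def empirical_p(real_score: int, null_scores: list) -> float:
    """Fraction of shuffled scores at least as high as the real score (with a +1 pseudocount)."""
    hits = sum(s >= real_score for s in null_scores)
    return (hits + 1) / (len(null_scores) + 1)


# Hypothetical example sequences: the query is a prefix of the library sequence.
query = "MKTAYIAKQRQISFVKSHFSRQ"
library = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"

real = local_score(query, library)
null = shuffled_scores(query, library)
print(real, max(null), empirical_p(real, null))
```

A real score far above every shuffled score suggests the similarity does not arise from composition alone; with only 200 shuffles, the empirical fraction cannot go below about 0.005.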
By default, the expectation values provided by the FASTA
programs when only two sequences are compared, and by the
BLAST programs in bl2seq mode (e.g., blastp -query seq1
-subject seq2), are based on a database size of one sequence, rather
than the size of the database that was initially searched to
identify the candidate homolog. Since the E()-value is the product
of the pairwise alignment score probability and the database size
($E = p \times D$), the two-sequence expectation values will appear
10,000 to 10,000,000 times more significant than those calculated
in the original search, depending on the original database size.
For the FASTA programs, the expectation value can be adjusted
with the -Z dbentry option, e.g., ssearch -Z 500000 seq1
seq2 would increase the expectation value 500,000-fold, to reflect
the fact that seq2 was originally found in a search of UniProtKB/
Swiss-Prot (which contains about 500,000 entries). Without
this correction, an alignment found in a search of the refseq_protein
database (13 million entries) with an expectation of
10 would have an apparent expectation of
$p = 10/13{,}000{,}000 \approx 7.7 \times 10^{-7}$ (7.7E-7),
which is incorrect, because it ignores the size of the database
where the “similarity” was originally found. The expectation values
produced by blastp in two-sequence mode must be corrected for
the original database size as well.
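The arithmetic behind this correction can be written out in a few lines of Python. This is only a sketch of the relation E = p × D described above; the function names are hypothetical, and the database sizes are the approximate figures quoted in the text.

```python
def pairwise_probability(e_value: float, db_size: int) -> float:
    """Recover the pairwise term p from E = p * D."""
    return e_value / db_size


def rescale_expectation(two_sequence_e: float, original_db_size: int) -> float:
    """Scale a two-sequence E()-value (D = 1) up to the database originally searched."""
    return two_sequence_e * original_db_size


REFSEQ_PROTEIN_ENTRIES = 13_000_000  # approximate size quoted in the text
SWISSPROT_ENTRIES = 500_000          # approximate size quoted in the text

# An alignment with E() = 10 in the refseq_protein search looks far more
# significant when the same pair is compared on its own (D = 1).
apparent = pairwise_probability(10.0, REFSEQ_PROTEIN_ENTRIES)
print(f"{apparent:.1e}")  # 7.7e-07

# Multiplying by the original database size restores the search-level value.
corrected = rescale_expectation(apparent, REFSEQ_PROTEIN_ENTRIES)
print(round(corrected, 6))
```

The same multiplication is what the -Z option applies for the FASTA programs: a two-sequence value is scaled by the number of entries in the database that was originally searched, for example SWISSPROT_ENTRIES for UniProtKB/Swiss-Prot.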
In addition to the automatic shuffled statistical estimates that
are produced whenever two sequences are compared by the FASTA
programs, fasta, fastx, and ssearch can report a second expectation
value with the -z 21 command line option. With -z 21, two
expectation values are reported: (1) the standard E() value,
calculated from the distribution of similarity scores obtained
in the search, and (2) a second E2() value, calculated by shuffling the
high-scoring sequences found in the initial search. For average
composition proteins, the E() and E2() values will be very similar,
but for biased composition proteins, the E2() value will be more
conservative. The E2() value can be helpful in translated searches,
where out-of-frame translations can produce biased-composition,
low-complexity regions.
The BLAST programs do not explicitly provide statistical
estimates based on shuffled sequences, but it is possible to confirm
the accuracy of BLAST statistical estimates by looking for the highest
scoring (lowest expect) unrelated sequence in the list of
high-scoring sequences. If the statistical estimates are accurate,
the highest scoring unrelated sequence should have an expect
value of about 1. Since lack of significant similarity cannot be used to