BLAST and FASTA Similarity Searching for Multiple Sequence Alignment - Multiple Sequence Alignment Methods


accurate shuffled statistical estimates. All the FASTA programs shuffle the second (library) sequence to produce statistical estimates, and shuffles of protein sequences, which are produced by fastx, more accurately reflect the distribution of unrelated sequence scores. Randomly shuffled DNA sequences are a less accurate model of unrelated DNA sequences.
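
As a rough illustration of what a uniform shuffle does, the sketch below permutes the residues of a library sequence so that its length and composition are preserved while the residue order is destroyed. This is not the FASTA implementation; the function name shuffle_sequence and the example peptide are made up for illustration.

```python
import random
from collections import Counter


def shuffle_sequence(seq, rng=None):
    """Return a randomly permuted copy of seq (same length and composition)."""
    rng = rng or random.Random()
    residues = list(seq)
    rng.shuffle(residues)  # random in-place permutation of the residues
    return "".join(residues)


# Hypothetical library sequence; any protein string would do.
library_seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
shuffled = shuffle_sequence(library_seq, random.Random(0))

# Composition is unchanged; only the residue order differs.
print(Counter(shuffled) == Counter(library_seq))
```

Scoring many such shuffles against the query gives a sample of scores for sequences that are unrelated by construction, which is the idea behind a shuffle-based estimate.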

By default, the expectation values provided by the FASTA programs when only two sequences are compared, and by the BLAST programs in bl2seq mode (e.g., blastp -query seq1 -subject seq2), are based on a database size of one sequence, rather than the size of the database that was initially searched to identify the candidate homolog. Since the E()-value is the product of the pairwise alignment score probability p and the database size D ($E = p \times D$), the two-sequence expectation values will be 10,000 to 10,000,000 times more significant than those calculated in the original search, depending on the original database size.

For the FASTA programs, the expectation value can be adjusted with the -Z dbentry option, e.g., ssearch -Z 500000 seq1 seq2 would increase the expectation value 500,000-fold, to reflect the fact that seq2 was originally found in a search of UniProtKB/Swiss-Prot (which contains about 500,000 entries). Without this correction, an alignment found in a search of the refseq_protein database (13 million entries) with an expectation of 10 would have an apparent expectation of $p = 10/13{,}000{,}000 \approx 7 \times 10^{-7}$, which is incorrect, because it ignores the size of the database where the “similarity” was originally found. The expectation values produced by blastp in two-sequence mode must be corrected for the original database size as well.
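
The database-size correction described above is simple arithmetic. The following is a minimal sketch, assuming only the relationship stated in the text (expectation = pairwise probability × database size); the function names are illustrative and are not part of FASTA or BLAST.

```python
def rescale_evalue(pairwise_evalue, db_entries):
    """Scale a two-sequence E()-value (database size 1) up to db_entries."""
    if db_entries < 1:
        raise ValueError("db_entries must be at least 1")
    return pairwise_evalue * db_entries


def apparent_pairwise_evalue(search_evalue, db_entries):
    """Inverse: the E()-value a search hit appears to have with database size 1."""
    if db_entries < 1:
        raise ValueError("db_entries must be at least 1")
    return search_evalue / db_entries


# Figures from the text: E() = 10 in refseq_protein (about 13 million entries).
apparent = apparent_pairwise_evalue(10, 13_000_000)
print(f"{apparent:.1e}")  # 7.7e-07

# Multiplying by the original database size recovers the search-level value.
print(rescale_evalue(apparent, 13_000_000))
```

The same multiplication applies to a two-sequence blastp result: the reported value is multiplied by the number of entries in the database where the candidate was first found.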

In addition to the automatic shuffled statistical estimates that are produced whenever two sequences are compared by the FASTA programs, fasta, fastx, and ssearch can display two expectation values using the -z 21 command line option. When -z 21 is used, two expectation values are reported: (1) the standard expect value calculated from the distribution of similarity scores calculated in the search and (2) a second E2() value calculated by shuffling the high-scoring sequences found in the initial search. For average composition proteins, the E() and E2() values will be very similar, but for biased composition proteins, the E2() value will be more conservative. The E2() value can be helpful in translated searches, where out-of-frame translations can produce biased composition low-complexity regions.

The BLAST programs do not explicitly provide statistical estimates based on shuffled sequences, but it is possible to confirm the accuracy of BLAST statistical estimates by looking for the highest scoring (lowest expect) unrelated sequence in the list of high-scoring sequences. If the statistical estimates are accurate, the highest scoring unrelated sequence should have an expect value of about 1. Since lack of significant similarity cannot be used to
