BLAST and FASTA Similarity Searching for Multiple Sequence Alignment - Multiple Sequence Alignment Methods


maximum sensitivity is not the first priority. Scoring matrices and
gap penalties have an implicit target evolutionary distance: the
BLOSUM62 −11/−1 parameters used by blastp target alignments that are
about 30% identical; the fasta/ssearch scoring parameters
(BLOSUM50, −12/−2) target 25% identical alignments (Table 7). But the
scoring parameters that work best for distant relationships require
long alignments [20]; to provide 50 bits of statistical significance,
BLOSUM62 −11/−1 must align >100 amino acids, and BLOSUM50 −12/−2 must
align >200 amino-acid residues (Table 7). BLOSUM62 and BLOSUM50 work
well for full-length proteins or protein domains that are longer than
100-200 residues, but searches with short (<150 nt) or even medium
(300-400 nt) read lengths produce translated protein sequences between
50 and 133 amino acids long and require shallower scoring matrices.
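The length arithmetic above can be checked with a short Python sketch. It is only an illustration: the function names are hypothetical, and the two thresholds are the approximate ">100" and ">200" residue figures quoted in the text for a 50-bit score, not values taken from Table 7.

```python
# Approximate alignment lengths (in residues) that the text says must be
# exceeded to reach a 50-bit score. These are the figures quoted in the
# paragraph above; the dictionary name is hypothetical.
MIN_ALIGNED_RESIDUES_FOR_50_BITS = {
    "BLOSUM62": 100,  # with -11/-1 gap penalties
    "BLOSUM50": 200,  # with -12/-2 gap penalties
}


def translated_length(read_length_nt):
    """Longest protein (in residues) a DNA read can encode in one frame."""
    return read_length_nt // 3


def matrices_too_deep(read_length_nt):
    """Return the matrices whose 50-bit alignment length the read cannot exceed."""
    residues = translated_length(read_length_nt)
    return [
        name
        for name, min_residues in MIN_ALIGNED_RESIDUES_FOR_50_BITS.items()
        if residues <= min_residues
    ]


if __name__ == "__main__":
    for read_length in (150, 300, 400):
        print(read_length, translated_length(read_length), matrices_too_deep(read_length))
```

Under these assumptions a 150 nt read translates to at most 50 residues, too short for either matrix to reach 50 bits, which is the motivation for the shallower matrices discussed next.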

Shallow matrices for short sequences: Shallower scoring matrices
allow short query sequences to produce significant similarity scores
(bit scores). For searches with shorter query sequences, blastp
provides the -task blastp-short option for query sequences shorter
than 30 amino acids, which shifts the scoring matrix to
-matrix PAM30. blastn provides the -task blastn-short option, which
strengthens the mismatch penalty from +1/−2 to +1/−3 and shifts the
target percent identity from 90% to more than 99%. blastx does not
have a -task blastx-short option, but -matrix PAM30 has a similar
effect, and dramatically improves the expectation values in searches
with query sequences shorter than 100 nt. Investigators performing
large-scale blastx searches with datasets that include shorter DNA
queries should sort their query sequences by length, and then use
-matrix PAM30 for query sequences shorter than about 120 nt,
-matrix PAM70 for queries from 120 to 300 nt long, and the default
-matrix BLOSUM62 for queries longer than 300 nt.
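The length-based choice of blastx matrix described above could be sketched in Python as follows. This is a minimal illustration, not part of BLAST: the function names, the example query names, and the file and database names are hypothetical, and the "about 120 nt" cutoff is treated as an exact boundary for simplicity. The command is only assembled as an argument list; it is not executed.

```python
def blastx_matrix_for(query_length_nt):
    """Pick a -matrix value from the query length, using the cutoffs in the text."""
    if query_length_nt < 120:
        return "PAM30"
    if query_length_nt <= 300:
        return "PAM70"
    return "BLOSUM62"


def group_queries_by_matrix(queries):
    """Sort queries by length and bin their names by the matrix to search with.

    `queries` maps a query name to its DNA sequence.
    """
    groups = {}
    for name, sequence in sorted(queries.items(), key=lambda item: len(item[1])):
        groups.setdefault(blastx_matrix_for(len(sequence)), []).append(name)
    return groups


def blastx_command(query_file, database, matrix):
    """Build (but do not run) a blastx argument list for one length bin."""
    return ["blastx", "-query", query_file, "-db", database, "-matrix", matrix]


if __name__ == "__main__":
    # Hypothetical reads of 90, 200, and 500 nt.
    reads = {"read_a": "A" * 90, "read_b": "A" * 200, "read_c": "A" * 500}
    for matrix, names in group_queries_by_matrix(reads).items():
        print(matrix, names)
    print(" ".join(blastx_command("short_reads.fa", "my_protein_db", "PAM30")))
```

Writing each bin to its own query file keeps one matrix per blastx run, since the matrix is set once per search rather than per query.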

The FASTA programs provide a more finely graded set of scoring
matrices, and the programs can automatically adjust the scoring
matrix based on the length of the query sequence. FASTA scoring
matrices are set using the -s matrix-name option, where matrix-name
can be one of sixteen matrices; scoring matrices include BLOSUM50,
BLOSUM62 [14] and VT10 ... VT200 [21]. In addition, the FASTA
programs can use scoring matrix values provided in a file, so any
scoring matrix can be used. To accommodate searches with different
query lengths, the FASTA programs offer a variable scoring matrix
option; the -s ?BP62 option indicates that the BLOSUM62 matrix, with
the −11/−1 gap penalties used by blastp, should be used for long
queries, but the “?” indicates that the scoring matrix should be
adjusted to ensure that the query can produce a 40-bit score against
an average-length protein sequence. When a short sequence is encountered,
