maximum sensitivity is not the first priority. Scoring matrices and
gap penalties have an implicit target evolutionary distance; the
BLOSUM62 -11/-1 parameters used by blastp target alignments
that are about 30% identical; the fasta/ssearch scoring parameters
(BLOSUM50, -12/-2) target 25% identical alignments
(Table 7). But the scoring parameters that work best for distant
relationships require long alignments [20]; to provide 50 bits
of statistical significance, BLOSUM62 -11/-1 must align >100
amino-acids, and BLOSUM50 -12/-2 must align >200 amino-acid
residues (Table 7). BLOSUM62 and BLOSUM50 work well
for full-length proteins or protein domains that are longer than
100-200 residues, but searches with short- (<150 nt) or even
medium- (300-400 nt) read lengths produce translated protein
sequences between 50 and 133 amino-acids long and require
shallower scoring matrices.
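
The arithmetic behind these figures can be sketched in a few lines of Python. This is only an illustration built from the numbers quoted in the text (50 bits, more than 100 or 200 aligned residues, 150 to 400 nt reads); it assumes that the whole translated read is aligned and that the average information per position is simply the target bit score divided by the aligned length. The function names are hypothetical.

```python
TARGET_BITS = 50.0  # significance level quoted in the text


def translated_length(read_nt):
    """Number of complete codons (amino acids) encoded by a read."""
    return read_nt // 3


def bits_per_position_needed(aligned_residues, target_bits=TARGET_BITS):
    """Average bits each aligned position must contribute to reach target_bits."""
    return target_bits / aligned_residues


# Read lengths mentioned in the text.
for read_nt in (150, 300, 400):
    aa = translated_length(read_nt)
    need = bits_per_position_needed(aa)
    print(f"{read_nt} nt -> {aa} aa, needs {need:.2f} bits/position")

# Minimum alignment lengths quoted for 50 bits; because the text says
# "more than", these quotients are upper bounds on what each matrix delivers.
for name, min_len in (("BLOSUM62 -11/-1", 100), ("BLOSUM50 -12/-2", 200)):
    bound = bits_per_position_needed(min_len)
    print(f"{name}: at most {bound:.2f} bits/position")
```

Under these assumptions a 150 nt read translates to 50 residues and needs about 1 bit per position, roughly twice the upper bound implied for BLOSUM62, which is why a shallower matrix is required.
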
Shallow matrices for short sequences —Shallower scoring matrices
allow short query sequences to produce significant similarity scores
(bit scores). For searches with shorter query sequences, blastp
provides the -task blastp-short option for query sequences
shorter than 30 amino-acids, which shifts the scoring matrix to
-matrix PAM30. blastn provides the -task blastn-short
option, which strengthens the mismatch penalty from +1/-2 to
+1/-3 and shifts the target percent identity from 90% to more
than 99%. blastx does not have a -task blastx-short option,
but -matrix PAM30 has a similar effect, and dramatically improves
the expectation values in searches with query sequences shorter
than 100 nt. Investigators performing large-scale blastx searches
with datasets that include shorter DNA queries should sort their
query sequences by length, and then use -matrix PAM30 for query
sequences shorter than about 120 nt, -matrix PAM70 for queries
from 120 to 300 nt long, and the default -matrix BLOSUM62 for
queries longer than 300 nt.
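
The length-based choice of blastx matrix described above could be scripted along the following lines. This is a minimal sketch, not part of the BLAST+ distribution: the function names, the toy reads, and the file and database names are all hypothetical, and the cut-offs (120 and 300 nt) are the approximate values given in the text. Only the blastx options -query, -db and -matrix are used.

```python
from collections import defaultdict


def blastx_matrix_for(query_nt):
    """Scoring matrix suggested in the text for a DNA query of this length."""
    if query_nt < 120:
        return "PAM30"
    if query_nt <= 300:
        return "PAM70"
    return "BLOSUM62"


def group_by_matrix(queries):
    """Sort (name, sequence) pairs by length and group names by chosen matrix."""
    groups = defaultdict(list)
    for name, seq in sorted(queries, key=lambda q: len(q[1])):
        groups[blastx_matrix_for(len(seq))].append(name)
    return dict(groups)


def blastx_command(query_file, db, matrix):
    """Argument list for one blastx run (file and database names are placeholders)."""
    return ["blastx", "-query", query_file, "-db", db, "-matrix", matrix]


# Toy reads of 80, 200 and 400 nt (hypothetical data).
reads = [("r1", "ACGT" * 20), ("r2", "ACGT" * 50), ("r3", "ACGT" * 100)]
groups = group_by_matrix(reads)
for matrix, names in groups.items():
    print(matrix, names)
```

Each group would then be written to its own query file and searched separately, for example with the argument list returned by blastx_command("short_reads.fa", "swissprot", "PAM30"), where both names are placeholders.
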
The FASTA programs provide a more finely graded set of
scoring matrices, and the programs can automatically adjust the
scoring matrix based on the length of the query sequence.
FASTA scoring matrices are set using the -s matrix-name option,
where matrix-name can be one of the sixteen matrices; scoring
matrices include BLOSUM50, BLOSUM62 [14] and VT10 ... VT200
[21]. In addition, the FASTA programs can use scoring
matrix values provided in a file, so any scoring matrix can be used.
To accommodate searches with different query lengths, the FASTA
programs offer a variable scoring matrix option; the -s ?BP62
option indicates that the BLOSUM62 matrix, with the -11/-1
gap penalties used by blastp, should be used for long queries, but the “?”
indicates that the scoring matrix should be adjusted to ensure
that the query can produce a 40-bit score against an average
length protein sequence. When a short sequence is encountered,