maximum sensitivity is not the first priority. Scoring matrices and
gap penalties have an implicit target evolutionary distance; the
BLOSUM62 -11/-1 parameters used by blastp target alignments that are
about 30 % identical; the fasta/ssearch scoring parameters
(BLOSUM50, -12/-2) target 25 % identical alignments (Table 7). But the
scoring parameters that work best for distant relationships require
long alignments [20]; to provide 50 bits of statistical significance,
BLOSUM62 -11/-1 must align >100 amino acids, and BLOSUM50 -12/-2 must
align >200 amino-acid residues (Table 7). BLOSUM62 and BLOSUM50 work
well for full-length proteins or protein domains that are longer than
100-200 residues, but searches with short- (<150 nt) or even
medium- (300-400 nt) read lengths produce translated protein
sequences between 50 and 133 amino acids long and require shallower
scoring matrices.
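The length argument above can be checked with a few lines of arithmetic. The sketch below is only illustrative: it assumes that a read translates in a single frame over its whole length, that the entire translated read aligns, and that the rounded thresholds quoted in the text (>100 residues for BLOSUM62, >200 for BLOSUM50) are the cutoffs for a 50-bit score. The function names and the 900 nt example are hypothetical.

```python
# Minimum aligned lengths (amino acids) quoted in the text for a 50-bit score.
# These are the rounded values from the paragraph above, not exact statistics.
MIN_ALIGNED_AA_FOR_50_BITS = {"BLOSUM62": 100, "BLOSUM50": 200}


def translated_length(read_nt):
    """Longest protein (in residues) that a read of read_nt nucleotides can encode."""
    return read_nt // 3


def matrices_with_enough_length(read_nt):
    """Matrices whose 50-bit length requirement the translated read can exceed."""
    aa = translated_length(read_nt)
    return [name for name, min_aa in MIN_ALIGNED_AA_FOR_50_BITS.items() if aa > min_aa]


for read_nt in (150, 300, 900):
    print(read_nt, translated_length(read_nt), matrices_with_enough_length(read_nt))
```

Under these assumptions, a 150 nt or 300 nt read cannot exceed either threshold even when it aligns end to end, which is why the shallower matrices are needed for such queries.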
Shallow matrices for short sequences. Shallower scoring matrices
allow short query sequences to produce significant similarity scores
(bit scores). For searches with shorter query sequences, blastp
provides the -task blastp-short option for query sequences shorter
than 30 amino acids, which shifts the scoring matrix to
-matrix PAM30. blastn provides the -task blastn-short option, which
strengthens the mismatch penalty from +1/-2 to +1/-3 and shifts the
target percent identity from 90 % to more than 99 %. blastx does not
have a -task blastx-short option, but -matrix PAM30 has a similar
effect, and dramatically improves the expectation values in searches
with query sequences shorter than 100 nt. Investigators performing
large-scale blastx searches with datasets that include shorter DNA
queries should sort their query sequences by length, and then use
-matrix PAM30 for query sequences shorter than about 120 nt,
-matrix PAM70 for queries from 120 to 300 nt long, and the default
-matrix BLOSUM62 for queries longer than 300 nt.
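The length-based choice of blastx matrix described above could be sketched as follows. This is a minimal illustration, not a tested pipeline: the function names, the toy reads, the output file names, and the database name protein_db are hypothetical, and the handling of queries of exactly 120 or 300 nt is an assumption (the text says only "about 120 nt"). The command is assembled as an argument list but not executed.

```python
from collections import defaultdict


def blastx_matrix_for_length(query_nt):
    """Return the blastx -matrix value suggested in the text for a query length in nt."""
    if query_nt < 120:       # shorter than about 120 nt
        return "PAM30"
    if query_nt <= 300:      # from 120 to 300 nt
        return "PAM70"
    return "BLOSUM62"        # longer than 300 nt (the blastx default)


def bin_queries_by_matrix(queries):
    """Group (name, sequence) pairs by suggested matrix, shortest query first."""
    bins = defaultdict(list)
    for name, seq in sorted(queries, key=lambda item: len(item[1])):
        bins[blastx_matrix_for_length(len(seq))].append(name)
    return dict(bins)


def blastx_command(matrix, query_fasta, db):
    """Build (but do not run) a blastx argument list for one length bin."""
    return ["blastx", "-matrix", matrix, "-query", query_fasta, "-db", db]


# Toy reads of 400, 80, and 200 nt (hypothetical data).
reads = [("read_a", "ACGT" * 100), ("read_b", "ACGT" * 20), ("read_c", "ACGT" * 50)]
bins = bin_queries_by_matrix(reads)
for matrix, names in bins.items():
    # In practice each bin would first be written to its own FASTA file.
    print(matrix, names, " ".join(blastx_command(matrix, f"queries_{matrix}.fa", "protein_db")))
```

Splitting the queries into one file per matrix keeps each blastx run on a single set of scoring parameters, so expectation values within a run are comparable.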
The FASTA programs provide a more finely graded set of scoring
matrices, and the programs can automatically adjust the scoring
matrix based on the length of the query sequence. FASTA scoring
matrices are set using the -s matrix-name option, where matrix-name
can be one of sixteen matrices; scoring matrices include BLOSUM50,
BLOSUM62 [14] and VT10 ... VT200 [21]. In addition, the FASTA
programs can use scoring matrix values provided in a file, so any
scoring matrix can be used. To accommodate searches with different
query lengths, the FASTA programs offer a variable scoring matrix
option; the -s ?BP62 option indicates that the BLOSUM62 matrix, with
the -11/-1 gap penalties used by blastp, should be used for long
queries, but the "?" indicates that the scoring matrix should be
adjusted to ensure that the query can produce a 40-bit score against
an average length protein sequence. When a short sequence is
encountered,