Expectation values —The E-value of an alignment reflects both
the alignment score and the size of the database from which the
alignment was identified. Thus, FASTA E()-values are reported in
the context of a number; E(100000) = 1E-6 indicates that this
alignment would occur one time in a million searches of a database
of size 100,000 (footnote 5). For example, an alignment of two
400-residue proteins with a 40-bit alignment score would have an
E(4,000)-value of

$$E = m\,n\,2^{-\mathrm{bits}}\,D = 400 \times 400 \times 2^{-40} \times 4{,}000 \approx 0.0006$$

(m, n are the lengths of the query and library sequence; D is the
database size); the same 40-bit alignment found in the RefSeq
database (about 13 million entries) would have E(13,000,000) = 1.9,
a value about 3,000 times greater (and no longer statistically
significant, Subheading 4.1).
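The arithmetic in this example can be checked with a few lines of Python. This is a minimal sketch of the formula quoted in the text, E = m n 2^(-bits) D; the function name `expect_value` and its parameter names are hypothetical and are not taken from FASTA or BLAST.

```python
def expect_value(bit_score: float, query_len: int, library_len: int,
                 db_entries: int) -> float:
    """Return E = m * n * 2**(-bits) * D, the formula quoted in the text.

    query_len (m) and library_len (n) are sequence lengths in residues;
    db_entries (D) is the number of sequences in the database.
    """
    return query_len * library_len * 2.0 ** (-bit_score) * db_entries


# The worked example from the text: two 400-residue proteins, 40 bits.
small = expect_value(40, 400, 400, 4_000)
refseq = expect_value(40, 400, 400, 13_000_000)

print(f"E(4,000)      = {small:.2g}")
print(f"E(13,000,000) = {refseq:.2g}")
print(f"ratio         = {refseq / small:.0f}")
```

The two results round to the 0.0006 and 1.9 given in the text, and their ratio is simply the ratio of the two database sizes, because the score and the sequence lengths are unchanged.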
The E()-value or Expect reports the statistical significance of
an alignment score in the context of a database search. E()-values
between 0.001 and 0.01 are widely used as a threshold for inferring
homology (psiblast uses 0.005 as its default for including a new
sequence into the PSSM profile). An E()-value of 0.001 implies that
the alignment score would happen only once in 1,000 searches
by chance. However, in metagenomics and other large-scale
analyses, millions of similarity searches may be run, so an alignment
with an E-value of 0.001 would be expected to occur
$1{,}000{,}000 \times 0.001 = 1{,}000$ times by chance. Thus, for
large-scale analyses, much more conservative statistical thresholds
are often applied: $10^{-6}$ to $10^{-10}$ or even lower. While very
strict ($10^{-10}$) thresholds for large-scale searches can
dramatically reduce the number of false-positive assignments, such
conservative significance thresholds increase the number of false
negatives. Many distantly related clear homologs will have
E()-values between $10^{-10}$ and $10^{-3}$.
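The multiple-search argument above can be sketched as follows. The first function reproduces the multiplication in the text. The second is an assumption added for illustration: it divides a tolerated number of chance hits by the number of searches, which is one simple way to arrive at thresholds in the range the text mentions. Both function names are hypothetical.

```python
def expected_chance_hits(num_searches: int, e_threshold: float) -> float:
    """Expected number of chance alignments at or below the E()-value
    threshold when num_searches independent searches are run."""
    return num_searches * e_threshold


def per_search_threshold(num_searches: int,
                         max_chance_hits: float = 1.0) -> float:
    """Illustrative assumption (not from the text): the per-search
    E()-value threshold that keeps the expected number of chance hits
    across all searches at or below max_chance_hits."""
    return max_chance_hits / num_searches


hits = expected_chance_hits(1_000_000, 0.001)
strict = per_search_threshold(1_000_000)

print(f"chance hits at E <= 0.001 over 1,000,000 searches: {hits:.0f}")
print(f"threshold for at most one chance hit overall: {strict:.0e}")
```

Under this assumption, a million searches call for a per-search threshold near the upper end of the range quoted in the text; stricter thresholds lower the expected number of chance hits further, at the cost of more false negatives.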
Bit scores —Because an E ()-value is database-size (search-size)
dependent, many investigators record bit scores, rather than
E()-values, in large-scale analyses. The formula for converting bit
scores to E()-values is shown above, but, as a rule of thumb,
alignments with scores <40 bits are never statistically significant;
scores between 40 and 50 bits are only significant in relatively
small databases; and scores >50 bits will be significant in databases
as large as 10,000,000 entries. A one-bit change in bit score
corresponds to a twofold change in statistical significance, so a
10-bit increase in score improves the statistical significance about
1,000-fold ($2^{10} = 1{,}024$). At the NCBI web site BLAST summary,
alignments with scores from 50 to 80 bits are plotted as green bars;
these alignments will be statistically significant in almost any
database context (likewise, alignments <40 bits are plotted in black;
they are never significant).
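The rule of thumb can be related to the formula by solving E = m n 2^(-bits) D for the bit score. The sketch below is illustrative only: it assumes the 400-residue sequence lengths from the earlier example and an E()-value threshold of 0.001, and the function names are hypothetical.

```python
import math


def bits_for_threshold(e_threshold: float, query_len: int,
                       library_len: int, db_entries: int) -> float:
    """Bit score at which E = m * n * 2**(-bits) * D equals e_threshold,
    obtained by solving that formula for bits."""
    return math.log2(query_len * library_len * db_entries / e_threshold)


def fold_change(delta_bits: float) -> float:
    """Factor by which the E()-value changes for a given bit-score
    difference; each bit is a factor of two."""
    return 2.0 ** delta_bits


# Assumed inputs: 400-residue query and library sequences, E <= 0.001.
small_db_bits = bits_for_threshold(0.001, 400, 400, 4_000)
large_db_bits = bits_for_threshold(0.001, 400, 400, 10_000_000)

print(f"bits needed, 4,000 entries:      {small_db_bits:.1f}")
print(f"bits needed, 10,000,000 entries: {large_db_bits:.1f}")
print(f"fold change for 10 bits:         {fold_change(10):.0f}")
```

With these inputs the required score is a little under 40 bits for the 4,000-entry database and a little over 50 bits for the 10,000,000-entry database, which agrees with the 40 to 50 bit range described above.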
5 The BLAST programs use a slightly different formulation of the Expect value; rather than using the number of
entries in the database, BLAST uses the combined length of all the sequences in the database. For average length
proteins, the result of the two calculations is identical.