Percent identity: While E()-values provide the most direct estimate
of statistical significance, and bit scores provide a
database-independent measure of alignment strength, investigators often
use percent identity to describe the likelihood that two sequences
are homologous. In general, if two sequences are 30% identical
across their entire length, they can reliably be inferred to be
homologous. This "rule of thumb" correctly identifies homologs, but it
misses large numbers of clearly homologous proteins.
Many alignments with E()-values < 10⁻⁶ and bit scores greater
than 60 will be less than 30% identical. For example, in a
comparison of E. coli proteins to human proteins, there are 10,417
human:E. coli homologs with E() < 10⁻⁶, but only 36% of these are
< 30% identical. Percent identity is far less sensitive than
expectation values and bit scores because it cannot distinguish between
common and rare identities, and it does not count conservative
amino-acid replacements. Percent identities can give a useful
measure of evolutionary distance (for example, on average mammalian
orthologs are about 80% identical), but the 30% identity
threshold excludes large numbers of homologs that are readily identified
with E()-values and bit scores.
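The point that percent identity ignores which residues match, and ignores conservative replacements, can be illustrated with a small sketch. The function names and the handful of substitution scores below are assumptions made for illustration (toy values loosely modelled on a BLOSUM-style matrix, not a complete or authoritative table), and gapped columns are simply skipped rather than penalized.

```python
# Toy substitution scores for a few residue pairs (illustrative values only,
# not a full substitution matrix). Keys are alphabetically sorted pairs.
TOY_SCORES = {
    ("W", "W"): 11,  # rare identity, scored highly
    ("C", "C"): 9,   # rare identity, scored highly
    ("A", "A"): 4,   # common identity, scored modestly
    ("I", "V"): 3,   # conservative replacement
    ("K", "R"): 2,   # conservative replacement
    ("L", "W"): -2,  # non-conservative replacement
    ("D", "G"): -1,  # non-conservative replacement
}


def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of ungapped aligned columns that hold identical residues."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if a != "-" and b != "-"]
    if not columns:
        return 0.0
    identical = sum(1 for a, b in columns if a == b)
    return 100.0 * identical / len(columns)


def toy_similarity(aligned_a: str, aligned_b: str) -> int:
    """Sum of toy substitution scores over ungapped columns (no gap penalty)."""
    total = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue
        total += TOY_SCORES[tuple(sorted((a, b)))]
    return total


# Two alignments with the same percent identity but different similarity:
rare_and_conservative = ("WCIK", "WCVR")    # W, C identical; I/V, K/R conservative
common_and_divergent = ("AALD", "AAWG")     # A, A identical; L/W, D/G divergent

for pair in (rare_and_conservative, common_and_divergent):
    print(pair, percent_identity(*pair), toy_similarity(*pair))
```

Both toy alignments have the same fraction of identical columns, yet a substitution-matrix score separates them, which is the information that E()-values and bit scores build on and percent identity discards.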
3.2 Confirming Statistical Significance
Ideally, if E()-values are accurate, then one can have confidence
that sequences sharing a similarity score expected one time in 1,000
by chance are almost certainly homologous. Unfortunately,
low-complexity regions, biased amino-acid composition, and
unusual sequence lengths can violate statistical assumptions
about protein sequences, resulting in low expectation values for
unrelated sequences. When the FASTP program was introduced
in 1985 [15], it included a program that estimated the statistical
significance of a similarity score by shuffling one of the two aligned
sequences and recording the number of standard deviations
separating the original unshuffled alignment score from the mean of the
shuffled sequence alignment scores. The guidelines for inferring
homology in that paper did not account for database size, but the
shuffling strategy is still available in the FASTA programs to
evaluate the statistical significance of an alignment score. When any of
the FASTA programs are used to compare two sequences, the
statistical significance of the unshuffled alignment is estimated by
shuffling the second sequence and applying the appropriate
statistical distribution. The -k shuffle-count command line option
sets the number of shuffles performed (-k 250 by default). The
-v shuffle-window-size option performs local window shuffles; -v 20
produces each shuffled sequence by shuffling residues 1-20, 21-40,
etc. Window-shuffled sequences produced with -v preserve local
composition biases in the shuffled proteins, e.g., transmembrane
domain regions.
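The shuffling strategy described above can be sketched as follows. This is a minimal illustration of the 1985-style measure (standard deviations above the mean shuffled score), not the FASTA implementation: the scoring function is a toy Smith-Waterman with match/mismatch values and a linear gap penalty chosen for the example, and all function and parameter names are hypothetical. The current FASTA programs are described as fitting a statistical distribution to the shuffled scores, which this sketch does not attempt.

```python
import random
import statistics


def local_score(a: str, b: str, match: int = 2, mismatch: int = -1,
                gap: int = -2) -> int:
    """Best local alignment score (Smith-Waterman, linear gaps, toy scoring)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            cell = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            cur.append(cell)
            best = max(best, cell)
        prev = cur
    return best


def uniform_shuffle(seq: str, rng: random.Random) -> str:
    """Shuffle all residues of the sequence together."""
    residues = list(seq)
    rng.shuffle(residues)
    return "".join(residues)


def window_shuffle(seq: str, window: int, rng: random.Random) -> str:
    """Shuffle residues within consecutive windows (1-20, 21-40, ... for
    window=20), so each window keeps its own composition."""
    out = []
    for start in range(0, len(seq), window):
        chunk = list(seq[start:start + window])
        rng.shuffle(chunk)
        out.extend(chunk)
    return "".join(out)


def shuffle_z_score(query: str, subject: str, shuffles: int = 250,
                    window=None, seed: int = 0) -> float:
    """Standard deviations separating the unshuffled score from the mean
    score against shuffled copies of the second sequence."""
    rng = random.Random(seed)
    observed = local_score(query, subject)
    scores = []
    for _ in range(shuffles):
        if window:
            shuffled = window_shuffle(subject, window, rng)
        else:
            shuffled = uniform_shuffle(subject, rng)
        scores.append(local_score(query, shuffled))
    mean = statistics.mean(scores)
    sd = statistics.pstdev(scores)
    if sd == 0:
        # Degenerate case: every shuffle gave the same score.
        return float("inf") if observed > mean else 0.0
    return (observed - mean) / sd
```

Calling `shuffle_z_score(query, subject, window=20)` mimics the window-shuffle behaviour: because residues never leave their 20-residue window, a locally biased region (for example a hydrophobic stretch) stays biased in every shuffled copy, so its contribution to the score is reflected in the background rather than in the reported significance.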
In pairwise comparisons involving a protein and translated-
DNA sequence, e.g., fastx or tfastx, fastx will provide more