Information Technology Reference
In-Depth Information
trimming. We will use HSPs as an example, noting that gapped BLAST, FASTA and
Smith/Waterman scores follow similar statistics. The random emergence of
HSPs was studied on random sequences in which the occurrence of amino acid residues is
independent, with specific background probabilities for the various residues. For two
sufficiently long sequences (of lengths m and n), the expected number of HSPs with score
at least S is given by the formula
$E = K m n \, e^{-\lambda S}$   [1]
where K and λ are constants that can be regarded as natural scales for the search
space of size m × n and for the scoring system, respectively. The raw score S is defined
by a formula given in figure x. The number of random HSPs with score ≥ S is described by
a Poisson distribution and the probability of finding at least one such HSP is
$P = 1 - e^{-E}$   [2]
P is the statistical significance, that is, the probability of finding a score of S or
higher by chance. It is important to note that these simple statistics are also
approximately valid for gapped alignments used by modern alignment programs, which makes
it possible to give a more objective, probabilistic interpretation to similarity scores.
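As a rough illustration of equations [1] and [2], the following Python sketch computes the expected number of chance HSPs and the corresponding probability for a few raw scores. The function names and the numerical values of K, λ, m and n are assumptions chosen only for demonstration; real values of K and λ depend on the scoring system and background residue frequencies.

```python
import math


def expected_hsps(score, m, n, K, lam):
    """Equation [1]: expected number of chance HSPs with score >= `score`."""
    return K * m * n * math.exp(-lam * score)


def p_value(e_value):
    """Equation [2]: probability of at least one chance HSP, P = 1 - exp(-E).

    -expm1(-E) equals 1 - exp(-E) but keeps precision when E is very small.
    """
    return -math.expm1(-e_value)


# Illustrative (assumed) parameters, not taken from any real scoring system.
K, lam = 0.1, 0.3
m, n = 250, 1_000_000  # query length and database length

for s in (40, 60, 80):
    e = expected_hsps(s, m, n, K, lam)
    print(f"S = {s}: E = {e:.3g}, P = {p_value(e):.3g}")
```

For small E the two quantities are nearly equal (P ≈ E), which is why the expected count E is often read directly as a measure of significance.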
Global alignments are found via an exhaustive search for the maximal matching
between two sequences, based on such methods as the Needleman-Wunsch algorithm [1].
Global alignment scores can be transformed to metric distance scores, which is important
for clustering. On the other hand, very little is known about the random distribution of
optimal global alignment scores, so a rigorous probabilistic interpretation is not possible in
this case. A practical approach is based on generating many random sequence pairs of the
appropriate length and composition, and calculating the optimal alignment score for each.
The average S_r and the standard deviation σ_r of the random scores can then be compared
with the original score S, and a Z-score
$Z = (S - S_r) / \sigma_r$   [3]
can be used as an approximate measure of significance. Although Z resembles
the Student t value, strictly speaking it cannot be converted into a P-value, since the
underlying distribution is not a normal distribution. Only an approximate interpretation is
thus possible: for example, if 100 random alignments all have scores lower than that of the
alignment of interest, the P-value in question is likely less than 0.01. It is important to
note that the meaning of this statistic is different from the one derived from a database of
random similarities (equation 16). Namely, for two sequences of similar but unusual amino
acid composition, the Z-score may be low, even if the two sequences compared are both
very different from the rest of the database.
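The shuffling procedure described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a production implementation: the scoring scheme (match +1, mismatch -1, linear gap penalty -2) is an assumption chosen for brevity, whereas real protein comparisons would use a substitution matrix and usually affine gap penalties. The function names are hypothetical. Random pairs are produced by shuffling each sequence, which preserves length and composition.

```python
import random
import statistics


def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score (Needleman-Wunsch, linear gap penalty)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * gap]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def shuffled(seq, rng):
    """Return a random permutation of seq (same length and composition)."""
    chars = list(seq)
    rng.shuffle(chars)
    return "".join(chars)


def z_score(a, b, n_random=200, seed=0):
    """Return (Z, fraction of random scores >= S) for the pair (a, b)."""
    rng = random.Random(seed)
    s = nw_score(a, b)
    random_scores = [
        nw_score(shuffled(a, rng), shuffled(b, rng)) for _ in range(n_random)
    ]
    s_r = statistics.mean(random_scores)
    sigma_r = statistics.stdev(random_scores)
    if sigma_r == 0:
        raise ValueError("random scores have zero variance; Z is undefined")
    frac = sum(r >= s for r in random_scores) / n_random
    return (s - s_r) / sigma_r, frac


z, frac = z_score("MKTAYIAKQRQISFVKSHFSRQ", "MKTAYIAKQRQISFVKSHFSRQ")
print(f"Z = {z:.2f}, fraction of random scores >= S: {frac:.3f}")
```

The second return value is the empirical fraction of random alignments that score at least as well as the original pair; with N shuffles, it can only resolve significance down to roughly 1/N, which matches the approximate interpretation given above.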
The general methods of sequence comparison can be used to divide the sequence
database into clusters. In principle, a metric distance measure (such as can be derived from
global alignment scores) is a prerequisite for statistical clustering. Given the large size of
databases, both global alignments and statistical clustering methods are compute-intensive.
On the other hand, the protein sequence space is sparsely populated and the existing natural
sequences form well-separated clusters, which makes it possible to use efficient,
approximate methods for clustering. Krause and Vingron used a threshold-based, iterative
procedure based on BLAST for identifying consistent protein clusters [10,11]. The result is an
objective picture of the sequence space in terms of similarities, but the clusters have to be