Comparison of sequences, protein 3D structures and genomes - Essays in Bioinformatics


trimming. We will use HSPs as an example, noting that gapped BLAST, FASTA and Smith-Waterman scores follow similar statistics. The random emergence of HSPs was studied on random sequences in which the occurrence of amino acid residues is independent, with specific background probabilities for the various residues. For two sufficiently long sequences (of lengths m and n), the expected number of HSPs with score at least S is given by the formula

[1]    $E = K m n \, e^{-\lambda S}$

where K and λ are constants that can be considered as natural scales for the search space of size m × n and for the scoring system, respectively. The raw score S is defined by a formula given in figure x. The number of random HSPs with score ≥ S is described by a Poisson distribution, and the probability of finding at least one such HSP is

[2]    $P = 1 - e^{-E}$

P is the statistical significance, i.e. the probability of finding a score of S or higher by chance. It is important to note that these simple statistics are also approximately valid for gapped alignments used by modern alignment programs, which makes it possible to give a more objective, probabilistic interpretation to similarity scores.
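As a minimal sketch of equations [1] and [2], the following Python snippet computes the expected number of HSPs and the corresponding P value. The function names and the numerical values of K, λ, m and n are illustrative assumptions, not values taken from the text; in practice K and λ depend on the scoring system and on the background residue frequencies.

```python
import math


def expected_hsps(score, m, n, K, lam):
    """Expected number of HSPs with raw score >= score (equation [1])."""
    return K * m * n * math.exp(-lam * score)


def p_value(e_value):
    """Probability of at least one such HSP (equation [2]): P = 1 - exp(-E).

    expm1 keeps precision when E is very small.
    """
    return -math.expm1(-e_value)


# Placeholder parameters chosen only for illustration (assumptions).
K, lam = 0.13, 0.32          # hypothetical statistical parameters
m, n = 250, 50_000_000       # hypothetical query and database lengths

for s in (60, 80, 100):
    e = expected_hsps(s, m, n, K, lam)
    print(f"S = {s:3d}  E = {e:.3g}  P = {p_value(e):.3g}")
```

For small E the two quantities nearly coincide, since 1 - e^{-E} is approximately E; they diverge only when E approaches or exceeds 1, where P saturates towards 1.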

Global alignments are found via an exhaustive search for the maximal matching

between two sequences, based on such methods as the Needleman-Wunsch algorithm [1].

Global alignment scores can be transformed to metric distance scores, which is important

for clustering. On the other hand, very little is known about the random distribution of

optimal global alignment scores, so a rigorous probabilistic interpretation is not possible in

this case. A practical approach is based on generating many random sequence pairs of the

appropriate length and composition, and calculating the optimal alignment score for each.

The average $\bar{S}_r$ and the standard deviation $\sigma_r$ of the random scores can then be compared with the original score S, and a Z score

[3]    $Z = \dfrac{S - \bar{S}_r}{\sigma_r}$

can be used as an approximate measure of significance. Although Z resembles the Student t value, strictly speaking it cannot be converted into a P value, since the underlying distribution is not a normal distribution. Only an approximate interpretation is thus possible: for example, if 100 random alignments all have scores lower than that of the alignment of interest, the P-value in question is likely less than 0.01. It is important to note that the meaning of this statistic is different from the one derived from a database of random similarities (equation 16). Namely, for two sequences of similar but unusual amino acid composition, the Z-score may be low, even if the two sequences compared are both very different from the rest of the database.
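The randomization procedure behind equation [3] can be sketched in Python as follows. This is only an illustration under stated assumptions: random pairs are generated by shuffling the two sequences (which preserves length and composition), the scoring scheme is a toy match/mismatch/linear-gap scheme rather than a substitution matrix, and the function names and example sequences are hypothetical.

```python
import random
import statistics


def global_score(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal global (Needleman-Wunsch) score; toy scoring, linear gaps."""
    prev = [j * gap for j in range(len(b) + 1)]  # row for the empty prefix of a
    for i in range(1, len(a) + 1):
        cur = [i * gap]  # a[:i] aligned against the empty prefix of b
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(prev[j - 1] + sub,   # align a[i-1] with b[j-1]
                           prev[j] + gap,       # gap in b
                           cur[j - 1] + gap))   # gap in a
        prev = cur
    return prev[-1]


def shuffled(seq, rng):
    """Random sequence with the same length and composition as seq."""
    chars = list(seq)
    rng.shuffle(chars)
    return "".join(chars)


def z_score(a, b, n_random=100, seed=0):
    """Return (Z, number of random scores >= the original score)."""
    rng = random.Random(seed)
    s = global_score(a, b)
    rand = [global_score(shuffled(a, rng), shuffled(b, rng))
            for _ in range(n_random)]
    mean_r = statistics.mean(rand)
    sd_r = statistics.stdev(rand)
    if sd_r == 0:
        raise ValueError("random scores have zero variance; Z is undefined")
    n_at_least = sum(r >= s for r in rand)
    return (s - mean_r) / sd_r, n_at_least


# Made-up example sequences (assumptions, not real data).
seq_a = "MKTAYIAKQRQISFVKSHFSRQ"
seq_b = "MKSAYIAKQRNISFVKAHFSRQ"
z, n_at_least = z_score(seq_a, seq_b)
print(f"Z = {z:.2f}; random scores >= original: {n_at_least} of 100")
```

If none of the 100 random scores reaches the original score, this matches the approximate reading given above, namely that the P-value is likely below 0.01; the Z value itself should still be treated as a rough indicator only.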

The general methods of sequence comparison can be used to divide the sequence

database into clusters. In principle, a metric distance measure (such as can be derived from

global alignment scores) is a prerequisite for statistical clustering. Given the large size of

databases, both global alignments and statistical clustering methods are compute-intensive.

On the other hand, the protein sequence space is sparsely populated and the existing natural

sequences form well-separated clusters, which makes it possible to use efficient,

approximate methods for clustering. Krause and Vingron used a threshold-based, iterative

procedure based on BLAST for identifying consistent protein clusters [10,11]. The result is an

objective picture of the sequence space in terms of similarities, but the clusters have to be
