Biomedical Engineering Reference
In-Depth Information
A pairwise alignment is performed using the dynamic programming algorithm of
Needleman and Wunsch [11], as modified by Gotoh [3] to obtain O(m 2 ) performance,
where m is the length of the sequences being aligned. For aligning two sequences S
and T , the process involves filling in a matrix that represents all possible alignments
of the two sequences. If S i is the i th amino acid of protein S and T j is the j th amino
acid of protein T , then the cell A(i , j) represents a particular alignment of the two
sequences, where S i is aligned with T j . The matrix is also used to penalize gaps in the
alignment, to encourage fewer and shorter gaps. In general, there are many different
ways of assigning scores for gap , match , and mismatch ; see, for example, the article
by Thompson et al. [15] for a detailed discussion of score assignments in CLUSTAL
W. In the example given subsequently, we assume that a gap is assigned a score of
1). The aim is to
find an alignment that has a maximum score. Each step in the dynamic programming
algorithm can be summarized as follows:
2 and a match (or mismatch) is assigned a score of
+
1 (or
Each entry A(i , j) can be filled in three ways [ A(i
1, j
1 ) , A(i
1, j) , and
A(i , j
1 ) are the three elements immediately preceding A(i , j) in the matrix]
1. The last nucleotide of S(i) is aligned with T( j) . There are two pos-
sibilities, either they match (
+
1) or they mismatch (
1). The cost is
1
2. S(i) is aligned with a gap in T . The cost is A(i
A(i
1, j
1 )
±
1, j)
2
3. T( j) is aligned with a gap in S . The cost is A(i , j
1 )
2
In each case, the rest of prefix of S should be optimally aligned with the rest of
prefix of T .
We can write these three possibilities in a more mathematical form as follows:
A(i
1, j)
2,
align S(i) with a gap
A(i , j)
=
max
A(i , j
1 )
2,
align T( j) with a gap
A(i
1, j
1 )
±
1
align S(i) with T(j)
Hence, we need the three entries A(i
1 ) for filing
the entry A(i , j) . Note that all these three entries have been already computed when
we are trying to compute A(i , j) .
The time to fill the entire matrix is O(m
1, j) , A(i , j
1 ) , and A(i
1, j
n) when the two sequences are of
length m and n , respectively. The cost of the optimal alignment is the cost in the
A(m , n) th entry of the dynamic programming matrix. At every step, we can keep back
pointers to indicate where the optimal cost came from. Hence, we can determine the
optimal alignment by following these back pointers when the dynamic programming
is completed.
The value of the A(m , n) th entry in this matrix is the distance or score for the optimal
alignment of the two sequences; the higher the score, the closer the relationship
between the two sequences. This is because a higher score can be achieved only if a
large number of residues or bases in the two strings match. Recall that only matches
×
Search WWH ::




Custom Search