Biology Reference
In-Depth Information
extension), predicted or observed higher-order structures,
constraints on conserved linear motifs, may also improve the
quality of MSA.
2 Methods
2.1 Pairwise
Alignment
Alignment of two sequences, a ¼
b n , taking
account of the mutational events of substitutions, deletions, and
insertions is the fundamental procedure in any biological sequence
alignment problems. In most situations, deletions in one sequence
cannot be distinguished from insertions in the other, and hence they
are collectively called indels or gaps. Note that a gap means a
contiguous run of blanks or nulls indicating the absence of a residue.
Given an alignment of two sequences, we can easily calculate the
alignment score H ( a , b ) by adding the substitution scores S ( a i , b j )
for matched pairs a i ~ b j (a matched pair in an alignment will be
indicated by the '~' symbol hereafter) and the gap penalties g ( k ) for
individual indels of length k . Although many amino-acid substitu-
tion matrices S ( a , b ) have been published [ 3 ], most of them follow
the log-odds scoring scheme first proposed by Dayhoff et al. [ 4 ].
A few nucleotide substitution matrices reflecting biased G + C
composition and unequal rates of transitions and transversions are
also available [ 5 , 6 ].
Of the exponentially many possible alignments, we want to find
the one that has the maximal H ( a , b )( see Note 1 ). In 1970, Needle-
man and Wunsch [ 7 ] first invented an efficient algorithm categor-
ized in the dynamic programming paradigm. Needleman and
Wunsch considered the case of semi-global alignment ( see Note 2 ),
and g ( k )
a 1 ...
a m and b ¼
b 1 ...
¼
v
¼
constant with the computational complexity of
O ( L 3
m 2 n ). Later, Sellers [ 8 ] provided mathematically more
rigorous formulation in the case of global alignment and “propor-
tional gap penalty,” g ( k )
¼
¼
uk ( u is a constant), with the computa-
tional complexity of O ( L 2
mn ). Waterman et al. [ 9 ] generalized
the Sellers's formulation to any functional forms of g ( k ), but the
computational complexity went back to O ( L 3 ). Gotoh [ 10 ] showed
that the Waterman et al.'s algorithm is solved in O ( L 2 ) for an “affine
gap penalty,” g ( k )
¼
( v + uk ), where u and v are non-negative
constants. The constant term v realizes the fact that creation of a
new gap is more difficult than extension of an existing gap, and v and
u are often called “gap open penalty” and “gap extension penalty,”
respectively ( see Note 3 ). The Gotoh's algorithm is represented in a
recurrent form as follows:
¼
H i;j
¼
Max H i 1 ;j 1 þ
S
ð
a i
;
b j
Þ;
E i;j
;
F i;j
E i;j
¼
Max H i 1 :j
v
;
E i 1 ;j
u
F i;j ¼
Max H i:j 1
v
;
F i;j 1
u
:
(1)
Search WWH ::




Custom Search