However, if we know that $S_{1,i}$ aligns to $S_{3,l}$ in a third sequence $S_3$ and $S_{3,l}$ aligns well to $S_{2,k}$, then we can choose to align $S_{1,i}$ to $S_{2,k}$.
For example, in the given sequences $S_1$ and $S_2$, the “FASTCAT” substring of $S_2$ can be aligned comparably well to either the “LASTFAT” or the “FATCAT” substring of $S_1$. A third sequence $S_3$ resolves this ambiguity as follows:
$S_1$: GARFIELD THE LAST FA-T CAT
$S_3$: GARFIELD THE VERY FAST CAT
$S_2$: GARFIELD THE ---- FAST CAT
Here, $w(S_1, S_3) = 77$ and $w(S_3, S_2) = 100$. The weight of the alignment of $S_1$ and $S_2$ through $S_3$ is $w(S_1, S_2) = \min(w(S_1, S_3), w(S_3, S_2)) = 77$, so we update the weight of the alignment of $S_1$ and $S_2$ in the primary library with a new score $77 + 88 = 165$.
Although this alignment of $S_1$ and $S_2$ scores lower than their optimum pairwise alignment, it provides a better overall MSA.
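The update in this example can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: it keeps one weight per sequence pair, as the worked example does, rather than one weight per aligned residue pair, and it takes 88 to be the primary-library weight of $S_1$ and $S_2$, as the sum 77 + 88 suggests. The function name `extend_library` is hypothetical and is not part of T-Coffee.

```python
def extend_library(primary):
    """Return extended pair weights.

    For each pair (a, b), start from its primary weight and add, for every
    third sequence c linked to both, min(w(a, c), w(c, b)).
    """
    # Store each weight in both orientations so lookups are symmetric.
    w = {}
    for (a, b), value in primary.items():
        w[(a, b)] = value
        w[(b, a)] = value

    names = sorted({name for pair in primary for name in pair})
    extended = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            score = w.get((a, b), 0)
            for c in names:
                if c in (a, b):
                    continue
                if (a, c) in w and (c, b) in w:
                    # Contribution of the path a -> c -> b.
                    score += min(w[(a, c)], w[(c, b)])
            extended[(a, b)] = score
    return extended


# Weights from the example; 88 is the assumed primary weight of (S1, S2).
primary = {("S1", "S2"): 88, ("S1", "S3"): 77, ("S3", "S2"): 100}
extended = extend_library(primary)
print(extended[("S1", "S2")])  # 88 + min(77, 100) = 165
```

With more than three sequences, every additional intermediate sequence would contribute its own minimum to the sum in the same way.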
Finally, T-Coffee produces its final MSA by using the traditional progressive alignment-based approaches on the modified pairwise scores in the secondary library. An appealing option of T-Coffee is that the program welcomes user-provided input sequences for the primary library. Moreover, the latest version of T-Coffee includes structural information for improved multiple protein alignments [14].
3.3 MAFFT
MAFFT, a high-speed multiple sequence alignment program, uses the Fast Fourier Transform (FFT) to identify homologous regions quickly after converting amino acid sequences into two feature vectors [15]. These feature vectors, which are composed of six components in total, represent the volume and polarity of the amino acids in a sequence [16]. The motivating idea in MAFFT is that highly correlated sequences may have homologous regions; sequence correlation is calculated by FFT from the normalized volume and polarity values, $\hat{v}(a)$ and $\hat{p}(a)$, respectively:
$$\hat{v}(a) = \frac{v(a) - \bar{v}}{\sigma_v}, \qquad \hat{p}(a) = \frac{p(a) - \bar{p}}{\sigma_p}.$$
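As a rough illustration of this normalization, the sketch below subtracts the mean of a per-residue property and divides by its standard deviation. The property values are hypothetical toy numbers for a three-letter alphabet, not real amino acid volumes, and the use of the population standard deviation is an assumption, since the text does not say which estimator is meant.

```python
from statistics import fmean, pstdev


def normalize(values):
    """Map each residue's raw property value x to (x - mean) / std."""
    raw = list(values.values())
    mean = fmean(raw)
    std = pstdev(raw)  # assumption: population standard deviation
    return {residue: (x - mean) / std for residue, x in values.items()}


# Hypothetical toy values, NOT real amino acid volumes.
toy_volume = {"A": 1.0, "B": 2.0, "C": 3.0}
v_hat = normalize(toy_volume)
print(v_hat)
```

After normalization the values have zero mean and unit standard deviation, so the volume and polarity terms contribute on a comparable scale when they are added.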
Correlation between two sequences is then defined as:
$$c(k) = c_v(k) + c_p(k),$$
where
$$c_v(k) = \sum_{1 \le n \le N,\; 1 \le n+k \le M} \hat{v}_1(n)\,\hat{v}_2(n+k),$$
$$c_p(k) = \sum_{1 \le n \le N,\; 1 \le n+k \le M} \hat{p}_1(n)\,\hat{p}_2(n+k).$$
$N$ and $M$ denote the lengths of the two sequences.
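The correlation at a lag $k$ can be evaluated directly from its definition, as in the following sketch. It is a plain reference computation, not the FFT that MAFFT uses: the FFT yields the same values of $c(k)$ for all lags at once and much faster. The input vectors are assumed to hold already normalized volume and polarity values per position, the numbers in the usage example are made up for illustration, and the function names are hypothetical.

```python
def correlation(x1, x2, k):
    """Sum x1(n) * x2(n + k) over 1 <= n <= N with 1 <= n + k <= M (1-based)."""
    N, M = len(x1), len(x2)
    total = 0.0
    for n in range(1, N + 1):
        m = n + k
        if 1 <= m <= M:
            total += x1[n - 1] * x2[m - 1]  # shift to 0-based list indices
    return total


def total_correlation(v1, p1, v2, p2, k):
    """c(k) = c_v(k) + c_p(k) for volume (v) and polarity (p) vectors."""
    return correlation(v1, v2, k) + correlation(p1, p2, k)


def best_lag(v1, p1, v2, p2):
    """Return the lag k with the largest c(k) among all overlapping lags."""
    N, M = len(v1), len(v2)
    lags = range(-(N - 1), M)  # k runs from 1 - N to M - 1
    return max(lags, key=lambda k: total_correlation(v1, p1, v2, p2, k))


# Made-up vectors: sequence 2 contains sequence 1's pattern shifted by 2.
v1, p1 = [1.0, -1.0, 2.0], [0.5, 0.5, -1.0]
v2, p2 = [0.0, 0.0, 1.0, -1.0, 2.0], [0.0, 0.0, 0.5, 0.5, -1.0]
print(best_lag(v1, p1, v2, p2))
```

A peak in $c(k)$ at some lag indicates that the two sequences may share a homologous region when one is shifted by $k$ positions relative to the other.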