further information about the input sequences is available. But, as
mentioned above, there is no guarantee that a mathematically
optimal alignment also makes sense from a biological point of
view. Where possible, it is advisable to exploit any additional
information that is available about the sequences to be aligned.
With more and more known genes and proteins, it is likely that
sequences under study have known homologs in a database. Information
about these homologs can also be used to improve the alignment.
For example, the program DbCLUSTAL [33] uses BLAST [21] to search
for homologs of the input sequences in databases. These homologs
are then aligned together with the original sequence set using
CLUSTAL W, and finally the database hits are removed again to obtain
an MSA of the original set of input sequences. If local similarities
to database sequences are found, they are used as anchor points.
This approach has been shown to increase the performance of
CLUSTAL W. Similarly, the latest version of CLUSTAL, CLUSTAL Omega,
can use searches against the Pfam database [34] and align those
positions of the input sequences that match the same position in a
Pfam domain.
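
The final step of the database-assisted workflow described above, removing
the database hits from the joint alignment, can be illustrated with a small
sketch. This is not the DbCLUSTAL code; the function name
strip_added_sequences and the dictionary representation of an alignment are
assumptions made for illustration only. Columns that contain nothing but
gaps once the database hits are gone are dropped as well.

```python
def strip_added_sequences(alignment, original_ids, gap="-"):
    """Keep only the original sequences of a joint alignment.

    alignment    -- dict mapping sequence name to its gapped row
                    (hypothetical representation, all rows of equal length)
    original_ids -- names of the sequences the user originally supplied
    """
    kept = {name: row for name, row in alignment.items()
            if name in original_ids}
    if not kept:
        return {}
    names = list(kept)
    # Transpose rows into columns and keep every column that still
    # contains at least one residue after the database hits are removed.
    columns = [col for col in zip(*(kept[n] for n in names))
               if any(ch != gap for ch in col)]
    if not columns:
        return {name: "" for name in names}
    # Transpose back into rows.
    rows = ["".join(chars) for chars in zip(*columns)]
    return dict(zip(names, rows))


joint = {
    "q1":  "MK-TAYIA",   # original input sequence
    "q2":  "MK-SAY-A",   # original input sequence
    "hit": "MKLTAYQA",   # database homolog added before alignment
}
reduced = strip_added_sequences(joint, {"q1", "q2"})
```

In this toy example, the third column is occupied only by the database hit,
so it disappears from the reduced alignment, while the relative positions of
the residues of q1 and q2 are left unchanged.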
Inspired by this approach, we implemented a version of DIALIGN
for protein alignment that takes matches to Pfam domains as
additional input information [35]. More specifically, we construct
blocks of segments of the input sequences that match the same
segment of a Pfam domain. These blocks are then preferentially
included in the output MSA. We tested different ways of integrating
these blocks into our MSA procedure. In the most straightforward
variant, the identified blocks are used as anchor points for the
subsequent alignment of the input sequences. Here, we also developed
an interactive approach in which the blocks defined by common Pfam
matches can be inspected by the user and accepted or rejected based
on expert knowledge.
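
To make the idea of blocks more concrete, the following sketch derives
pairwise anchor segments from domain matches. It is a simplified
illustration and not the implementation described in [35]: the record types
DomainMatch and Anchor, the function anchors_from_matches, and the domain
names are hypothetical, and every match is assumed to be ungapped, which
real profile matches usually are not. Two sequences are anchored wherever
their matches cover the same stretch of the same domain.

```python
from collections import defaultdict
from itertools import combinations
from typing import NamedTuple


class DomainMatch(NamedTuple):
    seq_id: str
    domain: str
    seq_start: int  # 0-based start of the match in the input sequence
    dom_start: int  # 0-based start of the match in the domain model
    length: int     # match length (ungapped, a simplifying assumption)


class Anchor(NamedTuple):
    seq_a: str
    start_a: int
    seq_b: str
    start_b: int
    length: int


def anchors_from_matches(matches):
    """Pair up segments of different sequences that match the same
    segment of the same domain."""
    by_domain = defaultdict(list)
    for m in matches:
        by_domain[m.domain].append(m)

    anchors = []
    for group in by_domain.values():
        for a, b in combinations(group, 2):
            if a.seq_id == b.seq_id:
                continue
            # Overlap of the two matches in domain coordinates.
            lo = max(a.dom_start, b.dom_start)
            hi = min(a.dom_start + a.length, b.dom_start + b.length)
            if hi <= lo:
                continue
            # Map the shared domain segment back to each sequence.
            anchors.append(Anchor(a.seq_id, a.seq_start + (lo - a.dom_start),
                                  b.seq_id, b.seq_start + (lo - b.dom_start),
                                  hi - lo))
    return anchors


matches = [
    DomainMatch("s1", "domA", seq_start=10, dom_start=0, length=50),
    DomainMatch("s2", "domA", seq_start=30, dom_start=20, length=40),
    DomainMatch("s3", "domB", seq_start=5, dom_start=0, length=10),
]
anchors = anchors_from_matches(matches)
```

Here s1 and s2 share positions 20 to 49 of domA, which yields a single anchor
of length 30 starting at position 30 in both sequences; s3 matches a
different domain and contributes no anchor. In an interactive setting, a
list of this kind is what a user would inspect and accept or reject.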
Alternatively, putative homologies identified by matches to the
same Pfam domain can be used together with similarities found by
pairwise DIALIGN alignment; these different similarities are finally
integrated into a single output MSA using a graph-theoretical
algorithm that we recently proposed [23]. We tested these new
approaches systematically using BAliBASE [36, 37] and SABmark [38]
as benchmark databases. Using homologies to Pfam domains
considerably improved the performance of DIALIGN [35].
5 Altavist: Comparing Multiple Alignments
Every molecular biologist knows that the reliability of automated
MSA methods is limited. In fact, many biologists argue that the
best way of creating multiple alignments is still to have experts
create them manually. Such alignments are often superior to
automatically computed alignments. For the same reasons, it is