further information about the input sequences is available. But, as
mentioned above, there is no guarantee that a mathematically
optimal alignment also makes sense from a biological point of
view. Where possible, it is advisable to exploit additional
information that may be available about the sequences to be aligned.
With more and more known genes and proteins, it is likely that
sequences under study have known homologs in a database.
Information about these homologs can also be used to improve the
alignment. For example, the program DbCLUSTAL [33] uses BLAST
[21] to search for homologs of the input sequences in databases.
These homologs are then aligned together with the original
sequence set using CLUSTAL W, and finally the database hits are
removed again to obtain an MSA of the original set of input
sequences. If local similarities to database sequences are found,
they are used as anchor points. This approach was shown to
improve the performance of CLUSTAL W. Similarly, the latest
version of CLUSTAL, CLUSTAL Omega, can search the Pfam
database [34] and align those positions of the input sequences
that match the same position in a Pfam domain.
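The final step of the database-assisted workflow described above, removing the database hits from the extended alignment, can be sketched as follows. This is a minimal illustration, not the DbCLUSTAL implementation: the function name project_alignment, the dictionary representation of an alignment, and the toy sequences are all assumptions made for this example. Columns that contain only gaps after the database rows are removed are dropped.

```python
def project_alignment(extended_msa, keep_ids):
    """Restrict an alignment to the rows in keep_ids and drop all-gap columns.

    extended_msa maps sequence names to gapped strings of equal length
    (hypothetical representation chosen for this sketch).
    """
    rows = {name: seq for name, seq in extended_msa.items() if name in keep_ids}
    if not rows:
        return {}
    if len({len(seq) for seq in rows.values()}) != 1:
        raise ValueError("aligned rows must have equal length")

    names = list(rows)
    # Transpose rows into columns, then keep columns with at least one residue.
    columns = zip(*(rows[name] for name in names))
    kept = [col for col in columns if any(ch != "-" for ch in col)]
    return {name: "".join(col[i] for col in kept) for i, name in enumerate(names)}


# Toy data: two input sequences aligned together with one database hit.
extended = {
    "query1": "MKT-AYI--AK",
    "query2": "MK--AYIQ-AK",
    "db_hit1": "MKTGAYIQRAK",
}
projected = project_alignment(extended, {"query1", "query2"})
for name, row in projected.items():
    print(name, row)
```

The relative placement of the remaining residues is unchanged by the projection, so any improvement contributed by the database hits is retained in the reduced alignment.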
Inspired by this approach, we implemented a version of DIALIGN
for protein alignment that takes matches to Pfam domains as
additional input information [35]. More specifically, we construct
blocks of segments of the input sequences that match the same
segment of a Pfam domain. These blocks are then preferentially
included in the output MSA. We tested different ways of
integrating these blocks into our MSA procedure. In the most
straightforward way, the identified blocks are used as anchor
points for the subsequent alignment of the input sequences. For
this variant, we also developed an interactive approach in which
the blocks defined by common Pfam matches can be inspected by
the user and accepted or rejected based on expert knowledge.
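How segments that match the same part of a domain might be turned into pairwise anchor points can be sketched as follows. This is an illustration under simplifying assumptions, not the published implementation: each domain match is treated as a single ungapped segment, the names DomainHit, Anchor and anchors_from_hits are hypothetical, and the domain identifiers are made-up placeholders rather than real Pfam accessions. Two matches to the same domain yield an anchor over the part of the domain that both of them cover.

```python
from collections import defaultdict
from itertools import combinations
from typing import NamedTuple


class DomainHit(NamedTuple):
    seq_id: str
    domain: str
    seq_start: int  # 1-based start of the match in the sequence
    dom_start: int  # 1-based start of the match in the domain model
    length: int     # ungapped match length (simplifying assumption)


class Anchor(NamedTuple):
    seq_a: str
    start_a: int
    seq_b: str
    start_b: int
    length: int


def anchors_from_hits(hits):
    """Derive pairwise anchors from hits that cover the same domain positions."""
    by_domain = defaultdict(list)
    for hit in hits:
        by_domain[hit.domain].append(hit)

    anchors = []
    for domain_hits in by_domain.values():
        for a, b in combinations(domain_hits, 2):
            if a.seq_id == b.seq_id:
                continue
            # Overlap of the two matches in domain coordinates (end exclusive).
            lo = max(a.dom_start, b.dom_start)
            hi = min(a.dom_start + a.length, b.dom_start + b.length)
            if hi <= lo:
                continue
            # Translate the shared domain interval back to sequence coordinates.
            anchors.append(Anchor(a.seq_id, a.seq_start + lo - a.dom_start,
                                  b.seq_id, b.seq_start + lo - b.dom_start,
                                  hi - lo))
    return anchors


hits = [
    DomainHit("s1", "PF_A", 10, 1, 30),
    DomainHit("s2", "PF_A", 25, 11, 30),
    DomainHit("s3", "PF_B", 5, 1, 20),
]
anchors = anchors_from_hits(hits)
print(anchors)
```

In an interactive setting, a list of this kind is what a user would inspect before accepting or rejecting individual anchors.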
Alternatively, putative homologies identified by matches to the
same Pfam domain can be used together with similarities found by
pairwise DIALIGN alignment; these different similarities are then
integrated into a single output MSA using a graph-theoretical
algorithm that we recently proposed [23]. We tested these new
approaches systematically using BAliBASE [36, 37] and SABmark
[38] as benchmark databases. Using homologies to Pfam domains
considerably improved the performance of DIALIGN [35].
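Benchmark evaluations of this kind compare a computed alignment with a reference alignment of the same sequences. The text does not state which quality measure was used; the sketch below assumes a sum-of-pairs style score, defined here as the fraction of residue pairs aligned in the reference that are also aligned in the test alignment. The function names and the toy alignments are illustrative only.

```python
from itertools import combinations


def aligned_pairs(msa):
    """Return the set of residue pairs that share a column.

    Each residue is identified by (sequence name, index in the ungapped sequence).
    """
    names = sorted(msa)
    width = len(msa[names[0]])
    if any(len(msa[name]) != width for name in names):
        raise ValueError("aligned rows must have equal length")

    next_index = {name: 0 for name in names}
    pairs = set()
    for col in range(width):
        present = []
        for name in names:
            if msa[name][col] != "-":
                present.append((name, next_index[name]))
                next_index[name] += 1
        # Names are visited in sorted order, so each pair has a canonical form.
        pairs.update(combinations(present, 2))
    return pairs


def sum_of_pairs_score(test, reference):
    """Fraction of reference residue pairs recovered by the test alignment."""
    reference_pairs = aligned_pairs(reference)
    if not reference_pairs:
        return 0.0
    return len(reference_pairs & aligned_pairs(test)) / len(reference_pairs)


reference = {"a": "AC-GT", "b": "ACAGT"}
test = {"a": "ACG-T", "b": "ACAGT"}
score = sum_of_pairs_score(test, reference)
print(score)  # 0.75: three of the four reference pairs are recovered
```

Because residues are identified by their position in the ungapped sequence, the two alignments may differ in length without affecting the comparison.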
5 Altavist: Comparing Multiple Alignments
Every molecular biologist knows that the reliability of automated
MSA methods is limited. In fact, many biologists argue that the
best way of creating multiple alignments is still to have experts
create them manually. Such alignments are often superior to
automatically computed alignments. For the same reasons, it is