Information Technology Reference
In-Depth Information
In multiple sequence alignment, similar sequence motifs are identified and protein
families are analysed. The general method of multiple alignment has been to extend the pairwise
alignment method into a simultaneous n-wise alignment by using a DP algorithm in n
dimensions. One can implement and visualise this algorithm easily in the case of three sequences
by setting up a three-dimensional matrix instead of the two-dimensional one. Then, the same
procedure as for the two-sequence case is followed. The result is a path that goes diagonally
through the cube-shaped matrix from one corner to the opposite. The problem here is that the
time to compute this n-wise alignment becomes expensive as the number of sequences grows.
The algorithmic complexity is something like O(c 2n ), where c is a constant, and n is the number
of sequences. This is not an acceptable performance. Rather than doing a simultaneous n-wise
alignment, using so-called progressive alignment method could be preferred. Then, the
alignment is built up in stages where a new sequence is added to an existing alignment using
some rules to determine in which order and how the sequences should be added [2]. However,
there are supporting approaches such as approximation algorithms, heuristics and pruning the
search space based on the contextual information. Scoring criteria for multiple alignment could
be entropy methods scoring each column based on the probability distribution of the characters
in it; tree alignment metrics assuming knowledge of an existing phylogenic tree and weight
differences between closely related sequence pairs as more important than distant pairs; sum of
pairs metric , which is the most popular, summing up the cost of the k(k-1)/2 pairs of symbols in
each column as an upper bound. Finding the optimal sum of pairs alignment is non polynomial
(NP Complete). However, by exploiting the lower bounds given by each pairwise DP matrix, one
can heuristically reduce the number of states in the multiple DP matrix and hope to find the
optimal alignment of say 6-7 sequences of e.g. 200 characters in a reasonable amount of time.
Since the applications of multiple alignment fall beyond the range of exact solution algorithms,
we must employ heuristic methods. For example, randomly picking a sequence for deletion from
the alignment and then reinserting it at the position, which either maximises the score or with
probabilities biased toward the maximum score, a successful but not necessarily optimal result is
reached.
Alignment of aminoacid sequences differs from the nucleotide sequences. For nucleotide
sequences, a mismatch between sequences is usually scored as 1 whereas for aminoacids, the
possible pathways in which one aminoacid may be replaced by another need to be considered.
For example, Cysteine (TGT) and Tyrosine (TAT) have single but Cysteine and Methionine
(ATG) have 3 changes. The alignment of Cysteine with Tyrosine is less costly than the
alignment it with Methionine. For more than three average sized protein sequences, Heuristic
Progressive Alignment gives better results rather than DP approach, but not guaranteed to find
the optimal alignment. In Progressive Alignment procedure; n-1 pairwise alignments calculating
distance matrix; neighbour-joining tree is formed based on the similarity values in the distance
matrix; Progressive alignment following the neighbour-joining tree is performed where the most
closely related sequences are first aligned. However, there is no objective function and if an error
is introduced early in the alignment, it becomes impossible to correct it later in this procedure.
2. Conclusion and further work
For a complete method to support genome analysis (RNA, DNA, Proteins), the environment in
which the genome data resides must be examined precisely. Then, the assumptions, criteria,
limits, patterns, profiles, scoring schemes, attributes, associations, design and application of the
algorithms and the like are defined and initialised accordingly. So that, combining the math and
statistics with computational support and biological data come into the process. Manual analysis
Search WWH ::




Custom Search