Bioinformatics - Computational Support for Genome Analysis - Essays in Bioinformatics

Information Technology Reference

In-Depth Information

In multiple sequence alignment, similar sequence motifs are identified and protein

families are analysed. The general method of multiple alignment has been to extend the pairwise

alignment method into a simultaneous n-wise alignment by using a DP algorithm in n

dimensions. One can implement and visualise this algorithm easily in the case of three sequences

by setting up a three-dimensional matrix instead of the two-dimensional one. Then, the same

procedure as for the two-sequence case is followed. The result is a path that goes diagonally

through the cube-shaped matrix from one corner to the opposite. The problem here is that the

time to compute this n-wise alignment becomes expensive as the number of sequences grows.

The algorithmic complexity is something like O(c 2n ), where c is a constant, and n is the number

of sequences. This is not an acceptable performance. Rather than doing a simultaneous n-wise

alignment, using so-called progressive alignment method could be preferred. Then, the

alignment is built up in stages where a new sequence is added to an existing alignment using

some rules to determine in which order and how the sequences should be added [2]. However,

there are supporting approaches such as approximation algorithms, heuristics and pruning the

search space based on the contextual information. Scoring criteria for multiple alignment could

be entropy methods scoring each column based on the probability distribution of the characters

in it; tree alignment metrics assuming knowledge of an existing phylogenic tree and weight

differences between closely related sequence pairs as more important than distant pairs; sum of

pairs metric , which is the most popular, summing up the cost of the k(k-1)/2 pairs of symbols in

each column as an upper bound. Finding the optimal sum of pairs alignment is non polynomial

(NP Complete). However, by exploiting the lower bounds given by each pairwise DP matrix, one

can heuristically reduce the number of states in the multiple DP matrix and hope to find the

optimal alignment of say 6-7 sequences of e.g. 200 characters in a reasonable amount of time.

Since the applications of multiple alignment fall beyond the range of exact solution algorithms,

we must employ heuristic methods. For example, randomly picking a sequence for deletion from

the alignment and then reinserting it at the position, which either maximises the score or with

probabilities biased toward the maximum score, a successful but not necessarily optimal result is

reached.

Alignment of aminoacid sequences differs from the nucleotide sequences. For nucleotide

sequences, a mismatch between sequences is usually scored as 1 whereas for aminoacids, the

possible pathways in which one aminoacid may be replaced by another need to be considered.

For example, Cysteine (TGT) and Tyrosine (TAT) have single but Cysteine and Methionine

(ATG) have 3 changes. The alignment of Cysteine with Tyrosine is less costly than the

alignment it with Methionine. For more than three average sized protein sequences, Heuristic

Progressive Alignment gives better results rather than DP approach, but not guaranteed to find

the optimal alignment. In Progressive Alignment procedure; n-1 pairwise alignments calculating

distance matrix; neighbour-joining tree is formed based on the similarity values in the distance

matrix; Progressive alignment following the neighbour-joining tree is performed where the most

closely related sequences are first aligned. However, there is no objective function and if an error

is introduced early in the alignment, it becomes impossible to correct it later in this procedure.

2. Conclusion and further work

For a complete method to support genome analysis (RNA, DNA, Proteins), the environment in

which the genome data resides must be examined precisely. Then, the assumptions, criteria,

limits, patterns, profiles, scoring schemes, attributes, associations, design and application of the

algorithms and the like are defined and initialised accordingly. So that, combining the math and

statistics with computational support and biological data come into the process. Manual analysis

Essays in Bioinformatics

Search WWH ::

Custom Search

Home