Bioinformatics - Computational Support for Genome Analysis - Essays in Bioinformatics

1. Analysis of current work

Protein and gene sequences are compared for similarity in order to examine whether they are related by evolution, that is, whether they share a common ancestor. However, sequences with a common ancestor accumulate random mutations over time, and similar portions can also arise in sequences with different structures and functions; both effects should be considered in such studies. In the parts of a sequence that are critical for the function of the protein, hardly any mutations are accepted, since nearly all changes in such regions destroy the function [2].

One important algorithm used in sequence analysis is dynamic programming (DP). In DP, a table is built that holds the results of all previously solved subproblems, and the solution of the full problem then depends on the solutions of the smaller ones in the table. A recursive structure for computing the optimal score is designed, and the interdependent sub-solutions are filled into the table using the recurrence rule. The table is built iteratively from this recurrence and the result is computed in a bottom-up fashion. The table should be constructed efficiently, since scanning the whole table already leads to quadratic running time. What if (a) the solutions of smaller problems of the same kind cannot be combined to form the solution of a larger one, (b) the number of small problems to solve is unacceptably large, or (c) the costs are fractional, which limits the efficiency of DP? Reducing the search space and employing other techniques such as top-down DP, divide and conquer, the greedy approach and progressive sequence alignment, either alongside the procedure or in place of it, might help in such cases. The bottom line is that DP is applicable when the problem is an optimisation problem and its subproblems are not independent.
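The bottom-up and top-down styles of DP described above can be sketched for the global alignment score of two sequences. This is only a minimal illustration under stated assumptions: the function names, the scoring values (`match=1`, `mismatch=-1`, `gap=-2`) and the linear gap penalty are chosen for the example and are not taken from the essay.

```python
from functools import lru_cache


def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Bottom-up DP: fill the whole table, then read the final cell.

    Scoring values are illustrative assumptions (linear gap penalty).
    """
    rows, cols = len(a) + 1, len(b) + 1
    table = [[0] * cols for _ in range(rows)]

    # Border cells: a prefix aligned against nothing costs one gap per residue.
    for i in range(1, rows):
        table[i][0] = i * gap
    for j in range(1, cols):
        table[0][j] = j * gap

    # Each inner cell depends only on three already-filled neighbours.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            table[i][j] = max(
                table[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                table[i - 1][j] + gap,      # gap in b
                table[i][j - 1] + gap,      # gap in a
            )
    return table[-1][-1]


def global_alignment_score_top_down(a, b, match=1, mismatch=-1, gap=-2):
    """Top-down DP: same recurrence, solved recursively with memoisation."""

    @lru_cache(maxsize=None)
    def best(i, j):
        if i == 0:
            return j * gap
        if j == 0:
            return i * gap
        sub = match if a[i - 1] == b[j - 1] else mismatch
        return max(
            best(i - 1, j - 1) + sub,
            best(i - 1, j) + gap,
            best(i, j - 1) + gap,
        )

    return best(len(a), len(b))


if __name__ == "__main__":
    print(global_alignment_score("ACGT", "AGT"))
    print(global_alignment_score_top_down("ACGT", "AGT"))
```

Both functions evaluate the same recurrence, so they return the same score. The two nested loops in the bottom-up version make the quadratic cost visible. The recursive version is limited by Python's recursion depth and is therefore only suitable for short sequences.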

The assumptions and inferences made are based on evolutionary change and constitute the context in which the alignment process takes place. An optimal alignment is the one with the maximum number of matches and the minimum number of mismatches and gaps. The score of an alignment is the sum of its position scores. The gap penalty used in the scoring scheme is important: it helps to decide whether or not to accept a gap or insertion in an alignment when a good alignment could instead be achieved at some neighbouring point in the sequence. Gaps and insertions cannot be allowed to occur without penalty; otherwise an unreasonable alignment full of gaps would result. Biologically, it is more natural for a protein to accept a different residue at a position than to have parts of the sequence deleted or inserted. Gaps and insertions should therefore be rarer than point mutations (substitutions) [2].
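The idea that an alignment's score is the sum of its position scores, and that gaps must carry a penalty, can be illustrated with a short sketch. The function name `alignment_score`, the `-` gap symbol and the scoring values are assumptions made for this example only.

```python
def alignment_score(row_a, row_b, match=1, mismatch=-1, gap=-2):
    """Sum the per-column scores of two already-aligned rows.

    '-' marks a gap; all scoring values are illustrative assumptions.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")

    total = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            total += gap        # column contains a gap
        elif x == y:
            total += match      # identical residues
        else:
            total += mismatch   # substitution
    return total


if __name__ == "__main__":
    # Ungapped alignment with a single substitution.
    print(alignment_score("ACGTAC", "ACCTAC"))
    # Gapped alignment of the same sequences, with and without a gap penalty.
    print(alignment_score("ACG-TAC", "AC-CTAC"))
    print(alignment_score("ACG-TAC", "AC-CTAC", gap=0))
```

With the assumed penalty of -2 per gap, the ungapped alignment with one substitution scores higher than the gapped one. With `gap=0` the gapped alignment wins instead, which is the unreasonable outcome that the penalty is meant to prevent.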

In pairwise alignment, a two-dimensional matrix is set up with one sequence on each axis. The elements of the matrix are initially the substitution coefficients, which are then operated on to locate the best path through the matrix. The number of operations required is approximately proportional to the product of the lengths of the two sequences. A dot plot, as a graphical tool, can help in aligning two sequences. Pairwise sequence alignment is the basis for other analyses, even for experimental design tasks such as PCR primer design. However, pairwise alignment has some problems. For example, when many sequences that are significantly similar to the query sequence are obtained, comparing each sequence to every other may become impractical as the number of sequences increases. In that case multiple sequence alignment is employed, in which all similar sequences can be compared in one single figure or table. The basic idea is that the sequences are aligned on top of each other so that a co-ordinate system is set up, in which each row is the sequence of one protein and each column is the same position in each sequence. Each column corresponds to a specific residue in the prototypical protein. One may have to introduce gaps in sequences at positions where there were no gaps in the corresponding pairwise alignment; thus, multiple alignments typically contain more gaps than any given pair of aligned sequences.
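The dot plot mentioned above can be sketched as a simple text matrix. This is a minimal illustration, assuming exact residue matches only (no window or threshold filtering); the function name `dot_plot` and the `*` and `.` symbols are choices made for the example.

```python
def dot_plot(seq_x, seq_y):
    """Return one text row per residue of seq_y.

    A '*' marks a column of seq_x whose residue is identical to the
    row's residue; '.' marks everything else.
    """
    return [
        "".join("*" if x == y else "." for x in seq_x)
        for y in seq_y
    ]


if __name__ == "__main__":
    for row in dot_plot("GATTACA", "GATCACA"):
        print(row)
```

Runs of `*` along a diagonal indicate stretches where the two sequences agree, which is where an alignment path through the matrix is likely to pass. The plot has one cell per pair of residues, so its size reflects the product of the two sequence lengths noted above.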
