Bioinformatics - Computational Support for Genome Analysis - Essays in Bioinformatics


and editing the attributes given above over the colour-coded alignments are revisited, and the values are inserted into the work where necessary. In some cases, even when a great deal of information about the proteins is available, such as active-site residues, secondary structure, 3D structure and mutations, it may still be necessary to make a manual alignment that fits all the data [2].

The number of known sequences, and of sequence and structure alignments, is growing exponentially. The analysis data from those studies should be geared to the needs of bioinformaticians. For example, the decision whether two sequences are merely similar or actually homologous affects the whole process. It must also be kept in mind that certain regions contain residues that are more crucial to structure and function than others. When more than 25% of the aligned residues of two protein sequences are identical, the corresponding 3D structures are said to be very similar, which implies similar functionality. The sequence alignment of proteins therefore remains an approximate predictor of the underlying 3D structural alignment. However, experimental findings on the evolutionary background should consolidate these studies [3].
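
The 25% identity rule of thumb above can be made concrete with a short sketch. It assumes that identity is counted only over columns where both sequences carry a residue; other conventions for the denominator exist, and the function names and the gap character are illustrative choices, not taken from the text.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percentage of identical residues over columns where both rows have a residue."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    aligned_columns = 0
    identical = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            continue  # assumption: gap columns do not count towards the denominator
        aligned_columns += 1
        if x == y:
            identical += 1
    if aligned_columns == 0:
        return 0.0
    return 100.0 * identical / aligned_columns


def above_identity_threshold(aligned_a: str, aligned_b: str,
                             threshold: float = 25.0) -> bool:
    """True when identity exceeds the threshold (25% is the figure quoted in the text)."""
    return percent_identity(aligned_a, aligned_b) > threshold
```

A result above the threshold is only a hint that the structures may be similar; as the text notes, it is an approximate predictor, not proof of homology.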

Scoring schemes can draw on operations such as match, mismatch, insertion, deletion and the introduction of gaps, in varying numbers and under varying definitions, even with different scoring subschemes. Depending on the context, some changes are more plausible than others, and a probabilistic interpretation is made of how likely one alignment is relative to another. Success depends not only on parameters such as insertion and deletion penalties and substitution coefficients, but also on the order in which sequences are added to the multiple alignment. A number of rules are used to increase the success rate of the procedure; for instance, each sequence is weighted according to how different it is from the other sequences. Among the many possible scoring schemes, one can employ position-specific scores. For example, if it is known from other sources, such as the 3D structure, that a gap should not be allowed in a certain part of a sequence, then higher gap penalties can be set for that part in the relevant calculation.
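
As an illustration of position-specific gap penalties, the sketch below scores a given pairwise alignment in which the cost of a gap column depends on where it falls along the first sequence. The match and mismatch values, the penalty list and the function name are hypothetical and chosen only for the example; real schemes would normally use a substitution matrix.

```python
def score_alignment(aligned_a: str, aligned_b: str, gap_penalty_at: list,
                    match: int = 2, mismatch: int = -1) -> int:
    """Score an alignment; gap_penalty_at[i] is the cost of a gap column
    located at residue position i of sequence A (0-based, ungapped)."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    last = len(gap_penalty_at) - 1
    pos_a = 0  # number of residues of A consumed so far
    score = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" and y == "-":
            raise ValueError("a column cannot consist of two gaps")
        if x == "-" or y == "-":
            # A gap past the end of A reuses the penalty of the last position.
            score -= gap_penalty_at[min(pos_a, last)]
        elif x == y:
            score += match
        else:
            score += mismatch
        if x != "-":
            pos_a += 1
    return score


# Hypothetical case: positions 2 and 3 of A lie in a region where structural
# data suggest that gaps should be avoided, so they carry a higher penalty.
penalties = [2, 2, 8, 8, 2, 2]
gap_in_protected_region = score_alignment("ACDEFG", "AC-EFG", penalties)
gap_at_end = score_alignment("ACDEFG", "ACDEF-", penalties)
```

With these numbers the alignment that places the gap in the protected region scores lower than the one that places it at the end, which is the intended effect of raising the penalty there.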

In the overall calculation, local alignments, global alignments, or a combination of the two should be considered, whichever fits better. Local alignments, which align only the regions of high similarity in two sequences rather than aligning the sequences from end to end, may be preferred and carried out to support the global alignment. Sort and search techniques may be borrowed when running the alignment procedure on the basis of contextual information. A context-sensitive grammar may be formed to model the contextual information within the environment in which the related process runs. Large multiple alignments could well be clustered, supported by alternative representations. How can we represent a pattern of residues as found in a multiple alignment? And how can we use such a pattern to search other protein sequences for it? The formalism devised to describe the kind of patterns we need is the regular expression, which describes particular languages in restricted cases.
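
One simple way to turn a block of a multiple alignment into a regular expression is sketched below. It assumes an ungapped block of equal-length rows made of uppercase residue letters, and it uses Python's standard `re` module; the function names and the example block are illustrative only.

```python
import re


def pattern_from_block(block: list) -> str:
    """Build a regular expression with one literal or character class per column."""
    width = len(block[0])
    if any(len(row) != width for row in block):
        raise ValueError("all rows of the block must have the same length")
    parts = []
    for column in zip(*block):
        residues = sorted(set(column))
        if len(residues) == 1:
            parts.append(residues[0])  # conserved column
        else:
            parts.append("[" + "".join(residues) + "]")  # variable column
    return "".join(parts)


def find_pattern(pattern: str, sequence: str) -> list:
    """Return the start positions of all matches, including overlapping ones."""
    # The lookahead makes the search report overlapping occurrences as well.
    return [m.start() for m in re.finditer("(?=(" + pattern + "))", sequence)]


block = ["CAGH", "CSGH", "CAGK"]
pattern = pattern_from_block(block)
hits = find_pattern(pattern, "MKCSGKLLCAGHW")
```

Here the pattern is `C[AS]G[HK]`, and it also matches `CSGK`, a combination that does not occur in the block itself. That generalisation is the point of using a pattern, but it also means such patterns can produce false positives.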

The selection and use of algorithms is the major issue when searching large databases. For example, on a database of size 10^9 one cannot run a DP algorithm for a query string of length up to 500, because the running time, which grows with the product of query length and database size, is prohibitive. However, this problem can be handled in different ways: (a) Implementing the DP algorithms in hardware, so that they execute much faster. The disadvantage is the high cost. Furthermore, with parallel hardware the problem can be distributed efficiently over a couple of thousand processors, and the results can be integrated later. This approach is costly, too. (b) Using heuristics that work much faster than the original DP algorithms and other exact algorithms. They rest on the following measures and observations: because the database is so large, its rather stable portions are preprocessed; substitutions are much more likely than insertions and deletions; and homologous sequences are expected to contain many segments with matches or substitutions but without insertions, deletions or gaps. These segments can be used as starting points for further searching [4].
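
The heuristic in (b) can be sketched as a seed-and-extend search: the database is preprocessed once into an index of short words (k-mers), exact word hits with the query serve as starting points, and each hit is extended without gaps. This is a deliberately simplified sketch and not the method of any particular tool: it extends only over identical residues, whereas practical heuristics also tolerate substitutions by scoring the extension. All names and the toy database are hypothetical.

```python
from collections import defaultdict


def build_kmer_index(database: dict, k: int) -> dict:
    """Preprocessing step: map every k-mer to its (sequence id, offset) occurrences."""
    index = defaultdict(list)
    for seq_id, seq in database.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].append((seq_id, i))
    return index


def find_seeds(query: str, index: dict, k: int) -> list:
    """Return (sequence id, query offset, database offset) for each exact k-mer hit."""
    seeds = []
    for q in range(len(query) - k + 1):
        for seq_id, d in index.get(query[q:q + k], ()):
            seeds.append((seq_id, q, d))
    return seeds


def extend_ungapped(query: str, target: str, q: int, d: int, k: int) -> tuple:
    """Extend a seed in both directions while residues are identical (no gaps).
    Returns (query start, target start, segment length)."""
    left = 0
    while (q - left - 1 >= 0 and d - left - 1 >= 0
           and query[q - left - 1] == target[d - left - 1]):
        left += 1
    right = k
    while (q + right < len(query) and d + right < len(target)
           and query[q + right] == target[d + right]):
        right += 1
    return q - left, d - left, left + right


database = {"s1": "MKTAYIAKQR", "s2": "GGGAYIAKLL"}
query = "TAYIAKQ"
k = 3
index = build_kmer_index(database, k)
seeds = find_seeds(query, index, k)
segment_s1 = extend_ungapped(query, database["s1"], 1, 3, k)
segment_s2 = extend_ungapped(query, database["s2"], 1, 3, k)
```

Building the index costs one pass over the database and can be reused for every query, which is why preprocessing pays off when the database is large and changes rarely. The extended segments would then be handed to a more careful, and more expensive, alignment step.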

Learning algorithms of artificial neural networks, supported by uncertainty handling, probabilities, fuzziness and heuristics, could be utilised, so that the learning mechanism can steer the running of the
