Special-Purpose Computing for Biological Sequence Analysis - Parallel Computing for Bioinformatics and Computational Biology


7.5.2 Finding the Optimal Context

In our approach, we rely on finding several matches, that is, several sub-sequences that are similar to a given pattern (combined with a certain tolerance that defines the maximal deviation from the given pattern). There is a trade-off between the length of the pattern and the tolerance on the one hand and the expected number of matches of this pattern in the database on the other. The longer the pattern, the less likely it is to find a match in the database. The larger the tolerance, the easier it is to find a match. Both dependencies are exponential: if the sequence is a sequence over an alphabet with $k$ letters and the pattern length is $l$, then there are $k^l$ different patterns of length $l$, and if the database contains $M$ patterns, the expected number of sub-sequences that match the given pattern is $E = M/k^l$ (assuming all patterns are equally likely). To do any statistical analysis on the matches found, we need $E$ to be of sufficient size. To increase $E$ for a fixed $l$, we can reduce the size of the alphabet: instead of requesting a match with the given amino acids (there are 20 different amino acids), we might only request that the corresponding amino acid is in the same class of amino acids (for example, we might place every amino acid into one of only 2 classes, such as hydrophilic and hydrophobic). The following two cases demonstrate the potential advantage of such an approach.

Case 1 : We look for all amino acid sub-sequences in the PDB database that match a given sequence of five amino acids. Given that the PDB database currently contains about $10^7$ amino acids and that there are $20^5 = 3.2 \times 10^6$ different amino acid sequences of length 5, we can expect to find about three matches.

Case 2 : We look for all amino acid sub-sequences in the PDB database that match a given pattern that has a given amino acid in its central position and specifies to its left and right a pattern of hydrophobicity of length 3 (i.e., we specify in which position we expect to have a hydrophobic amino acid and in which position we expect to have a hydrophilic amino acid). There are $20 \times 2^6 = 1280$ such patterns of length 7, and thus we can expect to find about $10^4$ matches in the PDB database ($10^7/1280 \approx 7.8 \times 10^3$), a much better basis for statistical analysis than in Case 1.
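The arithmetic behind the two cases, and the kind of reduced-alphabet match that Case 2 describes, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text names the two classes but not their members, so the hydrophobic set below is a hypothetical choice, and the function names `expected_matches`, `hp_class`, and `find_matches` are invented for this sketch.

```python
# Illustrative assumption: the text does not say which amino acids count
# as hydrophobic. This one-letter-code set is a hypothetical choice.
HYDROPHOBIC = set("AVLIMFWC")


def expected_matches(db_size, n_patterns):
    """Expected number of matches E = M / (number of equally likely patterns)."""
    return db_size / n_patterns


def hp_class(residue):
    """Map a residue to 'H' (hydrophobic) or 'P' (hydrophilic)."""
    return "H" if residue in HYDROPHOBIC else "P"


def find_matches(sequence, left, center, right):
    """Return start indices of windows whose central residue equals `center`
    and whose flanks have the hydrophobicity patterns `left` and `right`."""
    mid = len(left)
    width = mid + 1 + len(right)
    hits = []
    for i in range(len(sequence) - width + 1):
        window = sequence[i:i + width]
        if window[mid] != center:
            continue
        if "".join(hp_class(r) for r in window[:mid]) != left:
            continue
        if "".join(hp_class(r) for r in window[mid + 1:]) != right:
            continue
        hits.append(i)
    return hits


M = 10**7                                   # approximate PDB size from the text
case1 = expected_matches(M, 20**5)          # exact 5-residue sequence
case2 = expected_matches(M, 20 * 2**6)      # exact center, H/P flanks of length 3
```

With these definitions, `case1` evaluates to about 3 and `case2` to about 7800, matching the estimates in the two cases. A call such as `find_matches(seq, "HHH", "G", "PPP")` would scan a sequence `seq` for windows of length 7 with a central glycine.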

7.5.3 Develop Interactive Visualization Tools that Allow Fast Correction of Incorrect or Improbable Predictions

There are several open source visualization tools available that allow visualization and rotation of proteins in a wide variety of modes. For our application, we would like to develop interactive visualization tools that present the histogram of an angle as soon as the cursor is pointing at it. It should then be possible to point at the histogram and change the corresponding dihedral angles (arrows in Figure 7.12). The visualization tool should also have the option of highlighting amino acids of particular interest and positions that might be involved in hydrogen bonds.
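The histogram such a tool would display for one angle could be computed along the following lines. This is a minimal sketch, not the authors' implementation: it assumes the angles observed for the matching sub-sequences are given in degrees, and the function name `angle_histogram` and the default bin width of 10 degrees are hypothetical choices.

```python
def angle_histogram(angles_deg, bin_width=10):
    """Count dihedral angles (in degrees) in equal-width bins over [-180, 180).

    bin_width is assumed to divide 360 evenly. Bin 0 starts at -180 degrees;
    an angle of exactly 180 is treated as -180, since the two coincide.
    """
    n_bins = 360 // bin_width
    counts = [0] * n_bins
    for angle in angles_deg:
        wrapped = (angle + 180.0) % 360.0          # shift into [0, 360)
        index = int(wrapped // bin_width) % n_bins  # guard against rounding up to 360
        counts[index] += 1
    return counts
```

An interactive front end would call this for the angle under the cursor and redraw the bar chart. The wrap-around handling matters because dihedral angles are periodic.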

7.5.4 Extracting Structure Similarity Based on Dihedral Angles

Having the structures almost totally represented by sequences of dihedral angles opens the option of looking for similar structures, that is, similar sequences of dihedral
