Concepts of Similarity in Bioinformatics - Essays in Bioinformatics

Information Technology Reference

In-Depth Information

an analogy, features of one object are paired with features of another object. Which features

are paired is often a subjective choice. Matching of molecular structure relies in comparing

molecular descriptions finding an optimal match between the features of the mathematical

representations.

The simplest example of matching is the alignment of short character strings of

equal length described in Figure 6. Another example is to find the exact occurrence of a

short character string query within another, longer string.

With these examples one can illustrate, without formal definition, some of the

important properties of matching used in bioinformatics. Given two descriptions A, B

consisting of i and j elements, respectively, let's define a mapping m : A Æ B that assigns

certain k elements in A to certain l elements in B ; it is not necessary that all elements in A

have an element assigned in B and vice versa. In the simplest case, the mapping is one-to-

one, so certain elements in A will have a pair in B (so k=l ). In other cases, we may get

multiple matching.

Types of alignments [2]. From the philosophical point of view, matching

(alignments) of structural descriptions can arise from three specific sources. i) Due to prior

knowledge, only some parts of the molecules are considered when establishing a matching.

For example, backbone atoms of a protein must match backbone atoms of the other protein,

when the 3-D structures are compared. ii) If we deal with unstructured descriptions, such as

vectors, the elements - the vector components - match by definition, which is called

canonical matching; iii) Finally, we might be interested in the maximal matching of two

larger structures given in the form of structured descriptions (consisting of both entities and

relationships), and then we need an optimizeable similarity measure such as in equation 5,

in order to find a maximal alignment. The number of all possible alignments is very high,

so finding the optimum is often very compute-intensive and sometimes intractable problem.

Algorithmic solutions From the algorithmic point of view, the methods can be

subdivided a) according to the structure types (character strings, graphs, 3D structures), b)

according to the nature of the matches that are being sought (exact matching, approximate

matching), or according to number of partners compared (pairwise alignments or multiple

alignments).

The majority of algorithms can cope only with the simplest descriptions, character

sequences. Finding exact matches between two sequences of n and m characters

respectively, has a complexity of O(n,m) . Comparison of graphs is much more difficult,

here the majority of the problems are NP complete, i.e. computationally not tractable.

Identity of graphs is determined by graph isomorphism algorithms, and similarity of graphs

(such as protein structures, metabolic pathways) is a subgraph isomorphism problem, which

is more difficult, and is aggravated by the fact that the structures in question are labelled

graphs. Rigorous comparison of complex descriptions such as 3D structures is np-hard.

Time raqirements can be decreased if we use unstructured descriptions that do not

need an alignment for comparison. These descriptions are simpler, and perform well only if

one can find an adequate resolution, such as a multidimensional vector. The calculations are

fast, but there is no guarantee that the results will be similar to those produced by alignment

methods. On the other hand, such methods are useful as preliminary filters used to screen

large databases.

Heuristic solutions provide the second general avenue for diminishing computer

times. Most of the alignment methods use some kind of heuristics. There are two important

heuristics that are used to simplify the process of alignment, the principle of linearity and

the use of higher order descriptions.

The principle of linearity is based on the chain-like topology of protein and DNA

molecules. All biologically important alignments contain short stretches that are very

similar or identical between the two molecules. So instead of testing all possible

Essays in Bioinformatics

Search WWH ::

Custom Search

Home