Information Technology Reference
In-Depth Information
alignments, one can start identifying the highly similar chain segments and then combining
them into larger alignment, which is computationally much less expensive. This philosophy
it can be used both for sequences and for 3-D chain. Let A and B be polypeptide chains of l
and k residues, respectively, and n
A denote a contiguous fragment of n residues of protein A
starting at residue i . In this case,
n
n
A 1 ,
n
A 2
will be an ordered list of
overlapping fragment descriptions covering the entire chain of protein A . Let's provide such
a list for both proteins and compare the fragments using a proximity measure
)
[
A
]
=
A
n
n
1
n
i
n
j
PM . PM must be a proximity measure that can be unequivocally
determined for any two fragments. In most cases this means that no alignment is needed
between fragments compared (alignment and gaps would make the process prohibitively
expensive). In some cases the precomputed or a priori known values of PM are stored in
lookup tables.
PM
(
A
,
B
i
,
j
PM , values define a so-called similarity matrix, which is a symmetrical
matrix of k x l elements (more accurately (l-n+1)x(k-n+1) elements). If
i
j
PM , is a
similarity measure, similar segments within two proteins appear as series of large values
parallel to the diagonal of the matrix. The similarity matrix is used - under various names -
for the determination of an overall alignment in several algorithms, many of which use
dynamic programming techniques. Global alignments that extend from the beginning to the
end of both sequences are found via an exhaustive search for the maximal matching, based
on such methods as the Needleman-Wunsch algorithm [38]. Local alignments can be found
via similar strategies, such as the Smith-Waterman and the Sellers [39] algorithms, as well
as by heuristic solutions such as the FASTA and the BLAST algorithms.
All of these algorithms were developed for sequence alignment, where the
fragments are overlapping n -words of amino acids, the scoring is be based on a sequence
similarity score such as in equation 5. Naturally, one can also use 3-D description of the
backbone for longer peptide segments of 3D structures, and use the rmsd distance for
comparison. The actual algorithms are problem dependent, further examples are given in
section 4.
The principle of higher order descriptions is based on the simple fact that comparing
a smaller number of higher-order elements takes less time. The best example is the
comparison of protein structures in terms of secondary structure elements. In addition to
decreased computer time, higher order descriptions, such as secondary structure elements
incorporate a great deal of human knowledge. As a consequence, the results of comparisons
are usually close to human understanding.
Finally we mention that alignment is an optimisation problem, so all optimisation
algorithms can be used for aligning structures. The optimum is understood in the context of
the chosen representation and scoring scheme and may involve parameters that have to be
adjusted on an empirical basis. Most users would therefore agree that alignments produced
by computer programs can always be improved upon visual inspection.
i
j
4. Similarity spaces
In the foregoing, we reviewed the mathematical concepts relevant to the definition of
similarity in bioinformatics: equivalence, matching, partial ordering, and proximity. These
relationships arise in the context of a mathematical space. A mathematical space suitable
for molecular similarity analysis is called a molecular similarity space and is defined to
consist of a) a set of mathematical representations of molecules and b) one or more
similarity relationships defined on this set. For example, one of the possible protein
similarity spaces contains the sequences as representations, plus a set of equivalence
Search WWH ::




Custom Search