Information Technology Reference
In-Depth Information
matches between bases, while the whitespace characters stand for mismatches
(probably substitutions that occurred in the evolutionary process). The spe-
cial gap symbol “-” represents potential insertions and deletions. Accordingly,
the evolutionary history between the sequences s 1 and s 2 might be that two
base substitutions (G to C at the third base, T to A two positions to the last
base), and an insertion of a G into the central CC or a deletion of the G from
the central CGC occurred.
Fig. 3.3 Simple multiple protein sequence alignment
A slightly more complex alignment is shown in Figure 3.3, which depicts
a part of an alignment of multiple protein sequences. As in the Figure 3.2,
the asterisk marks exact matches and the whitespace character marks mis-
matches. Additionally, the colon symbol is used to indicate conserved sub-
stitutions (i.e., replacements of amino acids by other ones that have similar
chemical properties) while a single dot indicates semi-conserved substitutions
(i.e., substitutions by an amino acid that has a similar spatial structure, but
that does not have the same chemical properties).
The quality of an alignment can be expressed by a score , that is, a numer-
ical value that is the sum of the scores of all positions in the alignment. For
instance, simply assume the score for a match to be +1, for a mismatch 0
and for a gap -1, then the score for the example given in Figure 3.2 would be
7. Given a particular scoring function, the alignment that yields the highest
score under these conditions is called the optimal alignment. Choosing appro-
priate scoring strategies is crucial in order to obtain reliable results. Roughly
speaking, scoring functions for nucleotide sequences (consisting of four bases)
are fundamentally different from those for amino acid sequences (20 amino
acids that are encoded by base triples), and taking into account further as-
pects like evolutionary time or insertion/deletion probabilities makes the task
no easier. Although the community has developed a set of standard scoring
methods for common purposes (like, e.g., the PAM substitution matrix [81]
or the BLOSUM matrices [129], it is often necessary to adjust the scoring
scheme to the specific present case.
Pairwise sequence alignments are usually computed by dynamic program-
ming methods, such as the Needleman-Wunsch alignment algorithm [236] or
the Smith-Waterman algorithm [288]. In realistic experiments, however, it is
often necessary to compare more than two sequences and hence to compute
multiple sequence alignments. Applying the aforementioned methods for the
alignment of multiple sequences becomes numerically intractable very fast,
since it corresponds to searches through high-dimensional spaces instead of
 
Search WWH ::




Custom Search