Phylogenetic Analysis Workflows - User-Level Workflow Design: A Bioinformatics Perspective

Information Technology Reference

In-Depth Information

matches between bases, while the whitespace characters stand for mismatches

(probably substitutions that occurred in the evolutionary process). The spe-

cial gap symbol “-” represents potential insertions and deletions. Accordingly,

the evolutionary history between the sequences s 1 and s 2 might be that two

base substitutions (G to C at the third base, T to A two positions to the last

base), and an insertion of a G into the central CC or a deletion of the G from

the central CGC occurred.

Fig. 3.3 Simple multiple protein sequence alignment

A slightly more complex alignment is shown in Figure 3.3, which depicts

a part of an alignment of multiple protein sequences. As in the Figure 3.2,

the asterisk marks exact matches and the whitespace character marks mis-

matches. Additionally, the colon symbol is used to indicate conserved sub-

stitutions (i.e., replacements of amino acids by other ones that have similar

chemical properties) while a single dot indicates semi-conserved substitutions

(i.e., substitutions by an amino acid that has a similar spatial structure, but

that does not have the same chemical properties).

The quality of an alignment can be expressed by a score , that is, a numer-

ical value that is the sum of the scores of all positions in the alignment. For

instance, simply assume the score for a match to be +1, for a mismatch 0

and for a gap -1, then the score for the example given in Figure 3.2 would be

7. Given a particular scoring function, the alignment that yields the highest

score under these conditions is called the optimal alignment. Choosing appro-

priate scoring strategies is crucial in order to obtain reliable results. Roughly

speaking, scoring functions for nucleotide sequences (consisting of four bases)

are fundamentally different from those for amino acid sequences (20 amino

acids that are encoded by base triples), and taking into account further as-

pects like evolutionary time or insertion/deletion probabilities makes the task

no easier. Although the community has developed a set of standard scoring

methods for common purposes (like, e.g., the PAM substitution matrix [81]

or the BLOSUM matrices [129], it is often necessary to adjust the scoring

scheme to the specific present case.

Pairwise sequence alignments are usually computed by dynamic program-

ming methods, such as the Needleman-Wunsch alignment algorithm [236] or

the Smith-Waterman algorithm [288]. In realistic experiments, however, it is

often necessary to compare more than two sequences and hence to compute

multiple sequence alignments. Applying the aforementioned methods for the

alignment of multiple sequences becomes numerically intractable very fast,

since it corresponds to searches through high-dimensional spaces instead of

Search WWH ::

Custom Search

Home