Phylogenetic Tree Reconstruction: Geometric Approaches - Mathematical Concepts and Methods in Modern Biology

One may reconstruct a phylogenetic tree among distinct groups or species from

morphological data obtained by measuring and quantifying the phenotypic properties

of representative organisms via, for example, parsimony. However, recent

phylogenetic analysis uses nucleotide sequences encoding genes or amino acid sequences

encoding proteins as the basis for classification. Therefore in this chapter we focus on

phylogenetic tree reconstruction methods based on sequenced genes or genomic data

sets. Over the course of evolutionary history, a character (e.g., a nucleotide) of a sequence

might be changed to another one, deleted, or inserted. Thus, before reconstructing a

phylogenetic tree, we have to align an input sequence data set, i.e., identify sequences

of characters (DNA bases, amino acids, etc.) that are thought to represent

the same (“homologous”) regions of genes (from different species or different gene

families, etc.) and then line up the sequences so that characters in different sequences

can be compared site-by-site, where one site may vary from another by mutation, i.e.,

insertions, deletions, or substitutions of characters. Such a line-up of two sequences

is an alignment; if we have multiple sequences then the result is properly called a

multiple alignment, although we will often abuse terminology and use “alignment”

for both. Aligning multiple sequences is generally known to be an NP-hard

problem [7, 8], and there are scores of approaches to creating alignments heuristically.

Here we assume that we have a perfectly aligned sequence data set and focus on the

reconstruction of phylogenetic trees from a multiple alignment.
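
To make the idea of a site-by-site comparison concrete, here is a minimal Python sketch that classifies each column of a pairwise alignment as a match, a substitution, or an insertion/deletion, and then computes the fraction of differing sites among the ungapped columns. The function names, the gap symbol "-", and the two example sequences are illustrative assumptions, not taken from the chapter.

```python
def compare_sites(seq_a, seq_b, gap="-"):
    """Classify every column of a pairwise alignment.

    Assumes both sequences are already aligned (equal length) and that
    `gap` marks an inserted or deleted character.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    counts = {"match": 0, "substitution": 0, "indel": 0}
    for a, b in zip(seq_a, seq_b):
        if a == gap and b == gap:
            continue  # column says nothing about this pair
        if a == gap or b == gap:
            counts["indel"] += 1
        elif a == b:
            counts["match"] += 1
        else:
            counts["substitution"] += 1
    return counts


def p_distance(seq_a, seq_b, gap="-"):
    """Fraction of ungapped sites at which the two sequences differ."""
    counts = compare_sites(seq_a, seq_b, gap)
    compared = counts["match"] + counts["substitution"]
    if compared == 0:
        raise ValueError("no ungapped sites to compare")
    return counts["substitution"] / compared


# Hypothetical aligned sequences, for illustration only.
a = "ACGT-ACGTA"
b = "ACCTTAC-TA"
print(compare_sites(a, b))  # {'match': 7, 'substitution': 1, 'indel': 2}
print(p_distance(a, b))     # 0.125
```

Ignoring gapped columns is only one possible convention; other treatments of insertions and deletions are equally reasonable in a sketch like this.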

There are several methods to infer a phylogenetic tree from a given alignment,

including the maximum-likelihood (ML) method, distance-based methods,

parsimony-based methods, Bayesian inference methods, and so on (see [1] for more

details). In this chapter we will focus on distance-based methods.

Distance-based methods build phylogenetic trees from aligned sequences by

making use of pairwise “distances” between the sequences. These distances arise from a

model of sequence evolution, or evolutionary model, which encodes certain hypotheses

about how sequences evolve, often including the probabilities that one character

at a given site will transition to another, as well as assumptions about how sequences

as a whole then transform, via evolution, into one another. Models of sequence evolution

are needed to address the problem that observed sequences may have experienced

much more change over time than their elements alone might show. By way of analogy,

observing that a light switch is currently off does not by itself

specify the number of times it has previously been flipped on and off. Many

common models of sequence evolution are, in the lingo of mathematicians, given by

a continuous-time Markov model with a substitution rate matrix whose off-diagonal entries are

the instantaneous rates at which characters change to one another; the probabilities of change over a given time interval are derived from this matrix.
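
As an illustration of how an evolutionary model turns observed differences into a distance, the following Python sketch uses the Jukes-Cantor (JC69) model. The choice of JC69 is our assumption for the example; the text above does not single out any particular model. Under JC69 all substitutions occur at the same rate, and two sequences separated by d expected substitutions per site are expected to differ at a fraction p = (3/4)(1 - exp(-4d/3)) of their sites. Inverting this relation corrects for the hidden changes of the light-switch analogy.

```python
import math


def jc69_expected_p(d):
    """Expected fraction of differing sites under JC69 (assumed model)
    for sequences separated by d expected substitutions per site."""
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))


def jc69_distance(p):
    """Invert the relation above: estimate d from an observed fraction p
    of differing sites. Only defined for 0 <= p < 3/4."""
    if not 0.0 <= p < 0.75:
        raise ValueError("JC69 distance is defined only for 0 <= p < 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# The observed fraction p saturates below 3/4 as d grows, so the
# corrected distance is larger than p whenever p is positive.
for d in (0.1, 0.5, 1.0, 2.0):
    p = jc69_expected_p(d)
    print(f"d = {d:.1f}  expected p = {p:.4f}  recovered d = {jc69_distance(p):.4f}")
```

The gap between p and the recovered d widens as sequences diverge, which is why the raw proportion of differing sites underestimates the amount of change.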

Each such evolutionary model given by a continuous-time Markov chain is

associated with a phylogenetic tree in which each node is a sequence, and each (directed)

edge links an ancestral sequence to a descendant that evolved from it according to a

substitution rate matrix. In this way, the phylogenetic tree summarizes the relationships

between the species (or other organisms represented by the sequences) in terms of

common ancestors (nodes) and evolutionary changes via edge lengths (e.g., times

since divergence, number of substitution events, etc.).
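
A tree with edge lengths can be sketched as a weighted adjacency structure, and the distance between two sequences is then the sum of the edge lengths on the unique path joining them. The Python example below is a minimal sketch under our own assumptions: the node names, the edge lengths, and the helper functions are hypothetical, and edges are treated as undirected for the purpose of summing path lengths.

```python
def build_tree(edges):
    """Build an adjacency dict {node: {neighbor: edge_length}}
    from a list of (node, node, length) triples."""
    adjacency = {}
    for u, v, length in edges:
        adjacency.setdefault(u, {})[v] = length
        adjacency.setdefault(v, {})[u] = length
    return adjacency


def path_length(tree, start, goal):
    """Sum the edge lengths on the unique path between two nodes.

    Depth-first search; tracking the parent is enough to avoid
    revisiting nodes because a tree has no cycles.
    """
    stack = [(start, None, 0.0)]
    while stack:
        node, parent, dist = stack.pop()
        if node == goal:
            return dist
        for neighbor, length in tree[node].items():
            if neighbor != parent:
                stack.append((neighbor, node, dist + length))
    raise KeyError(f"no path from {start!r} to {goal!r}")


# Hypothetical tree: leaves A, B, C; internal nodes "root" and "x".
tree = build_tree([
    ("root", "x", 0.1),
    ("x", "A", 0.2),
    ("x", "B", 0.3),
    ("root", "C", 0.5),
])
for pair in (("A", "B"), ("A", "C"), ("B", "C")):
    print(pair, round(path_length(tree, *pair), 3))
```

Here A and B share the more recent common ancestor x, so their path is shorter than either path to C.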
