2. Benchmarks based on consistency among several alignment
techniques.
3. Benchmarks based on the three-dimensional structure of the
proteins encoded by sequence data.
4. Benchmarks based on knowledge of, or assumption about, the
phylogeny of the aligned biological sequences.
In the remainder of this chapter, we analyze each of these
benchmarking approaches to point out their pros and cons, and
determine how well they satisfy the criteria defined above and
summarized in Table 1.
2 Simulated Sequences
Given that a major objective of MSA is to identify residues that
evolved from a common ancestor, i.e., to optimize for homology in
the alignment, one approach to benchmarking involves generating
families of artificial sequences by a process of simulated evolution
along a known tree. Such simulation-based approaches adopt a
probabilistic model of sequence evolution to describe nucleotide
substitution, deletion, and insertion rates, while keeping track of
“true” relationships of homology between individual residue
positions. Since these are known, a “true” reference alignment and a
test alignment based on the simulated sequence data, assembled by
a particular MSA tool of choice, can be compared and measures of
accuracy estimated (see below). There are many packages that will
perform simulated sequence evolution, including Rose [12],
DAWG [13], EvolveAGene3 [14], INDELible [15], PhyloSim
[16], REvolver [17], and ALF [18].
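To make the idea of tracking "true" homology concrete, the following is a minimal sketch of simulated evolution along a known tree. It is not how any of the packages listed above is implemented. The function names (evolve_branch, simulate_family), the nested-tuple tree encoding, and the fixed per-branch substitution, insertion, and deletion probabilities are all assumptions made for illustration. Every residue carries a site identifier, so residues that descend from the same ancestral site end up in the same column of the reference alignment.

```python
import random

ALPHABET = "ACGT"

def evolve_branch(seq, order, rng, p_sub=0.10, p_del=0.03, p_ins=0.03):
    """Evolve a sequence along one branch (illustrative per-branch probabilities).

    seq   -- list of (site_id, base) pairs
    order -- shared left-to-right ordering of every site id created so far;
             it defines the columns of the true alignment
    """
    child = []
    for site_id, base in seq:
        r = rng.random()
        if r < p_del:
            continue  # site is deleted in this lineage
        if r < p_del + p_sub:
            base = rng.choice([b for b in ALPHABET if b != base])  # substitution
        child.append((site_id, base))
        if rng.random() < p_ins:
            # Single-residue insertion: a new site with no homolog elsewhere,
            # placed immediately to the right of the current site.
            new_id = len(order)
            order.insert(order.index(site_id) + 1, new_id)
            child.append((new_id, rng.choice(ALPHABET)))
    return child

def simulate_family(tree, root_length=60, seed=0):
    """Simulate along a tree given as nested tuples with leaf-name strings.

    Returns (unaligned, aligned): the sequences handed to an MSA tool and
    the true reference alignment, both as dicts keyed by leaf name.
    """
    rng = random.Random(seed)
    order = list(range(root_length))
    root = [(i, rng.choice(ALPHABET)) for i in range(root_length)]
    leaves = {}

    def descend(node, seq):
        if isinstance(node, str):
            leaves[node] = seq
            return
        for child in node:
            descend(child, evolve_branch(seq, order, rng))

    descend(tree, root)

    unaligned = {n: "".join(b for _, b in s) for n, s in leaves.items()}
    rows = {}
    for name, seq in leaves.items():
        present = dict(seq)
        rows[name] = "".join(present.get(i, "-") for i in order)
    # Drop columns whose site was lost in every leaf.
    keep = [c for c in range(len(order))
            if any(row[c] != "-" for row in rows.values())]
    aligned = {n: "".join(row[c] for c in keep) for n, row in rows.items()}
    return unaligned, aligned

unaligned, aligned = simulate_family((("A", "B"), ("C", "D")), root_length=40, seed=1)
for name in sorted(aligned):
    print(name, aligned[name])
```

The unaligned sequences would be passed to the MSA tool under test, and its output compared against the aligned reference. Dedicated simulators use branch lengths, richer substitution models, and multi-residue indels, all of which this sketch omits.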
To quantify the agreement between the reconstructed alignment
and the true alignment (known from the simulation), two
measures of accuracy are commonly employed: the sum-of-pairs
(SP) and the true column (TC) scores [19]. The former is defined
as the fraction of residue pairs aligned in the true alignment that
are also aligned in the reconstructed alignment, averaged over all
pairwise comparisons between individual sequences; the latter is
defined as the fraction of columns of the true alignment that are
reproduced exactly in the reconstructed alignment. Because the TC
score treats whole columns as the unit of comparison, a single
misaligned sequence can reduce the TC score to zero. For this reason,
when considering numerous or divergent sequences, the finer-grained SP
score is usually used. Yet even the SP score is not without problems.
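The two scores just defined can be sketched in a few lines. This is a hedged illustration, not the reference implementation of [19]. The function names (columns, residue_pairs, sp_tc) are hypothetical, and alignments are assumed to be dicts mapping sequence names to equal-length gapped strings with "-" as the gap character. Two simplifications are also assumed: residue pairs are pooled across all sequence pairs instead of being averaged per pair, and only reference columns containing at least two residues are counted for TC.

```python
from itertools import combinations

def columns(aln, names):
    """Describe each column as a tuple of residue indices (None for a gap)."""
    counters = [0] * len(names)
    cols = []
    for c in range(len(aln[names[0]])):
        col = []
        for k, name in enumerate(names):
            if aln[name][c] == "-":
                col.append(None)
            else:
                col.append(counters[k])  # index of this residue in its sequence
                counters[k] += 1
        cols.append(tuple(col))
    return cols

def residue_pairs(cols):
    """Set of all residue pairs that share a column."""
    pairs = set()
    for col in cols:
        for (i, a), (j, b) in combinations(enumerate(col), 2):
            if a is not None and b is not None:
                pairs.add((i, a, j, b))
    return pairs

def sp_tc(reference, test):
    """Return (SP, TC) of a test alignment against a reference alignment.

    Assumes the reference aligns at least one residue pair.
    """
    names = sorted(reference)
    ref_cols = columns(reference, names)
    test_cols = columns(test, names)

    ref_pairs = residue_pairs(ref_cols)
    sp = len(ref_pairs & residue_pairs(test_cols)) / len(ref_pairs)

    scored = [c for c in ref_cols if sum(x is not None for x in c) >= 2]
    test_set = set(test_cols)
    tc = sum(c in test_set for c in scored) / len(scored)
    return sp, tc

reference = {"s1": "ACGT", "s2": "AC-T", "s3": "ACGT"}
test      = {"s1": "ACGT", "s2": "A-CT", "s3": "ACGT"}
print(sp_tc(reference, test))  # (0.8, 0.5)
```

In this toy case a single shifted residue in s2 breaks two of the four reference columns (TC = 0.5) but only two of the ten reference residue pairs (SP = 0.8), which illustrates why TC is the harsher of the two measures.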
For instance, pairwise comparisons ignore correlations among
sequences, meaning that closely related sequences contribute
disproportionately more to the SP score than they do to the total
phylogenetic information contained in the alignment; this can be
misleading in phylogenetic applications. More generally, SP and TC