Table 1. The advantages and risks of the four approaches to MSA benchmarking. Examples are given of relevant software packages, benchmark databases and tests.
Approach: Simulation-based
  Advantages:
    Solvability: “true” homology is known
    Evolving: different scenarios can be modelled
    Scalability: new data can be generated ad libitum
  Risks:
    Relevance: simulated data might strongly differ from real biological data
    Independence: MSA parameters might resemble those used in simulation
  Examples (references):
    Rose [12]
    DAWG [13]
    EvolveAGene3 [14]
    iSGv2.0 [48]
    INDELible [15]
    PhyloSim [16]
    ALF [18]

Approach: Consistency-based
  Advantages:
    Scalability: not constrained to a particular reference set
    Accessibility: tests are easy and quick
  Risks:
    Relevance: consistent MSA methods may be collectively biased
    Independence: similar scores might be used in MSA inference
  Examples (references):
    MUMSA [26, 49]
    HoT [27]

Approach: Structure-based
  Advantages:
    Relevance: closely matches a major biological objective of MSA
    Independence: empirical data is used as input
  Risks:
    Relevance: limited to structurally conserved regions; biological objective of MSA may vary
    Scalability: only applicable to small subset of protein sequences
  Examples (references):
    HOMSTRAD [10, 30]
    OXBench [40]
    PREFAB [33]
    SABMARK [32]
    BAliBASE 3.0 [11, 31]
    STRIKE [50]

Approach: Phylogeny-based
  Advantages:
    Relevance: closely matches a major biological objective of MSA
    Independence: empirical data is used as input
    Scalability: broad array of sequence data can be used as input
  Risks:
    Relevance: biological objective of MSA may vary from phylogenetic reconstruction
  Examples (references):
    Species-tree discordance test [44]
    Minimum duplication test [44]

are not proper metrics (they do not satisfy the conditions of symmetry or triangle inequality), which has motivated the recent development of better-founded alternatives [20].
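
The failure of symmetry can be illustrated with a small sketch. The subject of the preceding sentence is cut off in this excerpt, so the example assumes it refers to reference-based scores of the sum-of-pairs kind, that is, the fraction of residue pairs aligned in the reference that are also aligned in the test alignment. The function names `aligned_pairs` and `sp_score` and the two toy alignments are hypothetical, chosen only for illustration.

```python
from itertools import combinations


def aligned_pairs(alignment):
    """Return the set of residue pairs that share a column.

    `alignment` is a list of equal-length gapped strings ('-' is a gap).
    Each residue is identified by (sequence index, ungapped position).
    """
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all rows must have the same length")
    counters = [0] * len(alignment)  # next ungapped position per sequence
    pairs = set()
    for col in range(length):
        present = []
        for seq_idx, row in enumerate(alignment):
            if row[col] != "-":
                present.append((seq_idx, counters[seq_idx]))
                counters[seq_idx] += 1
        # every pair of residues in this column counts as aligned
        pairs.update(combinations(present, 2))
    return pairs


def sp_score(test, reference):
    """Assumed sum-of-pairs-style score: shared pairs / pairs in reference."""
    ref_pairs = aligned_pairs(reference)
    if not ref_pairs:
        raise ValueError("reference has no aligned pairs")
    return len(aligned_pairs(test) & ref_pairs) / len(ref_pairs)


# Two toy alignments of the same sequences, ACGT and AGT.
aln_a = ["ACGT", "A-GT"]
aln_b = ["ACGT-", "A--GT"]

# Swapping the roles of test and reference changes the value,
# so 1 - score cannot serve as a symmetric distance.
print(sp_score(aln_b, aln_a), sp_score(aln_a, aln_b))
```

Because the score is normalised by the number of pairs in the reference only, two alignments that align different numbers of residue pairs will generally give different values depending on which one is treated as the reference.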
Besides the advantage of knowing the true alignment, the fact that the parameters for simulated sequence evolution are user-defined directly translates into great flexibility to address specific questions or to investigate the effect of individual factors in