differ from BAliBASE in that they are derived by automatic means,
rather than by manual annotation of protein alignments. Reference
sets also exist for RNA structures [ 34 ]. For further discussion of
these datasets, we direct the reader to reviews by Aniba et al. [ 2 ],
Edgar [ 3 ], Kim and Sinha [ 35 ], and Thompson et al. [ 4 ].
Regarding the desirable criterion of independence, alignment
algorithms that incorporate structural aspects of sequence data do
exist, such as Dynalign [ 36 ] and Foldalign [ 37 ] (for a more
exhaustive discussion of RNA structural alignments, see Gardner
et al. [ 34 ]). Nevertheless, the parameters that go into constructing
structure-based reference datasets are usually completely detached
from the considerations that go into the development of MSA
workflows.
Despite the degree of confidence structural alignment confers,
it has been observed that sequence alignments used in BAliBASE
and PREFAB are not always consistent with known annotations
from external sources such as the CATH and SCOP databases, thus
calling into question their biological accuracy [ 3 ]. Both manual and
automated structural benchmark construction face considerable
challenges. Manually curated structural benchmarks, while usually
believed to generate more biologically accurate results than
automated procedures, might also introduce subjective bias in the
alignment. Automated procedures ensure reproducibility, but
cannot avoid the existence of debatable parameter choices (e.g.,
the choice of the minimum spatial distance for two residues to be
considered in the same fold) and potential systematic errors.
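To make the sensitivity to such a parameter concrete, the following is a minimal sketch, not any benchmark's actual procedure. It assumes the threshold is applied as an upper bound on the distance between α-carbons of two already superposed structures; the function name, the toy coordinates, and the cutoff values are all hypothetical.

```python
import math

def equivalent_pairs(coords_a, coords_b, cutoff):
    """Return index pairs (i, j) whose alpha-carbons lie within `cutoff`.

    coords_a and coords_b are lists of (x, y, z) tuples for two structures
    that are assumed to be already superposed in a common frame.
    """
    pairs = []
    for i, a in enumerate(coords_a):
        for j, b in enumerate(coords_b):
            if math.dist(a, b) <= cutoff:
                pairs.append((i, j))
    return pairs

# Toy coordinates (hypothetical, not taken from any real structure).
structure_a = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
structure_b = [(0.5, 0.0, 0.0), (3.8, 2.5, 0.0), (7.6, 4.0, 0.0)]

# The set of "equivalent" residue pairs changes with the chosen cutoff.
for cutoff in (1.0, 3.0, 5.0):
    print(cutoff, equivalent_pairs(structure_a, structure_b, cutoff))
```

With these toy values, a strict cutoff keeps a single pair, while a loose one admits several competing pairs for the same residue, which is the kind of ambiguity that makes the parameter choice debatable.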
The nontrivial relationship between structural similarity of
residues and alignment, however, is not the only cause of concern in
structural benchmarks. Specifically, structure superpositions used for
creating structural benchmarks are often not only based on
experimentally derived structures, but also on primary sequence-based
procedures such as BLASTP [ 38 ] and NORMD [ 39 ], which
themselves employ amino acid substitution matrices and gap penalty
scores, and thus make modelling assumptions about the sequences
to be aligned [ 3 ]. If these parameters overlap with the parameters
employed in MSA methods under test, then reference alignments
obtained this way will be biased towards MSA-derived alignments
that used those same parameters.
Problems that arise when benchmarking against reference
alignments derived from structural comparisons can be partially
overcome by directly using structural measures that are independent
of any reference alignment. To evaluate the structure
superposition implied by an MSA, Raghava et al. [ 40 ] adopted scores
from a sequence-based multiple structure alignment algorithm
[ 41 ]. Such structure similarity scores approximate the location of
an amino acid in a test alignment by the location of its α-carbon
(backbone carbon to which the amino acid side-chain attaches).
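As an illustration of scoring a test alignment directly from α-carbon positions, the sketch below walks two gapped alignment rows and measures the distance between the α-carbons of residues that share a column. It is a simplified stand-in and not the scoring function of Raghava et al. [ 40 ]; the function name, the toy rows, and the coordinates are hypothetical, and the coordinates are assumed to be in a common superposed frame.

```python
import math

def aligned_ca_distances(row_a, row_b, ca_a, ca_b):
    """Distances between alpha-carbons of residues sharing an alignment column.

    row_a, row_b: two gapped rows of equal length from a test alignment
    ('-' marks a gap).
    ca_a, ca_b: alpha-carbon (x, y, z) coordinates, one per ungapped residue,
    assumed to be expressed in a common superposed frame.
    """
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have equal length")
    distances = []
    i = j = 0  # index of the next residue in each ungapped sequence
    for x, y in zip(row_a, row_b):
        if x != "-" and y != "-":
            distances.append(math.dist(ca_a[i], ca_b[j]))
        if x != "-":
            i += 1
        if y != "-":
            j += 1
    return distances

# Toy alignment and coordinates (hypothetical values for illustration only).
row_a = "AC-D"
row_b = "A-GD"
ca_a = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
ca_b = [(0.0, 0.0, 1.0), (3.8, 0.0, 0.0), (7.6, 3.0, 4.0)]

distances = aligned_ca_distances(row_a, row_b, ca_a, ca_b)
mean_distance = sum(distances) / len(distances)
print(distances, mean_distance)
```

Only columns where both rows hold a residue contribute a distance; how such distances are combined into a single score is a separate design choice, and the plain mean used here is only one option.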
Two aligned amino acids are then compared by the distance between
α