where d is the distance between each of the N pairs of equivalent atoms in two optimally
superposed structures. To calculate rmsd, a range of alignment has to be defined
within which the matching of atoms (the establishment of equivalent atoms within the two
structures) is determined, which is computationally a much harder problem than the
alignment of sequences in one dimension. Once the equivalence of atoms is established, the
optimal superposition has to be found, which is carried out by straightforward
algorithms such as that of Kabsch [26].
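Once the equivalences are given, the two steps described above (optimal superposition, then the root mean square of the pair distances d) can be sketched in a few lines. The following is a minimal illustration, assuming NumPy is available and that row i of one coordinate array is equivalent to row i of the other; the function name kabsch_rmsd is hypothetical, and the SVD-based formulation is one common way of writing a Kabsch-style superposition, not necessarily the exact procedure of [26].

```python
import numpy as np

def kabsch_rmsd(P, Q):
    """rmsd between two (N, 3) coordinate sets after optimal superposition.

    Assumption: row i of P and row i of Q are already known to be
    equivalent atoms (the matching problem is not solved here).
    """
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    if P.shape != Q.shape or P.ndim != 2 or P.shape[1] != 3:
        raise ValueError("P and Q must both have shape (N, 3)")

    # Remove the translation by centring both sets on their centroids.
    Pc = P - P.mean(axis=0)
    Qc = Q - Q.mean(axis=0)

    # Covariance matrix of the paired coordinates and its SVD.
    U, _, Vt = np.linalg.svd(Pc.T @ Qc)

    # Guard against an improper rotation (a reflection).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T

    # Rotate P onto Q and take the root mean square of the N distances.
    diff = Pc @ R.T - Qc
    return float(np.sqrt((diff ** 2).sum() / len(P)))
```

Applied to two copies of the same coordinates that differ only by a rotation and a translation, this returns a value that is zero up to rounding error, in line with the statement that rmsd is 0 for identical conformations.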
If the equivalences are fixed, then rmsd can be considered as a simple distance that
can be computed with a straightforward algorithm. This is the case, for instance, when one
compares different conformations of the same protein, such as those produced by NMR methods.
In this case the equivalences of the atoms are known a priori, since each conformation
consists of the same atoms. The rmsd is 0 for identical structures (identical conformations),
and its value increases as the two structures become more divergent. In fact, rmsd values
are considered reliable indicators of structural variability when applied to very similar
proteins (say rmsd < 5-6 Å). But even in this case, the rmsd value obviously depends on the
number of residues N included in the structural alignment. A statistical analysis of a large
number of structures showed that the dependence can be described as:
\[ \mathit{rmsd} = \mathit{rmsd}_{100}\left(1 + \ln\sqrt{\frac{N}{100}}\right) \qquad [7] \]
where rmsd_100 is a constant, an rmsd value standardized to 100 residues [27]. The rmsd
values also depend on the crystallographic resolution, which is more difficult to take into
consideration (Carugo, 2002). As a result, rmsd does not behave as a metric distance for
divergent structures, so it cannot be used by itself for automated clustering. Clearly, an
rmsd value of, say, 3 Å has a different significance for proteins of 500 residues and for
those of 50 residues, so that, for example, the structural variability of fold types cannot
be easily compared (rmsd_100, on the other hand, may be useful for such comparisons [27]).
In other words, rmsd is a good indicator of structural identity, but less so of structural
divergence.
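The effect of chain length on the 3 Å example can be made concrete with a small calculation. The sketch below assumes that the size dependence has the form rmsd = rmsd_100 (1 + ln sqrt(N/100)), solved here for rmsd_100; the function name rmsd_100 and its argument names are illustrative only.

```python
import math

def rmsd_100(rmsd, n_residues):
    """Rescale an rmsd measured over n_residues to a 100-residue reference.

    Assumed relation: rmsd = rmsd_100 * (1 + ln(sqrt(N / 100))).
    """
    if n_residues <= 0:
        raise ValueError("n_residues must be positive")
    scale = 1.0 + math.log(math.sqrt(n_residues / 100.0))
    if scale <= 0.0:
        # The assumed relation breaks down for very short alignments
        # (N below roughly 14 residues), where the factor is not positive.
        raise ValueError("alignment too short for this normalization")
    return rmsd / scale

# The same raw value of 3 Å, observed over alignments of different length.
for n in (50, 100, 500):
    print(n, round(rmsd_100(3.0, n), 2))
```

Under this assumption, a raw rmsd of 3 Å corresponds to a standardized value of about 4.6 Å for 50 residues and about 1.7 Å for 500 residues, which illustrates why raw values for proteins of different size are not directly comparable.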
The algorithms for calculating rmsd are beyond our scope; the reader is referred to
recent reviews [28]. The philosophy of the calculation depends on whether or not the
alignment, i.e. the equivalences between residues (represented as Cα atoms), is known. If
it is, the very popular algorithm of Kabsch [26] and McLachlan (1978) can be used. If it is
not, and when the two 3-D models being compared are too different, there are two
alternatives: either a partial alignment is available, or no a priori assumptions can be made.
In the first case, a few equivalences between atom pairs are assumed and they are extended
(and sometimes rejected) through dynamic programming techniques [29]. In the second case,
an exhaustive search is performed by rotating and translating one 3-D model over the other
in six dimensions (Diedrichs, 1995).
It has to be noted that superposition of divergent protein 3-D structures is often a
quite arbitrary exercise, and various superposition algorithms may lead to completely
different results. An effective, recently proposed procedure to reconcile different structural
alignment procedures consists of an iterative reduction of the number of aligned Cα atom
pairs [30]. After each superposition, the worst pair is eliminated and a new superposition is
performed, leading eventually to the identification of the protein core that shows a
significant degree of similarity.
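The iterative reduction can be sketched as a simple loop: superpose, drop the pair with the largest residual distance, and repeat. The code below is only an illustration, assuming NumPy and an initial list of equivalent Cα pairs. The stopping rule is not specified in the text, so the rmsd_cutoff and min_pairs parameters are hypothetical choices, and the function names superpose and trim_to_core are invented for this example rather than taken from [30].

```python
import numpy as np

def superpose(P, Q):
    """Return P rotated and translated onto Q (SVD-based, Kabsch-style)."""
    Pc, Qc = P - P.mean(axis=0), Q - Q.mean(axis=0)
    U, _, Vt = np.linalg.svd(Pc.T @ Qc)
    # Avoid an improper rotation (reflection).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return Pc @ R.T + Q.mean(axis=0)

def trim_to_core(P, Q, rmsd_cutoff=1.0, min_pairs=3):
    """Iteratively discard the worst-fitting pair until the rmsd is small.

    P, Q: (N, 3) arrays of paired coordinates (row i matches row i).
    rmsd_cutoff, min_pairs: hypothetical stopping parameters.
    Returns the indices of the retained pairs and their final rmsd.
    """
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    keep = np.arange(len(P))
    while True:
        fitted = superpose(P[keep], Q[keep])
        dist = np.linalg.norm(fitted - Q[keep], axis=1)
        rmsd = float(np.sqrt(np.mean(dist ** 2)))
        if rmsd <= rmsd_cutoff or len(keep) <= min_pairs:
            return keep, rmsd
        # Eliminate the single worst pair, then superpose again.
        keep = np.delete(keep, np.argmax(dist))
```

A fixed rmsd threshold is used here only because it is easy to state; any test of whether the remaining core is significantly similar could take its place.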
Finally, we mention that the rmsd distance does not allow the costing of gaps. For
this reason, it cannot be used directly for finding an optimum alignment between two
arbitrary proteins.