Concepts of Similarity in Bioinformatics - Essays in Bioinformatics

where d is the distance between each of the N pairs of equivalent atoms in two optimally

superposed structures. For the calculation of rmsd a range of alignment has to be defined

within which the matching of atoms (establishment of equivalent atoms within the two

structures) is determined, which is a computationally much harder problem than the

alignment of sequences in one dimension. Once the equivalence of atoms is established, the

optimal superposition has to be found, which is carried out by straightforward

algorithms such as that of Kabsch [26].
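The superposition step for fixed equivalences can be sketched as follows. This is a minimal illustration, assuming NumPy is available and that the two structures are given as N x 3 coordinate arrays whose rows are already paired; the function name `kabsch_rmsd` is hypothetical and the code is an SVD-based sketch of a Kabsch-style fit, not the implementation of [26].

```python
import numpy as np


def kabsch_rmsd(P, Q):
    """Return the rmsd of P onto Q after optimal rigid-body superposition.

    P and Q are N x 3 arrays; row i of P is assumed equivalent to row i of Q.
    """
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    if P.shape != Q.shape or P.ndim != 2 or P.shape[1] != 3:
        raise ValueError("P and Q must both have shape (N, 3)")

    # Remove the translation by moving both centroids to the origin.
    Pc = P - P.mean(axis=0)
    Qc = Q - Q.mean(axis=0)

    # Covariance matrix between the paired coordinates and its SVD.
    H = Pc.T @ Qc
    U, _, Vt = np.linalg.svd(H)

    # Guard against an improper rotation (a reflection).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0.0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T

    # Rotate P onto Q and measure the per-pair distances d_i.
    diff = Pc @ R.T - Qc
    return float(np.sqrt((diff ** 2).sum() / len(P)))


# Usage: a rotated and translated copy of a structure should give rmsd close to 0.
Q_demo = np.array([[0, 0, 0], [1, 0, 0], [0, 2, 0], [0, 0, 3], [1, 1, 1]], float)
angle = 0.7
Rz = np.array([[np.cos(angle), -np.sin(angle), 0.0],
               [np.sin(angle), np.cos(angle), 0.0],
               [0.0, 0.0, 1.0]])
P_demo = Q_demo @ Rz.T + np.array([5.0, -3.0, 2.0])
print(kabsch_rmsd(P_demo, Q_demo))
```

The sketch only covers the easy half of the problem: it presupposes that the pairing of atoms is known, which is exactly the part the text describes as computationally hard.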

If the equivalences are fixed, then rmsd can be considered as a simple distance that

can be computed with a straightforward algorithm. This is the case for instance when one

compares different conformations of the same protein such as produced by NMR methods.

In this case the equivalences of the atoms are a priori known, since each conformation

consists of the same atoms. The rmsd is 0 for identical structures (identical conformations)

while its value increases as the two structures become more divergent. In fact rmsd values

are considered reliable indicators of structural variability when applied to very similar

proteins (say rmsd < 5-6 Å). But even in this case, the rmsd value obviously depends on the

number of residues N included in the structural alignment. A statistical analysis of a large

number of structures showed that the dependence can be described as:

\mathrm{rmsd} = \mathrm{rmsd}_{100} \left( 1 + \ln \sqrt{N/100} \right)    [7]

where rmsd_100 is a constant, an rmsd value standardized to 100 residues [27]. The rmsd

values also depend on the crystallographic resolution, which is more difficult to take into

consideration (Carugo, 2002). As a result, rmsd does not behave as a metric distance for

divergent structures so it cannot be used in itself for automated clustering. Clearly, an rmsd

value of, say, 3 Å has a different significance for proteins of 500 residues than for those of

50 residues, so, e.g., the structural variability of fold types cannot be easily compared

(rmsd_100, on the other hand, may be useful for such comparisons [27]). In other terms, rmsd is

a good indicator for structural identity, but less so for structural divergence.
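The size dependence can be made concrete with a small calculation. The sketch below assumes that the normalization has the commonly quoted form rmsd_100 = rmsd / (1 + ln sqrt(N/100)); both this form and the function name `rmsd100` are assumptions made for illustration, and the original reference [27] should be consulted for the range of N over which the relation was fitted.

```python
import math


def rmsd100(rmsd, n_residues):
    """Rescale an rmsd value to a hypothetical 100-residue alignment.

    Assumes rmsd_100 = rmsd / (1 + ln(sqrt(N / 100))).
    """
    if n_residues <= 0:
        raise ValueError("n_residues must be positive")
    scale = 1.0 + math.log(math.sqrt(n_residues / 100.0))
    if scale <= 0.0:
        # The assumed formula breaks down for very short alignments (N below about 14).
        raise ValueError("alignment too short for this normalization")
    return rmsd / scale


# The same raw rmsd of 3 Å, measured over alignments of different length.
for n in (50, 100, 500):
    print(n, round(rmsd100(3.0, n), 2))
```

Under this assumption, a raw value of 3 Å is rescaled upward for the 50-residue protein and downward for the 500-residue one, which is the sense in which the same rmsd has a different significance at different sizes.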

The algorithms for calculating rmsd are beyond our scope; the reader is referred to

recent reviews [28]. The philosophy of the calculation depends on whether or not the

alignment, i.e. the equivalences between residues (represented as Cα atoms), is known. If

it is, the very popular algorithm of Kabsch [26] and McLachlan (1978) can be used. If this is not the

case, and when the two 3-D models that are compared are too different, there are two

alternatives. Either a partial alignment is available or no a priori assumptions can be made.

In the first case, a few equivalences between atom pairs are assumed and they are extended

(and sometimes rejected) through dynamic programming techniques [29]. In the other case,

an exhaustive search is performed by rotating and translating one 3-D model over the other in

a six-dimensional search (Diederichs, 1995).

It has to be noted that superposition of divergent protein 3-D structures is often a

quite arbitrary exercise and various superposition algorithms may lead to completely

different results. An effective, recently proposed procedure to reconcile different structural

alignment procedures consists of an iterative reduction of the number of aligned Cα atom

pairs [30]. After each superposition, the worst pair is eliminated and a new superposition is

performed leading, eventually, to the identification of the protein core that shows a

significant degree of similarity.
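The iterative reduction described above can be sketched as a simple loop. This is an illustration under stated assumptions, not the procedure of [30]: NumPy is assumed, the function names `pair_distances` and `prune_to_core` are hypothetical, and the stopping rule (an rmsd cutoff plus a minimum number of pairs) is an arbitrary stand-in for whatever similarity criterion the original method applies.

```python
import numpy as np


def pair_distances(P, Q):
    """Distances between paired rows of P and Q after optimal superposition."""
    Pc = P - P.mean(axis=0)
    Qc = Q - Q.mean(axis=0)
    U, _, Vt = np.linalg.svd(Pc.T @ Qc)
    # Avoid an improper rotation (reflection).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0.0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return np.linalg.norm(Pc @ R.T - Qc, axis=1)


def prune_to_core(P, Q, cutoff=1.0, min_pairs=3):
    """Drop the worst-fitting pair after each superposition.

    Returns the indices of the retained pairs and their rmsd. The cutoff
    (in the units of the coordinates) and min_pairs are assumed values.
    """
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    keep = np.arange(len(P))
    while True:
        dist = pair_distances(P[keep], Q[keep])
        rmsd = float(np.sqrt(np.mean(dist ** 2)))
        if rmsd <= cutoff or len(keep) <= min_pairs:
            return keep, rmsd
        # Eliminate the pair with the largest distance, then superpose again.
        keep = np.delete(keep, np.argmax(dist))


# Usage: eight paired points, one of which has been displaced by 10 units.
Q_demo = np.array([[0, 0, 0], [4, 0, 0], [0, 5, 0], [0, 0, 6],
                   [4, 5, 0], [4, 0, 6], [0, 5, 6], [4, 5, 6]], float)
P_demo = Q_demo + np.array([2.0, -1.0, 3.0])
P_demo[3] += np.array([10.0, 0.0, 0.0])
core, core_rmsd = prune_to_core(P_demo, Q_demo, cutoff=1e-6)
print(core, core_rmsd)
```

Because every round repeats the superposition on the reduced set, a pair that looked poor only because an outlier dragged the fit can still survive into the core.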

Finally we mention that the rmsd distance does not allow the costing of gaps. For

this reason, it cannot be used directly for finding an optimum alignment between two

arbitrary proteins.
