Comparison of sequences, protein 3D structures and genomes - Essays in Bioinformatics


distance of vectors, equation [2]). The optimal structural alignment can be determined by a

dynamic programming algorithm.

A roughly similar approach was used by Holm and Sander for the very popular
DALI server [42]. In the underlying method the Cα atoms are characterized by vectors
whose parameters are the elements of a distance matrix. The local vectors are then
compared in terms of residue similarity scores such as

$$\phi^{R}_{ij} = \theta^{R} - \left| d^{A}_{ij} - d^{B}_{ij} \right| \qquad [4]$$

$$\phi^{E}_{ij} = \left( \theta^{E} - \frac{\left| d^{A}_{ij} - d^{B}_{ij} \right|}{\bar{d}_{ij}} \right) \exp\!\left( -\frac{\bar{d}_{ij}^{\,2}}{\alpha^{2}} \right) \qquad [5]$$

The labels A and B refer to residues in structures A and B, respectively. $d_{ij}$ are the
elements of the hexapeptide distance matrices, i.e. elements of the residue vectors.
$\bar{d}_{ij}$ denotes the average of $d^{A}_{ij}$ and $d^{B}_{ij}$; $\theta^{R}$, $\theta^{E}$ and
$\alpha$ are constants. Superscript R denotes rigid comparison [eqn. 4]; E refers to an
elastic comparison dampened by a negative exponential term [eqn. 5]. As can be seen,
summing the residue similarity measures $\phi^{R}$ or $\phi^{E}$ results in quantities related
to the city block distance. Comparison of two proteins A and B is then carried out using a
distance matrix whose elements are equal to either $\phi^{R}(i,j)$ or $\phi^{E}(i,j)$, where i and j
refer to two pairs of structurally aligned residues: i(A), i(B), j(A), and j(B). The
optimization task is to find the set of equivalences between A and B that maximizes the
sum of these elements. The structural alignment is obtained by an optimization algorithm
(Monte Carlo optimization). To improve convergence, various heuristics are used to obtain
a reasonable starting point.
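The elastic score and its summation over a set of equivalences can be illustrated with a minimal Python sketch. It assumes the elastic score of eqn. 5 has the form (θE − |dA − dB| / d̄) · exp(−(d̄/α)²), where d̄ is the mean of the two distances. The function names, the default values θE = 0.2 and α = 20 Å, and the convention of returning θE when both distances are zero (i = j) are assumptions made for this illustration; none of them is stated in the text, and the Monte Carlo search itself is not shown.

```python
import math


def distance_matrix(coords):
    """Intramolecular Calpha-Calpha distance matrix from a list of (x, y, z) tuples."""
    return [[math.dist(p, q) for q in coords] for p in coords]


def dali_elastic_score(d_a, d_b, theta_e=0.2, alpha=20.0):
    """Elastic similarity of one pair of intramolecular distances (form of eqn. 5).

    theta_e and alpha are assumed default values, not taken from the text.
    """
    d_mean = 0.5 * (d_a + d_b)
    if d_mean == 0.0:
        # Assumed convention for i == j: no deviation is possible, score theta_e.
        return theta_e
    relative_deviation = abs(d_a - d_b) / d_mean
    envelope = math.exp(-(d_mean / alpha) ** 2)  # damps long-range pairs
    return (theta_e - relative_deviation) * envelope


def dali_total_score(dist_a, dist_b, equivalences, theta_e=0.2, alpha=20.0):
    """Sum the elastic score over all pairs (i, j) of aligned residues.

    equivalences is a list of (index_in_A, index_in_B) tuples.
    """
    total = 0.0
    for i_a, i_b in equivalences:
        for j_a, j_b in equivalences:
            total += dali_elastic_score(
                dist_a[i_a][j_a], dist_b[i_b][j_b], theta_e, alpha
            )
    return total


if __name__ == "__main__":
    straight = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
    bent = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0)]
    pairs = [(0, 0), (1, 1), (2, 2)]
    d_straight = distance_matrix(straight)
    d_bent = distance_matrix(bent)
    print(dali_total_score(d_straight, d_straight, pairs))
    print(dali_total_score(d_straight, d_bent, pairs))
```

An optimizer would call `dali_total_score` repeatedly while proposing changes to `equivalences`; identical distance patterns give the highest total, and distorted ones lower it.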

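The residue similarity score of Levitt and Gerstein [43] has the formula

$$S_{ij} = \frac{M}{1 + \left( d_{ij} / d_{0} \right)^{2}} \qquad [6]$$

where $d_{ij}$ is the distance between Cα atoms of the two structures compared, and M and
$d_{0}$ are constants. The $S_{ij}$ values are elements of a similarity matrix from which an
optimizable substructure similarity measure $S_{str}$ can be calculated by introducing gaps.
The $S_{str}$ score is defined as

$$S_{str} = M \left( \sum \frac{1}{1 + \left( d_{ij} / d_{0} \right)^{2}} - \frac{N_{gap}}{2} \right) \qquad [7]$$

where the sum runs over the aligned residue pairs and $N_{gap}$ is the number of gaps in the
alignment.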
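As a worked illustration of the two scores, the following Python sketch evaluates S_ij for individual Cα distances and S_str for a given alignment, assuming S_ij = M / (1 + (d_ij/d_0)²) and S_str = Σ S_ij − (M/2) · N_gap. The function names and the default values M = 20 and d_0 = 5 Å are assumptions for this example and are not given in the text.

```python
def lg_pair_score(d_ij, m=20.0, d0=5.0):
    """Similarity of one aligned residue pair separated by d_ij (form of eqn. 6).

    m and d0 are assumed default constants, not taken from the text.
    """
    return m / (1.0 + (d_ij / d0) ** 2)


def lg_structure_score(aligned_distances, n_gap, m=20.0, d0=5.0):
    """Substructure score (form of eqn. 7): summed pair scores minus M/2 per gap."""
    pair_sum = sum(lg_pair_score(d, m, d0) for d in aligned_distances)
    return pair_sum - 0.5 * m * n_gap


if __name__ == "__main__":
    # Two aligned pairs: one perfectly superposed, one 5 Angstrom apart, one gap.
    print(lg_pair_score(0.0))                  # perfectly superposed pair scores M
    print(lg_pair_score(5.0))                  # a pair at d0 scores M / 2
    print(lg_structure_score([0.0, 5.0], 1))
```

A pair at distance zero contributes the full M, a pair at d_0 contributes half of it, and each gap costs as much as one pair at d_0 gains.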

The structural alignment is carried out with a dynamic programming method such as

the Smith-Waterman algorithm. Levitt and Gerstein found that random structural

similarities determined by this method follow the same extreme value distribution as

BLAST scores (or Smith-Waterman sequence alignment scores), so the results can be

characterized in terms of P values [43].
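The dynamic programming step can be sketched in Python as a Smith-Waterman pass over a precomputed similarity matrix. This is a simplified illustration, not the published procedure: it assumes a linear penalty per skipped residue (the default of 10.0 corresponds to M/2 with an assumed M = 20), performs a single pass without re-superposing the structures, and uses a hypothetical function name.

```python
def smith_waterman(sim, gap=10.0):
    """Local alignment over a similarity matrix sim[i][j].

    Rows index residues of structure A, columns residues of structure B.
    gap is an assumed linear penalty per skipped residue.
    Returns (best_score, aligned_pairs) with 0-based residue indices.
    """
    n, m = len(sim), len(sim[0])
    h = [[0.0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0.0, (0, 0)

    # Fill the score table; a cell never drops below zero (local alignment).
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            h[i][j] = max(
                0.0,
                h[i - 1][j - 1] + sim[i - 1][j - 1],  # align residue i with j
                h[i - 1][j] - gap,                    # skip a residue of A
                h[i][j - 1] - gap,                    # skip a residue of B
            )
            if h[i][j] > best:
                best, best_pos = h[i][j], (i, j)

    # Trace back from the best cell until the local alignment starts.
    pairs = []
    i, j = best_pos
    while i > 0 and j > 0 and h[i][j] > 0.0:
        if h[i][j] == h[i - 1][j - 1] + sim[i - 1][j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif h[i][j] == h[i - 1][j] - gap:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return best, pairs


if __name__ == "__main__":
    # Structure B carries one extra residue (column 1) relative to A.
    sim = [
        [20.0, 1.0, 1.0, 1.0],
        [1.0, 1.0, 20.0, 1.0],
        [1.0, 1.0, 1.0, 20.0],
    ]
    print(smith_waterman(sim))
```

In the toy matrix the extra residue of B is skipped at the cost of one gap penalty, so the three strongly similar pairs are still recovered.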

As superposition methods are compute intensive, a number of simplified
representations have been developed. One general strategy is to represent the protein by a
set of secondary structure elements (SSEs), which are characterized by their position within
the polypeptide sequence and their position in 3D space, and are usually represented as
vectors fit to the Cα atoms. This is another kind of entity-relationship description in which
SSEs are the nodes and a variety of parameters (such as distances, angles, etc.) are used to describe

relationships. The rationale is that superposition of a few SSEs is less compute intensive
