Comparison of sequences, protein 3D structures and genomes - Essays in Bioinformatics

than superposing a large number of Cα atoms, so one can use algorithms that could not cope

with large atomic detail structures. In addition, SSEs incorporate added knowledge on

molecular geometry. The success of the process depends on i) how secondary structures are

assigned; ii) how the similarity between two secondary structural elements of two proteins

is estimated; iii) how the overall similarity between the two proteins is defined.

Although the SSEs (at least the most common like helices and strands) are clearly

defined, different assignments result from different assignment algorithms [44-46].

Consequently, different representations of the protein structures may arise. A further

problem is which SSE types are considered. Very often a two-state classification is used:

helix, including 3/10 and pi, and strand. There are nevertheless exceptions. Orengo et al.

[44-46], for example, adopt a three-state classification: alpha-helix, 3/10-helix, and strand.
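
The effect of the classification scheme on the resulting representation can be illustrated with a minimal sketch. It assumes a per-residue assignment string in the DSSP one-letter convention (H alpha-helix, G 3/10-helix, I pi-helix, E strand); the mappings, the function name `sse_segments`, and the minimum segment length are illustrative assumptions, not the procedure of any of the cited programs.

```python
from itertools import groupby

# Assumed mappings from DSSP-style one-letter codes to reduced classes.
# Two-state: every helix type is merged into "H"; strands stay "E".
TWO_STATE = {"H": "H", "G": "H", "I": "H", "E": "E"}
# Three-state: alpha-helix, 3/10-helix and strand are kept apart
# (pi-helices are ignored here, which is an assumption of this sketch).
THREE_STATE = {"H": "H", "G": "G", "E": "E"}

def sse_segments(dssp, mapping, min_len=3):
    """Return (type, first_residue, last_residue) for each run of one class."""
    reduced = [mapping.get(code, "-") for code in dssp]
    segments = []
    pos = 0
    for state, run in groupby(reduced):
        length = len(list(run))
        # Keep only real SSEs that are long enough to be meaningful.
        if state != "-" and length >= min_len:
            segments.append((state, pos, pos + length - 1))
        pos += length
    return segments

dssp = "--HHHHHHGGG--EEEEE--"
print(sse_segments(dssp, TWO_STATE))    # one merged helix, one strand
print(sse_segments(dssp, THREE_STATE))  # helix, 3/10-helix, strand
```

The same residue string yields two elements under the two-state scheme and three under the three-state scheme, which is why the choice of classification changes the representation that is later compared.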

The similarity between secondary structural elements in two proteins is usually

estimated by comparing each pair of SSEs of one protein with each pair of the other. The

3D arrangement of two secondary structural elements in a protein is usually defined by

their distance, their plane angle, and their torsion. A similarity score can then be computed

for each pair of two secondary structural elements. The resulting matrix of similarity scores

can then be analyzed with dynamic programming techniques [41,47-49], treated as a

maximum clique problem [50], or processed with pseudo-distance matrices [51] or cluster analysis

[52]. The alignment of the secondary structural elements is eventually followed by a

superposition of the Cα atoms with an initial structural alignment that depends on the

secondary structure alignment. The overall similarity between the two structures can be

then estimated on the basis of the rmsd values [50] or with more sophisticated figures of

merit that also consider the quality of the secondary structure fit.
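
The geometric description of an SSE pair can be sketched as follows. Each SSE is assumed to be reduced to a line segment given by its two endpoints; the distance is taken between segment midpoints, the angle between the two axes, and the torsion as the signed dihedral of the two axes about the line joining the midpoints. These definitions, the function names, and the scale parameters of the score are assumptions chosen for illustration; the cited methods differ in their exact definitions.

```python
import math

def sub(a, b): return tuple(x - y for x, y in zip(a, b))
def dot(a, b): return sum(x * y for x, y in zip(a, b))
def norm(a): return math.sqrt(dot(a, a))
def midpoint(a, b): return tuple((x + y) / 2 for x, y in zip(a, b))
def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def pair_descriptor(sse_a, sse_b):
    """sse = (start_point, end_point); returns (distance, angle, torsion) in A and degrees."""
    u = sub(sse_a[1], sse_a[0])                   # axis of the first SSE
    v = sub(sse_b[1], sse_b[0])                   # axis of the second SSE
    w = sub(midpoint(*sse_b), midpoint(*sse_a))   # midpoint-to-midpoint vector
    distance = norm(w)
    cos_angle = dot(u, v) / (norm(u) * norm(v))
    angle = math.degrees(math.acos(max(-1.0, min(1.0, cos_angle))))
    if distance == 0.0:
        return distance, angle, 0.0               # torsion undefined; report 0
    # Signed dihedral of u and v about w.
    n1, n2 = cross(w, u), cross(w, v)
    w_hat = tuple(x / distance for x in w)
    torsion = math.degrees(math.atan2(dot(w_hat, cross(n1, n2)), dot(n1, n2)))
    return distance, angle, torsion

def pair_similarity(desc1, desc2, d_scale=4.0, a_scale=30.0):
    """Hypothetical score in (0, 1]; 1 means identical pair geometry."""
    dd = (desc1[0] - desc2[0]) / d_scale
    da = (desc1[1] - desc2[1]) / a_scale
    dt = ((desc1[2] - desc2[2] + 180.0) % 360.0 - 180.0) / a_scale  # wrap angle
    return math.exp(-(dd * dd + da * da + dt * dt))

# Two perpendicular segments, 5 A apart along z.
pair = pair_descriptor(((-1, 0, 0), (1, 0, 0)), ((0, -1, 5), (0, 1, 5)))
print(pair, pair_similarity(pair, pair))
```

Evaluating `pair_similarity` for every SSE pair of one protein against every SSE pair of the other fills the kind of score matrix that the alignment step then operates on.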

The fragment-pair approach is also amenable to probabilistic interpretation. The

VAST program of Bryant and coworkers [53,54] provides BLAST-like significance (P)

values. VAST's elementary unit of comparison is a simplified rmsd score resulting from a

superposition of the endpoints of SSE pairs “trimmed” to the same length. First, rmsd values

are converted into log-odds scores using precomputed values of comparison of SSE pairs

from related and unrelated structures; then a combined score S_o is calculated from the i best

SSE pairs found to match between the query and a database entry. The principle of

converting S_o into a P value is similar to that used by BLAST, given in equations 15-17,

but relies on tabulated statistics rather than on analytical formulae. Let the probability of

finding a substructure of size i with a score S_i ≥ S_o be denoted as P(S_i ≥ S_o). In VAST, the

value of P(S_i ≥ S_o) is estimated as a function of i and S_i, using tabulated values resulting from

random comparisons. The expected number E of scores S_i ≥ S_o found by chance

will also depend on the size of the search space which can be defined as the total number of

possible common substructures of i SSEs between the two proteins, a number denoted by

N_i. The equation computed by VAST is then

$$E = \sum_{i} N_i \, P(S_i \ge S_o)$$

The sum is calculated for all i values using the tabulated P(S_i ≥ S_o) values. As

with BLAST, if E is small (e.g. E<0.01) it can also be read as a P value. The method is very fast, due to

the precomputed statistics, and accessible at the NCBI web site.
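
As a worked illustration of this calculation, the sketch below sums N_i times the tabulated probability over all substructure sizes i. The numbers in the two tables are invented for the example and are not VAST statistics; the function name is hypothetical, and the conversion P = 1 - exp(-E) assumes the usual Poisson argument that also underlies BLAST statistics.

```python
import math

def vast_like_e_value(n_by_size, p_by_size):
    """E = sum over i of N_i * P(S_i >= S_o).

    n_by_size: {i: N_i}, number of possible common substructures of i SSEs.
    p_by_size: {i: P(S_i >= S_o)}, looked up in precomputed tables.
    """
    return sum(n_by_size[i] * p_by_size[i] for i in n_by_size)

# Made-up search-space sizes and tabulated probabilities (illustration only).
n_by_size = {2: 100, 3: 300, 4: 500}
p_by_size = {2: 1e-5, 3: 2e-6, 4: 1e-6}

e_value = vast_like_e_value(n_by_size, p_by_size)
p_value = 1.0 - math.exp(-e_value)   # Poisson assumption; close to E when E is small
print(e_value, p_value)
```

With these illustrative inputs E is about 0.002, and the corresponding P value differs from it only in the sixth decimal place, which is why a small E can be reported directly as a P value.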

A variety of other procedures that represent the protein 3-D structure as an ensemble

of secondary structural elements have also been proposed. In Martin's approach [55],

secondary structural elements are given one of the letters of an alphabet that identify the

secondary structure type, direction, length, and solvent accessibility. Two proteins can be

thus compared with the simple Needleman-Wunsch algorithm. Murthy [56] used dynamic

programming techniques to optimally superpose secondary structural elements.
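
Once each protein is written as a string of SSE letters, a global alignment can be sketched with a textbook Needleman-Wunsch implementation. The two-letter alphabet (H for helix, E for strand) and the match, mismatch, and gap values below are assumptions made for the example; the alphabet of [55] is richer, since it also encodes direction, length, and solvent accessibility.

```python
def needleman_wunsch(a, b, match=2, mismatch=-1, gap=-2):
    """Global alignment of two SSE strings with a linear gap penalty.

    Returns (score, aligned_a, aligned_b), with '-' marking gaps.
    """
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    # Fill the dynamic programming matrix.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    i, j = n, m
    out_a, out_b = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + s:
                out_a.append(a[i - 1]); out_b.append(b[j - 1])
                i -= 1; j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))

# Hypothetical SSE strings: helix-strand-strand-helix versus helix-strand-helix.
print(needleman_wunsch("HEEH", "HEH"))
```

Because an SSE string is far shorter than the residue sequence it summarizes, the quadratic cost of the algorithm is negligible at this level of representation.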
