Divide and Conquer Strategies for Protein Structure Prediction - Mathematical Approaches to Polymer Sequence Analysis and Related Problems

Biomedical Engineering Reference


(1 node). Finally, the length of the segment connecting i and j is encoded in

11 input units corresponding to sequence separations 6, 7, 8, 9, 10-14, 15-19,

20-24, 25-29, 30-39, 40-49, >49.

(b) This global information includes amino acid composition of the entire protein

(20 units), secondary structure composition (3 units) and protein length (4 units,

lengths 1-60, 61-120, 121-240 and >240).
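The binned encodings above can be illustrated with a short sketch. It assumes that each group of units is a one-hot vector (exactly one unit active per value), which the text does not state explicitly; the names SEP_BIN_STARTS, LEN_BIN_STARTS and one_hot_bin are hypothetical and serve only this illustration.

```python
from bisect import bisect_right

# Lower bounds of the 11 sequence-separation bins:
# 6, 7, 8, 9, 10-14, 15-19, 20-24, 25-29, 30-39, 40-49, >49
SEP_BIN_STARTS = [6, 7, 8, 9, 10, 15, 20, 25, 30, 40, 50]

# Lower bounds of the 4 protein-length bins: 1-60, 61-120, 121-240, >240
LEN_BIN_STARTS = [1, 61, 121, 241]


def one_hot_bin(value, starts):
    """Return a one-hot list marking the bin that contains `value`.

    Assumption: one unit per bin, with exactly one unit set to 1.
    """
    if value < starts[0]:
        raise ValueError(f"value {value} is below the first bin ({starts[0]})")
    # bisect_right counts the bin starts that are <= value.
    index = bisect_right(starts, value) - 1
    vector = [0] * len(starts)
    vector[index] = 1
    return vector


separation_units = one_hot_bin(12, SEP_BIN_STARTS)  # falls in the 10-14 bin
length_units = one_hot_bin(150, LEN_BIN_STARTS)     # falls in the 121-240 bin
```

Storing only the lower bound of each bin keeps the open-ended last bins (>49 and >240) free of special cases.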

2.5.3.3 SAM-T06con

This NN contact predictor is included in the protein structure prediction architecture

SAM-T06. The implementation of the contact predictor and its performance are

described in [ 45 ].

In total, the input encoding of the NN requires 449 units. The local information

(a) is captured by taking a window of length five centered on each of the two

residues. Four distinct paired-residue statistics are used (b) and just the length of the

protein is taken into account as global information (c).
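The total of 449 units can be checked against the unit counts itemized in (a) to (c) of this section. The short tally below is only an arithmetic check; the variable names are chosen for this illustration.

```python
# Local information (a): two windows of five positions each.
WINDOW_LENGTH = 5
N_WINDOWS = 2
PER_POSITION_UNITS = {
    "amino_acid_distribution": 20,
    "predicted_secondary_structure": 13,
    "predicted_burial": 11,
}

local_units = WINDOW_LENGTH * N_WINDOWS * sum(PER_POSITION_UNITS.values())
local_units += N_WINDOWS  # entropy of the distribution, 1 unit per window
local_units += 1          # logarithm of the sequence separation

# Paired-residue statistics (b): 1 unit for three of them, 2 for the last.
paired_units = 3 * 1 + 2

# Global information (c): logarithm of the protein length.
global_units = 1

total_units = local_units + paired_units + global_units
print(local_units, paired_units, global_units, total_units)  # 443 5 1 449
```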

(a) For each position in the two windows, the NN input encodes the amino acid

distribution according to a Dirichlet mixture regularizer [ 46 ] (20 units), the

predicted secondary structure and predicted burial [ 25 ] (13 and 11 units,

respectively). Moreover, the entropy of the amino acid distribution (1 unit

for each window) and the logarithm of the sequence separation between the two

residues (1 unit) are included.

(b) The NN input encodes four paired-residue statistics (1 input unit for three of

them and 2 for the last one). The simplest statistic counts the number of

different pairs observed in the MSA columns corresponding to the two residues.

Other statistics considered are the joint entropy ( 2.9 ), the propensity of contact,

and a mutual information-based statistic ( 2.8 ). For these last three measures,

the logarithm of the rank of the statistic's value is taken into the input, except

for the mutual information for which both the logarithm of the rank and the

exact value are added. The rank of a statistic value is computed as the rank of

the value in the list of values for all pairs of columns.

The propensity for two residues to be in contact is the log odds of a contact

between the residues vs. the probability of the residues occurring independently.

This measure has been slightly modified in order to give more weight to high-separation

contacts than to low-separation contacts. Here the mutual information

statistic is introduced through its p-value (i.e., the probability of seeing

the observed mutual information by chance). The significance of the mutual

information performs better in contact prediction than the statistic

itself, as computed in ( 2.8 ). More detailed information about the propensity of

contact and the mutual information-based statistics can be found in [ 45 ].

(c) The only global information added is the logarithm of the length of the protein

(1 unit).
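The simpler paired-residue statistics in (b) can be sketched as follows. This is an illustration under stated assumptions, not the SAM-T06con implementation: each MSA column is given as a gap-free string with one character per aligned sequence, the joint entropy uses the standard definition with natural logarithms (assumed to correspond to ( 2.9 ), which is not reproduced here), and rank 1 is assigned to the largest value, a direction the text does not specify. All function names are hypothetical.

```python
import math
from collections import Counter


def pair_counts(col_i, col_j):
    """Count each (residue_i, residue_j) pair observed across the sequences."""
    return Counter(zip(col_i, col_j))


def n_distinct_pairs(col_i, col_j):
    """Number of different residue pairs observed in the two columns."""
    return len(pair_counts(col_i, col_j))


def joint_entropy(col_i, col_j):
    """Standard joint entropy, -sum p(a, b) * ln p(a, b), over observed pairs."""
    counts = pair_counts(col_i, col_j)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def log_ranks(values):
    """Map each column pair to the log of its rank among all column pairs.

    Assumption: rank 1 is the largest value; ties are broken by sort order.
    """
    ordered = sorted(values, key=values.get, reverse=True)
    return {key: math.log(rank) for rank, key in enumerate(ordered, start=1)}


# Toy alignment with four sequences and three columns (made-up data).
columns = {0: "AAGG", 1: "CCTT", 2: "CTCT"}
entropies = {
    (i, j): joint_entropy(columns[i], columns[j])
    for i in columns for j in columns if i < j
}
ranked = log_ranks(entropies)
```

Ranking over all column pairs makes the input depend on how a pair compares with the rest of the alignment rather than on the raw scale of the statistic.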
