(1 node). Finally, the length of the segment connecting i and j is encoded in
11 input units corresponding to sequence separations 6, 7, 8, 9, 10-14, 15-19,
20-24, 25-29, 30-39, 40-49, >49.
(b) This global information includes amino acid composition of the entire protein
(20 units), secondary structure composition (3 units) and protein length (4 units,
lengths 1-60, 61-120, 121-240 and >240).
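As an illustration of the two encodings just described, the following Python sketch bins a sequence separation into the 11 units of (a) and assembles the 27 global units of (b). The one-hot layout, the all-zero vector returned for separations below 6, the three-state alphabet H/E/C, and all function names are assumptions made for this example; the text fixes only the bin boundaries and the unit counts.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # 20 standard residues
SS_STATES = "HEC"                       # assumed helix / strand / coil alphabet

# Bin boundaries as listed in the text; None marks an open upper end.
SEPARATION_BINS = [(6, 6), (7, 7), (8, 8), (9, 9), (10, 14), (15, 19),
                   (20, 24), (25, 29), (30, 39), (40, 49), (50, None)]
LENGTH_BINS = [(1, 60), (61, 120), (121, 240), (241, None)]


def one_hot_bin(value, bins):
    """Return a one-hot list marking the bin containing value (all zeros if none)."""
    return [1 if value >= lo and (hi is None or value <= hi) else 0
            for lo, hi in bins]


def encode_separation(i, j):
    """11 units for the sequence separation |i - j| of residues i and j."""
    return one_hot_bin(abs(i - j), SEPARATION_BINS)


def encode_global(sequence, secondary_structure):
    """27 units: amino acid composition (20), secondary structure
    composition (3) and one-hot protein length bin (4)."""
    n = len(sequence)
    if n == 0 or len(secondary_structure) != n:
        raise ValueError("inputs must be non-empty and of equal length")
    aa_comp = [sequence.count(a) / n for a in AMINO_ACIDS]
    ss_comp = [secondary_structure.count(s) / n for s in SS_STATES]
    return aa_comp + ss_comp + one_hot_bin(n, LENGTH_BINS)
```

For example, `encode_separation(10, 22)` activates the fifth unit (the 10-14 bin), and a protein of 80 residues activates the second length unit (61-120).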
2.5.3.3 SAM-T06con
This NN contact predictor is included in the protein structure prediction architecture
SAM-T06. The implementation of the contact predictor and its performance are
described in [ 45 ].
In total, the input encoding of the NN requires 449 units. The local information
(a) is captured by taking a window of length five centered on each of the two
residues. Four distinct paired-residue statistics are used (b), and only the length of
the protein is taken into account as global information (c).
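The total of 449 units can be checked against the per-component counts itemized in (a) to (c) below. The short Python tally here is only a bookkeeping sketch; the variable names are chosen for illustration.

```python
WINDOWS = 2          # one window per residue of the pair
WINDOW_LENGTH = 5    # positions per window

# (a) Local information
per_position = 20 + 13 + 11   # amino acid distribution, secondary structure, burial
local_units = WINDOWS * WINDOW_LENGTH * per_position
local_units += WINDOWS * 1    # entropy of the distribution, one unit per window
local_units += 1              # logarithm of the sequence separation

# (b) Paired-residue statistics: three use 1 unit, mutual information uses 2
paired_units = 3 * 1 + 2

# (c) Global information: logarithm of the protein length
global_units = 1

total_units = local_units + paired_units + global_units
print(local_units, paired_units, global_units, total_units)  # prints: 443 5 1 449
```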
(a) For each position in the two windows, the NN input encodes the amino acid
distribution according to a Dirichlet mixture regularizer [ 46 ] (20 units), the
predicted secondary structure and the predicted burial [ 25 ] (13 and 11 units,
respectively). Moreover, the entropy of the amino acid distribution (1 unit
for each window) and the logarithm of the sequence separation between the two
residues (1 unit) are included.
(b) The NN input encodes four paired-residue statistics (1 input unit for three of
them and 2 for the last one). The simplest statistic counts the number of
different pairs observed in the MSA columns corresponding to the two residues.
The other statistics are the joint entropy ( 2.9 ), the propensity of contact,
and a mutual information-based statistic ( 2.8 ). For these last three measures,
the logarithm of the rank of the statistic's value is used as input, except
for the mutual information, for which both the logarithm of the rank and the
exact value are added. The rank of a statistic's value is its rank in the list of
values for all pairs of columns.
The propensity for two residues to be in contact is the log odds of a contact
between the residues vs. the probability of the residues occurring independently.
This measure has been slightly modified in order to give more weight to
high-separation than to low-separation contacts. Here the mutual information
statistic is introduced through its p-value (i.e., the probability of seeing a
mutual information at least as large as the observed one by chance). The
significance of the mutual information performs better in contact prediction than
the statistic itself, as computed in ( 2.8 ). More detailed information about the
propensity of contact and the mutual information-based statistic can be found in [ 45 ].
(c) The only global information added is the logarithm of the length of the protein
(1 unit).
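The paired-residue statistics in (b) can be sketched in Python as follows. This is a minimal illustration under several assumptions: the joint entropy and mutual information use the standard Shannon definitions, which are presumed to match ( 2.9 ) and ( 2.8 ); rank 1 is given to the largest value; and the p-value is estimated with a simple column-shuffling test, which is a stand-in and not necessarily the null model used in [ 45 ]. All function names are hypothetical, and gap handling and sequence weighting are ignored.

```python
import math
import random
from collections import Counter


def distinct_pairs(col_a, col_b):
    """Number of different residue pairs observed in two MSA columns."""
    return len(set(zip(col_a, col_b)))


def joint_entropy(col_a, col_b):
    """Shannon entropy (bits) of the observed pair distribution."""
    n = len(col_a)
    counts = Counter(zip(col_a, col_b))
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mutual_information(col_a, col_b):
    """Mutual information (bits) between two MSA columns."""
    n = len(col_a)
    pa, pb = Counter(col_a), Counter(col_b)
    pab = Counter(zip(col_a, col_b))
    return sum((c / n) * math.log2(c * n / (pa[a] * pb[b]))
               for (a, b), c in pab.items())


def log_ranks(values):
    """Log of each value's rank among all column pairs (assumed: rank 1 = largest)."""
    order = sorted(range(len(values)), key=lambda k: values[k], reverse=True)
    result = [0.0] * len(values)
    for rank, k in enumerate(order, start=1):
        result[k] = math.log(rank)
    return result


def mi_p_value(col_a, col_b, n_shuffles=200, seed=0):
    """Estimate the chance of an MI at least as large as observed
    by shuffling one column (assumed null model)."""
    rng = random.Random(seed)
    observed = mutual_information(col_a, col_b)
    shuffled = list(col_b)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        if mutual_information(col_a, shuffled) >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_shuffles + 1)
```

In this sketch, two perfectly covarying columns such as `"AAAAAAGGGGGG"` and `"CCCCCCTTTTTT"` give a small p-value, while a fully conserved column gives a mutual information of zero and a p-value of 1.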