Divide and Conquer Strategies for Protein Structure Prediction - Mathematical Approaches to Polymer Sequence Analysis and Related Problems


when the position is outside the N/C-terminal region (1 if outside and 0 if not) and 1 unit accounts for the conservation weight at that position (see below for the definition). The output of the first-level NN consists of three nodes, one for each possible secondary structure element (helix, strand, or coil), corresponding to the state of the central residue in the window. The first-level NN thus classifies 13-residue-long protein segments according to the secondary structure class of their central residue. This classification does not reflect the fact that different segments can be correlated, being, for example, consecutive and overlapping in the protein sequence. In particular, at this level the NN has no knowledge of the correlation between secondary structure elements: it has no way to know, for example, that a helix consists of at least three consecutive elements.

2. The second level is introduced to take into account the correlation between consecutive secondary structure elements. The input of the second-level NN is compiled from the output of the first-level NN. For every residue position, the input encodes a window of 17 consecutive elements taken from the secondary structure prediction of the first NN. Every position in the window is encoded with 5 units: three for the predicted secondary structure, one to flag whether the position is outside the boundaries of the protein, and one for the conservation weight. The output is set up as in the first NN and, again, corresponds to the state of the central residue in the window.

3. The consensus is a simple arithmetic average over several (typically four) differently trained networks. The state whose output unit has the highest averaged value is taken as the final prediction. To every such prediction, a reliability index can be associated through the formula

$$\mathrm{RI} = \lceil 10\,(o_1 - o_2) \rceil, \tag{2.4}$$

where $o_1$ and $o_2$ are the highest and the second highest values in the output vector, respectively. The prediction obtained is finally filtered (with the help of the reliability index) in order to fix possibly unrealistic local predictions that neither the second-level NN nor the consensus was able to detect (in particular, alpha-helix segments that are too short).
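The second-level input encoding and the consensus step described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the original implementation: the function names are hypothetical, positions outside the protein are assumed to carry zeros in every unit except the "outside" flag, and the reliability index is assumed to be rounded up to an integer.

```python
import math

STATES = ("helix", "strand", "coil")


def second_level_window(first_level_out, cons_weights, center, width=17):
    """Encode the window centred on `center` as width * 5 input units.

    first_level_out: one (helix, strand, coil) output triple per residue.
    cons_weights:    one conservation weight per residue.
    """
    half = width // 2
    units = []
    for pos in range(center - half, center + half + 1):
        if 0 <= pos < len(first_level_out):
            h, e, c = first_level_out[pos]
            # 3 structure units, outside flag = 0, conservation weight
            units.extend([h, e, c, 0.0, cons_weights[pos]])
        else:
            # Assumption: beyond the termini only the outside flag is set.
            units.extend([0.0, 0.0, 0.0, 1.0, 0.0])
    return units


def consensus(network_outputs):
    """Arithmetic average of the three output units over several networks."""
    n = len(network_outputs)
    return [sum(out[k] for out in network_outputs) / n for k in range(3)]


def predict_with_reliability(avg):
    """Return the winning state and its reliability index."""
    ranked = sorted(avg, reverse=True)
    o1, o2 = ranked[0], ranked[1]
    state = STATES[avg.index(o1)]
    # Assumption: round up to an integer.
    ri = math.ceil(10 * (o1 - o2))
    return state, ri
```

For example, averaged outputs of (0.625, 0.25, 0.125) give the prediction "helix" with a reliability index of 4, since 10 × (0.625 − 0.25) = 3.75. The final filtering step is not sketched because the text does not specify its rules.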

The conservation weight scores positions in the MSA by their level of conservation: the more conserved a position, the higher its conservation weight. This weight is stored in the HSSP database and is defined by

$$\mathrm{CW}_i = \frac{\sum_{r,s=1}^{N} w_{rs}\,\mathrm{sim}_{rs}}{\sum_{r,s=1}^{N} w_{rs}} \tag{2.5}$$

with

$$w_{rs} = 1 - \frac{1}{100}\,\mathrm{ident}_{rs},$$

where $N$ is the number of sequences in the multiple alignment, $\mathrm{ident}_{rs}$ is the percentage of sequence identity (over the entire length) of sequences $r$ and $s$, and $\mathrm{sim}_{rs}$ is the value of the similarity between sequences $r$ and $s$ at position $i$ according to the Dayhoff similarity matrix [8].
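As a rough illustration of the conservation weight, the following Python sketch computes it for a single alignment column. Several things here are assumptions rather than part of the source: the function names are hypothetical, the sum runs over all ordered pairs (r, s) including r = s, gaps are not handled, and `toy_similarity` is a stand-in for the Dayhoff similarity matrix, which is not reproduced here.

```python
def toy_similarity(a, b):
    """Hypothetical stand-in for a Dayhoff matrix lookup (not the real scores)."""
    return 1.0 if a == b else 0.0


def conservation_weight(column, ident, sim=toy_similarity):
    """Weighted average pairwise similarity at one alignment position.

    column: residue of each of the N aligned sequences at position i.
    ident:  N x N matrix of percentage sequence identities (0..100).
    sim:    function returning the similarity of two residues.
    """
    n = len(column)
    numerator = 0.0
    denominator = 0.0
    for r in range(n):
        for s in range(n):
            # Pairs of highly identical sequences get a small weight.
            w = 1.0 - ident[r][s] / 100.0
            numerator += w * sim(column[r], column[s])
            denominator += w
    if denominator == 0.0:
        # All sequences are 100% identical: the weight is undefined.
        raise ValueError("all pair weights are zero")
    return numerator / denominator


# Three sequences that are pairwise 50% identical.
ident = [[100, 50, 50],
         [50, 100, 50],
         [50, 50, 100]]
cw_mixed = conservation_weight(["A", "A", "G"], ident)
cw_conserved = conservation_weight(["A", "A", "A"], ident)
```

Because the weight of a pair vanishes when the two sequences are 100% identical, each sequence paired with itself contributes nothing, and near-duplicate sequences barely influence the score. With the toy similarity, the fully conserved column scores higher than the mixed one.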
