unpublished data). SecCons can compare outputs of different secondary structure prediction
programs in one (text or graphical) view. The output files of SecCons were subsequently
converted to the input format of the neural network software. This finally resulted in a data
set with one text file for each of the 6000 proteins, containing both the ten predictions and
the true secondary structure.
For each protein the text file was converted to a Matlab script to make it suitable as
input for the neural network. In the script the predictions were declared first, in a
matrix of normalised numbers ranging between 0 and 1. These indicated the likelihood of
a residue being in a particular secondary structure state. The neural network would compare
these figures with the target matrix (secondary structure taken from DSSP), which was
declared immediately afterwards.
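The layout described above can be sketched as follows. This is a minimal illustration in Python with NumPy, not the authors' Matlab script: the function name `one_hot`, the three-letter state alphabet and the example strings are hypothetical, only three predictors are shown instead of ten, and hard 0/1 values stand in for the normalised likelihoods.

```python
import numpy as np

def one_hot(structure, state):
    """Return a 0/1 vector marking the residues that are in `state`."""
    return np.array([1.0 if s == state else 0.0 for s in structure])

# Hypothetical data: three prediction methods and six residues.
# 'H' = helix, 'E' = strand, '-' = other (an assumed alphabet).
predictions = ["HHHH--", "HHH-EE", "-HHHH-"]
dssp = "HHHH-E"  # assumed "true" secondary structure for the same residues

# Input matrix: one row per residue, one column per prediction method,
# all values between 0 and 1.
inputs = np.column_stack([one_hot(p, "H") for p in predictions])

# Target vector: the observed state the network output is compared with.
target = one_hot(dssp, "H")

print(inputs.shape)  # (residues, methods)
print(target)
```

Declaring the inputs first and the target second mirrors the order given in the text; a real script would hold ten columns, one per prediction method.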
2.5. The neural network
Next, a neural network was programmed in the neural network toolbox of Matlab 6.0. The
network was composed of an input layer (10 units), one layer of hidden units (10 units) and
an output layer (10 units). It used the standard 'errorsqr' error function from Matlab. The
number of learning iterations per protein was set to 300, which saved time
without reducing the learning performance of the network. The transfer function of the hidden
layer was the Matlab standard function 'tanh' and that of the output layer 'softmax'. The network
was run in a Matlab 6.0 implementation written by Tom Heskes (dept. of
Medical Physics & Biophysics, Nijmegen University).
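The architecture described above can be sketched as a forward pass in Python with NumPy. This is an illustration only, not the Matlab implementation used in the study: the weight initialisation, the bias terms, and the names `forward` and `squared_error` are assumptions, and a plain sum-of-squares error stands in for the 'errorsqr' function.

```python
import numpy as np

rng = np.random.default_rng(0)

N_IN, N_HIDDEN, N_OUT = 10, 10, 10  # layer sizes as stated in the text

# Small random weights and zero biases (the initialisation is an assumption).
W1 = rng.normal(scale=0.1, size=(N_IN, N_HIDDEN))
b1 = np.zeros(N_HIDDEN)
W2 = rng.normal(scale=0.1, size=(N_HIDDEN, N_OUT))
b2 = np.zeros(N_OUT)

def softmax(z):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def forward(x):
    """Hidden layer with tanh transfer, output layer with softmax transfer."""
    hidden = np.tanh(x @ W1 + b1)
    return softmax(hidden @ W2 + b2)

def squared_error(output, target):
    """Sum-of-squares error between network output and target."""
    return 0.5 * np.sum((output - target) ** 2)

# Four hypothetical residues, each with ten normalised predictions.
x = rng.uniform(0.0, 1.0, size=(4, N_IN))
out = forward(x)
print(out.shape, out.sum(axis=1))
```

Each output row sums to 1 because of the softmax transfer. Training would repeat a weight update on this error for the stated 300 iterations per protein; the update rule itself is not given in the text and is therefore not sketched here.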
In a training session the proteins were put through the neural network one by one.
After the network had been trained on the data set, a weight matrix containing the weights between
the hidden and output layers was extracted with the implementation mentioned above. These are
the weights for the respective secondary structure prediction methods. The higher the
weight, the better the performance of the prediction method.
3. Results
After all data were collected and transformed to Matlab scripts, weights were
assigned to all methods for predicting α-helix, β-sheet and combined prediction of both
α-helix and β-sheet (three different training sessions) on a test set of 1000 randomly assigned
proteins.
Table 1 shows the results of this experiment. It is clear that PREDATOR 2 and PHD
have been assigned the highest weights in comparison to other methods. Careful
observation of the data reveals another remarkable feature: though PHD has a weight of 6.1
for predicting β-sheet and a weight of 10.0 for predicting α-helix, it has a weight of 9.4 for
predicting both. One would expect the weight for the prediction of both α-helix and β-sheet
to be lower.
This can be accounted for by the percentages of α-helix and β-sheet residues in the
DSSP database and in the 6000 proteins used in the experiments. The percentage of β-sheet
(20.6% in DSSP, 20.8% in our set of 6000 proteins) is much lower than the percentage of
α-helix in these data sets (38.0% in DSSP, 36.2% in the training set). This explains why the
lower weight for predicting β-sheet is less reflected in the weight of the overall prediction of
both α-helix and β-sheet for the method PHD.
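The direction of this effect can be illustrated with a small calculation in Python. It is a rough sketch only: it assumes the two weights combine as an average weighted by residue composition, which the network is not stated to do, and it does not reproduce the reported combined weight of 9.4. The variable names are hypothetical; the numbers are those quoted above for PHD and for the set of 6000 proteins.

```python
# Residue composition of the set of 6000 proteins (percentages from the text).
helix_fraction, sheet_fraction = 36.2, 20.8

# Weights assigned to PHD in the separate training sessions (from the text).
helix_weight, sheet_weight = 10.0, 6.1

# Plain average: both states count equally.
plain_mean = (helix_weight + sheet_weight) / 2

# Composition-weighted average: each state counts in proportion to how
# often it occurs (an illustrative assumption, not the authors' method).
composition_mean = (
    helix_fraction * helix_weight + sheet_fraction * sheet_weight
) / (helix_fraction + sheet_fraction)

print(f"plain mean:       {plain_mean:.2f}")
print(f"composition mean: {composition_mean:.2f}")
```

Because helix residues outnumber sheet residues, the composition-weighted value lies closer to the helix weight than the plain mean does, which is the qualitative point made in the paragraph above.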