(Esum-squared). The network attempts to minimize this squared error by adjusting weights
and biases.
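As a minimal illustration of the error measure mentioned above, the sum-squared error can be computed as follows. This is only a sketch: the function name and the example values are hypothetical, and some definitions include an additional factor of 1/2.

```python
def sum_squared_error(targets, outputs):
    """Return the sum of squared differences between targets and outputs."""
    if len(targets) != len(outputs):
        raise ValueError("targets and outputs must have the same length")
    return sum((t - o) ** 2 for t, o in zip(targets, outputs))


# Hypothetical example values: a target vector and a network output vector.
target = [1.0, 0.0, 0.0]
output = [0.5, 0.5, 0.0]
error = sum_squared_error(target, output)
```

Training then amounts to changing the weights and biases so that this value decreases over the training set.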
In the experiments described in this report, a simple feed-forward neural network
was used. The network did not act as a consensus classifier itself. Instead, it was
used to assign weights to the secondary structure prediction methods.
2. Materials and methods
2.1. Preparation of the datasets
In order to attain the goal of improving the consensus prediction feature of
SecCons, a training data set for the neural network was built first. This data set was
composed of 6000 proteins. For each protein, predictions by ten different secondary
structure prediction methods and the verified secondary structure were collected. The true secondary
structure of a protein was extracted from the PDB by Kabsch & Sander's DSSP program
[27]. This allows the neural network to learn from the predictions in the data set by
comparing them with the true secondary structure. The sequences of the proteins were also
taken from the DSSP files and used as input for the prediction programs. All sequences in
the data set had a length of at least 25 residues. Sequences with errors were also excluded
from the data set.
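The length and error filters described above could be sketched as follows. The text does not define what counts as a sequence error, so this sketch assumes, purely for illustration, that any character outside the 20 standard one-letter amino acid codes marks an erroneous sequence. All names here are hypothetical.

```python
# Assumption: an "error" is any character outside the 20 standard residues.
STANDARD_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")
MIN_LENGTH = 25  # minimum sequence length stated in the text


def is_usable(sequence, min_length=MIN_LENGTH):
    """True if the sequence is long enough and uses only standard residues."""
    return len(sequence) >= min_length and set(sequence) <= STANDARD_RESIDUES


def filter_sequences(sequences):
    """Keep only the usable entries of a {name: sequence} dictionary."""
    return {name: seq for name, seq in sequences.items() if is_usable(seq)}
```

A real pipeline would probably need a more careful definition of an erroneous entry, for example chain breaks or unknown residues in the DSSP file.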
A second data set was made from a selection of proteins that met the
following criteria:
1. The protein was added to the database after the programs were released (1997). This
   was checked using the local SRS (Sequence Retrieval Server) database server.
2. The protein is not similar to other proteins in the database (less than 30 percent
   sequence homology). To verify this the PDBSelect algorithm was used [28]. The
   algorithm picked structures from the PDB and used the program WHAT IF [29] to
   do pairwise alignments. If there was a match higher than 30%, the structure with the
   lower resolution was removed from the list.
3. The protein is present in the aforementioned data set of 6000 proteins.
These criteria yielded a data set of 301 proteins.
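The redundancy criterion (item 2) can be sketched as a greedy filter. This is not the PDBSelect implementation, only a simplified illustration under stated assumptions: "lower resolution" is read as the poorer structure, that is, the larger value in Ångström, and the pairwise match is supplied by a caller-provided function standing in for the WHAT IF alignments. All names and values are hypothetical.

```python
def select_nonredundant(structures, identity, threshold=30.0):
    """Greedy selection of structures with pairwise matches <= threshold.

    structures: {name: resolution in Angstrom}; smaller is assumed better.
    identity:   callable(name_a, name_b) -> percent match (hypothetical
                stand-in for a pairwise alignment score).
    """
    kept = []
    # Visit the best-resolved structures first, so that in a similar pair
    # the structure with the poorer resolution is the one left out.
    for name in sorted(structures, key=lambda n: structures[n]):
        if all(identity(name, other) <= threshold for other in kept):
            kept.append(name)
    return kept


# Toy data with made-up names: protA and protB match at 80%.
toy_pairs = {frozenset(("protA", "protB")): 80.0}


def toy_identity(a, b):
    return toy_pairs.get(frozenset((a, b)), 10.0)


toy_structures = {"protA": 1.5, "protB": 2.5, "protC": 2.0}
selected = select_nonredundant(toy_structures, toy_identity)
```

In this toy run protB is dropped, because it matches the better-resolved protA above the 30% threshold.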
2.2. Creating the target output in the data files
As mentioned before, the target files, which contained the verified secondary structure, were
taken from DSSP files. Definitions of secondary structure differ in the number of
states they distinguish. DSSP, for instance, defines the states coil (C) (or turn (T)),
bend (S), 3-10 helix (G), short beta bridge (B) and pi helix (I) in addition to the
structure elements α-helix (H) and β-sheet (E).
Furthermore, some of the secondary structure prediction programs used also predict
the secondary structure elements coil or turn, while others only predict the elements
α-helix and β-sheet. Because the other programs do not have this feature, it was left
out of the predictions. The states considered in this report are therefore reduced to
α-helix, β-sheet and "other". Accordingly, the states 3-10 helix (G) and pi helix (I)
in the DSSP file were converted to α-helix in the target sequence of the data files. The
short beta bridge (B) element from DSSP was likewise translated to β-sheet in the target sequences.
This conversion was performed automatically by the SecCons program (see below), which
also converted the DSSP elements bend (S) and (s) to Turn (T) and Coil (C) respectively.
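The reduction to three states described in this section can be sketched as a simple lookup. This is an illustration, not the SecCons code: the function name is hypothetical, and collapsing every remaining DSSP state into a single "other" symbol (here "-") is an assumption made for brevity.

```python
# Mapping taken from the text: G and I count as helix, B counts as sheet.
DSSP_TO_THREE_STATE = {
    "H": "H", "G": "H", "I": "H",  # alpha-helix, 3-10 helix, pi helix
    "E": "E", "B": "E",            # beta-sheet, short beta bridge
}


def reduce_dssp(states, other="-"):
    """Map a string of DSSP states to helix (H), sheet (E) or `other`."""
    return "".join(DSSP_TO_THREE_STATE.get(s, other) for s in states)


reduced = reduce_dssp("HGIEBTS C")  # gives "HHHEE----"
```

Any state not listed in the table, including turn, bend, coil and blank, falls through to the "other" symbol.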