Secondary Structure Classification of Isoform Protein Markers in Oncology - Mathematical Approaches to Polymer Sequence Analysis and Related Problems


3.5 Experimental Results

In this implementation, the amino acid sequence of each protein is subdivided into segments of 13 amino acids. The amino acids are numbered from 1 to 20 and each is coded as a five-bit string, so that each pattern vector is composed of 65 binary elements; the 66th element is the class label of the median amino acid of the segment, element number 7. The training set is drawn from the Protein Data Bank (PDB) [3], and each chosen segment of a protein's amino acid sequence defines one pattern vector. No multiple alignment information is included. The subsequent segment again consists of 13 consecutive amino acids, starting from the second amino acid of the preceding segment and adding, as its 13th element, the amino acid that follows the final one of the preceding segment. Two consecutive patterns thus differ only in that the first element of the earlier one is dropped and a new last element is added. Formally, a window of 13 amino acids is considered, and each pattern is formed by shifting the window by one position. Particular techniques are applied to initialise and terminate the patterns of a protein, and the class assigned to each pattern is always the folding class of the seventh element in the pattern [5].
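The windowing scheme described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the ordering of the 20 amino acids (`AMINO_ACIDS`), the use of the plain binary representation of the number 1 to 20 as the five-bit code, and the single-letter class labels are all assumptions. The "particular techniques" used to initialise and terminate the patterns of a protein are not given in the text, so this sketch only emits complete 13-residue windows.

```python
# Assumed ordering of the 20 amino acids (one-letter codes, alphabetical);
# the text only states that they are numbered from 1 to 20.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
WINDOW = 13           # residues per segment
BITS = 5              # bits per residue
CENTRE = WINDOW // 2  # index 6, i.e. element number 7 of the segment


def encode_residue(aa):
    """Return the assumed five-bit code of one residue, most significant bit first."""
    number = AMINO_ACIDS.index(aa) + 1  # 1..20; raises ValueError if unknown
    return [(number >> shift) & 1 for shift in range(BITS - 1, -1, -1)]


def make_patterns(sequence, classes):
    """Build 66-element patterns: 65 binary inputs plus the class of residue 7.

    `classes` holds one folding-class label per residue of `sequence`.
    Only complete windows are produced (end handling is not specified here).
    """
    if len(sequence) != len(classes):
        raise ValueError("sequence and classes must have the same length")
    patterns = []
    for start in range(len(sequence) - WINDOW + 1):
        window = sequence[start:start + WINDOW]
        bits = [bit for aa in window for bit in encode_residue(aa)]
        patterns.append(bits + [classes[start + CENTRE]])
    return patterns


if __name__ == "__main__":
    # Hypothetical 15-residue sequence with made-up class labels.
    demo = make_patterns("ACDEFGHIKLMNPQR", "CCCHHHHHHEEECCC")
    print(len(demo), "patterns of length", len(demo[0]))
```

Because the window moves by one residue at a time, the first 60 bits of each pattern are the last 60 input bits of the pattern before it.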

Consider all the protein sequences included in a training set and compare them pairwise to determine the number of aligned amino acids common to the two proteins. An appropriate procedure is used to obtain the largest number of aligned amino acids by sliding the two sequences up and down and also inserting pieces of the string, according to strict rules [3]. For the similarity classification of the proteins in the training set, the largest alignment value of each protein is determined from the percentage of amino acids aligned with the other proteins in the training set, and eight convenient similarity classes are defined by setting suitable intervals of alignment percentage. Table 3.1 shows the similarity classes together with the corresponding interval of alignment score, or similitude, that is, the interval containing the largest alignment percentage of the protein within the training set.
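The pairwise comparison can be illustrated with a small sketch. The strict alignment rules of [3] are not reproduced in this section, so the code below swaps in a longest-common-subsequence (LCS) count as a stand-in for the number of aligned amino acids. Normalising by the length of the shorter sequence, and the function names themselves, are assumptions made only for illustration.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences (stand-in score)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def aligned_fraction(a, b):
    """Fraction of aligned residues, relative to the shorter sequence (assumed)."""
    if not a or not b:
        raise ValueError("sequences must be non-empty")
    return lcs_length(a, b) / min(len(a), len(b))


def largest_alignment(proteins):
    """Map each protein name to its largest aligned fraction against all others.

    `proteins` is a dict of name -> amino acid sequence (one-letter codes).
    """
    best = {}
    for name, seq in proteins.items():
        scores = [aligned_fraction(seq, other)
                  for other_name, other in proteins.items()
                  if other_name != name]
        best[name] = max(scores, default=0.0)
    return best


if __name__ == "__main__":
    # Hypothetical toy training set.
    toy = {"p1": "ACDEF", "p2": "ACXEF", "p3": "WWWWW"}
    print(largest_alignment(toy))
```

The value returned for each protein plays the role of the similitude used to assign its similarity class.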

For the purpose of this analysis, without loss of generality, proteins belonging to an isoform class are defined as proteins belonging to similarity class 7. This is taken as a necessary but not a sufficient condition, since isoforms may have very different similarity, in which case the markers can be easily identified by traditional methods. Here, it is important to determine isoforms of proteins, which

Table 3.1 Similarity classes and percentage similarity among proteins

Similarity class    Similitude
0                   <0.30
1                   0.30-0.40
2                   0.40-0.50
3                   0.50-0.60
4                   0.60-0.70
5                   0.70-0.80
6                   0.80-0.90
7                   >0.90
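The mapping in Table 3.1 from a similitude value to a similarity class can be sketched as below. The table does not say which class a value lying exactly on a shared boundary (0.40, 0.50, ..., 0.80) belongs to; this sketch assumes the higher class, while 0.30 and 0.90 follow the strict inequalities printed in the table. The helper `is_isoform_candidate` is a hypothetical name for the class 7 condition, which the text treats as necessary but not sufficient.

```python
from bisect import bisect_right

# Lower bounds of classes 1..6 as listed in Table 3.1.
LOWER_BOUNDS = [0.30, 0.40, 0.50, 0.60, 0.70, 0.80]


def similarity_class(similitude):
    """Return the similarity class (0..7) for a similitude in [0, 1]."""
    if not 0.0 <= similitude <= 1.0:
        raise ValueError("similitude must lie between 0 and 1")
    if similitude > 0.90:  # class 7 is strictly above 0.90 in the table
        return 7
    # Number of lower bounds that are <= similitude gives classes 0..6;
    # shared boundaries go to the higher class (assumption).
    return bisect_right(LOWER_BOUNDS, similitude)


def is_isoform_candidate(similitude):
    """Class 7 membership: a necessary, not sufficient, condition in the text."""
    return similarity_class(similitude) == 7


if __name__ == "__main__":
    for value in (0.12, 0.45, 0.85, 0.97):
        print(value, similarity_class(value))
```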
