Rational Design of Viral Protein Structures with Predetermined Immunological Properties - Structure-Based Study of Viral Replication


quantitative way requires generation of a code to represent the

corresponding amino acid types. Data encoding can be very intricate and

greatly impact the performance of the ANN. A common practice is to

encode each one of the 20 letters corresponding to the 20 amino acid

types of a protein into a numerical binary scheme. For instance, each

letter can be represented by a 20-dimensional binary vector, i.e. a

20-bin representation. In such a case, amino acids are represented

by a set of 19 zeros and a single one uniquely positioned to represent

each amino acid. For example, A = 00000000000000000001,

C = 00000000000000000010, D = 00000000000000000100, etc.
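
The 20-bin scheme can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the alphabetical ordering of the one-letter codes beyond A, C and D is an assumption, and the helper names (`one_hot`, `encode_sequence`, `as_bits`) are hypothetical.

```python
# One-letter amino acid codes in alphabetical order.
# Assumption: the text only fixes the positions of A, C and D.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(residue):
    """Return the 20-bin vector for one residue: 19 zeros and a single one."""
    vec = [0] * len(AMINO_ACIDS)
    # Count positions from the right so that A -> ...0001, C -> ...0010, D -> ...0100.
    vec[len(AMINO_ACIDS) - 1 - AMINO_ACIDS.index(residue)] = 1
    return vec


def encode_sequence(sequence):
    """Encode a sequence of length L as a list of L vectors with 20 bins each."""
    return [one_hot(residue) for residue in sequence]


def as_bits(vec):
    """Render a vector as a string of 0s and 1s for display."""
    return "".join(str(bit) for bit in vec)


for residue in "ACD":
    print(residue, "=", as_bits(one_hot(residue)))

# Flattening the per-residue vectors gives the values fed to the input layer.
input_vector = [bit for row in encode_sequence("ACDKW") for bit in row]
print(len(input_vector))  # 20 bins * 5 residues = 100 input values
```

The flattened vector grows by 20 values per residue, so the input layer widens in direct proportion to sequence length.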

Alternatively, a lower dimensional vector can be generated based on

the known physicochemical properties of each amino acid type. Both

schemes have their advantages and disadvantages. One of the major

advantages of binary representation is that very small changes in

amino acid composition between sequences can be easily detected and

mapped by the ANN. In fact, this type of encoding has

been successfully used for other tasks, such as clustering aligned

protein sequences [116]. On the other hand, depending on sequence length

and the number of samples in the data, the 20-bin representation can greatly

increase the size and complexity of the ANN model. The use of large input

layers increases the probability of overtraining, i.e. the probability

that the ANN learns or memorizes insignificant patterns of the training

data [111,117]. Meanwhile, encoding schemes based on physicochemical

properties usually require lower dimensional vectors for amino

acid type representations [115,118,119]. In addition, such schemes allow the

encoding of a variety of features, such as hydrophobicity, volume,

and bulkiness. Furthermore, the number of features that can be

encoded does not necessarily impact the size of the vector. Schneider

and Wrede (1998) devised an encoding scheme using eigenvalues for

each amino acid type from principal components derived by principal

component analysis (PCA) of 143 physicochemical scales [120].
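
The general idea of compressing several property scales into principal components can be sketched as follows. This is not the cited scheme. Three scales stand in for the 143 and were chosen only for illustration: the Kyte-Doolittle hydropathy index, the molecular weight of the free amino acid, and a nominal side-chain charge at neutral pH, with histidine treated as neutral as a simplification. The value computed per amino acid here is its projection onto the leading component, found by power iteration; whether that matches the exact quantity used in the cited work is not established by the text. All function names are hypothetical.

```python
import math

# residue: (Kyte-Doolittle hydropathy, molecular weight in g/mol, nominal charge)
# Assumption: these three scales are illustrative stand-ins, not the 143 scales.
PROPERTIES = {
    "A": (1.8, 89.09, 0), "C": (2.5, 121.16, 0), "D": (-3.5, 133.10, -1),
    "E": (-3.5, 147.13, -1), "F": (2.8, 165.19, 0), "G": (-0.4, 75.07, 0),
    "H": (-3.2, 155.16, 0), "I": (4.5, 131.17, 0), "K": (-3.9, 146.19, 1),
    "L": (3.8, 131.17, 0), "M": (1.9, 149.21, 0), "N": (-3.5, 132.12, 0),
    "P": (-1.6, 115.13, 0), "Q": (-3.5, 146.15, 0), "R": (-4.5, 174.20, 1),
    "S": (-0.8, 105.09, 0), "T": (-0.7, 119.12, 0), "V": (4.2, 117.15, 0),
    "W": (-0.9, 204.23, 0), "Y": (-1.3, 181.19, 0),
}


def standardize(rows):
    """Scale every column to zero mean and unit sample variance."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / (n - 1))
            for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


def covariance(rows):
    """Sample covariance matrix of already centered rows."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(d)]
            for i in range(d)]


def matvec(matrix, vec):
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def leading_component(cov, iterations=500):
    """Power iteration: approximate the largest eigenvalue and its unit eigenvector."""
    v = [1.0] * len(cov)
    for _ in range(iterations):
        w = matvec(cov, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(a * b for a, b in zip(v, matvec(cov, v)))
    return eigenvalue, v


residues = sorted(PROPERTIES)
z = standardize([PROPERTIES[r] for r in residues])
eigenvalue, axis = leading_component(covariance(z))

# One number per amino acid type: its score on the first principal component.
pc1 = {r: sum(a * x for a, x in zip(axis, row)) for r, row in zip(residues, z)}
for r in residues:
    print(r, round(pc1[r], 3))
```

Keeping only the leading components yields a short vector per residue whose length does not depend on how many scales were analysed.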

Physicochemical values or any combination of features, such as those

shown in Table 1, can be assigned to any amino acid type within a

sequence. Through either binary or physicochemical encoding

procedures, protein sequences are translated to numerical vectors for

mathematical processing by the neural network.
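
A physicochemical encoding can be sketched in the same spirit. The two features used here are assumptions made for illustration and are not taken from Table 1: the Kyte-Doolittle hydropathy index and the molecular weight of the free amino acid, the latter serving only as a rough proxy for residue size. The names `FEATURES`, `encode_sequence` and `flatten` are hypothetical.

```python
# residue: (Kyte-Doolittle hydropathy, molecular weight in g/mol)
# Assumption: an illustrative feature set, not the one in the chapter's Table 1.
FEATURES = {
    "A": (1.8, 89.09), "C": (2.5, 121.16), "D": (-3.5, 133.10),
    "E": (-3.5, 147.13), "F": (2.8, 165.19), "G": (-0.4, 75.07),
    "H": (-3.2, 155.16), "I": (4.5, 131.17), "K": (-3.9, 146.19),
    "L": (3.8, 131.17), "M": (1.9, 149.21), "N": (-3.5, 132.12),
    "P": (-1.6, 115.13), "Q": (-3.5, 146.15), "R": (-4.5, 174.20),
    "S": (-0.8, 105.09), "T": (-0.7, 119.12), "V": (4.2, 117.15),
    "W": (-0.9, 204.23), "Y": (-1.3, 181.19),
}


def encode_sequence(sequence, features=FEATURES):
    """Map each residue to its feature values: L rows of k numbers each."""
    return [list(features[residue]) for residue in sequence]


def flatten(matrix):
    """Concatenate the per-residue rows into one input vector."""
    return [value for row in matrix for value in row]


sequence = "ACDK"
inputs = flatten(encode_sequence(sequence))
print(inputs)
print(len(inputs))         # 2 features * 4 residues = 8 input values
print(20 * len(sequence))  # the 20-bin scheme would need 80
```

Because the features sit on very different numerical ranges, the columns would usually be rescaled to a common range before training.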

