quantitative way requires generation of a code to represent the corre-
sponding amino acid types. Data encoding can be very intricate and
greatly impact the performance of the ANN. A common practice is to
encode each one of the 20 letters corresponding to the 20 amino acid
types of a protein into a numerical binary scheme. For instance, each
letter can be represented by a 20-dimensional binary vector, i.e. a
20-bin representation. In this case, each amino acid is represented
by a set of 19 zeros and a single one, uniquely positioned to identify
that amino acid. For example, A = 00000000000000000001,
C = 00000000000000000010, D = 00000000000000000100, etc.
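The 20-bin scheme described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the alphabetical ordering of the one-letter codes and the names AMINO_ACIDS, one_hot and encode_sequence are assumptions, chosen so that A maps to a vector ending in 1, C to one ending in 10, and D to one ending in 100.

```python
# One-letter amino acid codes in alphabetical order (an assumed ordering).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(residue: str) -> str:
    """Return the 20-bin code for one residue as a string of 0s and 1s."""
    index = AMINO_ACIDS.index(residue)  # raises ValueError for unknown letters
    bits = ["0"] * len(AMINO_ACIDS)
    bits[-1 - index] = "1"  # count positions from the right, so A -> ...0001
    return "".join(bits)


def encode_sequence(sequence: str) -> list:
    """Concatenate the per-residue codes into one flat ANN input vector."""
    return [int(bit) for residue in sequence for bit in one_hot(residue)]


print(one_hot("A"))  # 00000000000000000001
print(one_hot("C"))  # 00000000000000000010
print(one_hot("D"))  # 00000000000000000100
print(len(encode_sequence("ACD")))  # 60
```

A sequence of length n therefore yields 20 x n input values, which is why long sequences lead to large input layers under this scheme.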
Alternatively, a lower dimensional vector can be generated based on
the known physicochemical properties of each amino acid type. Both
schemes have their advantages and disadvantages. One of the major
advantages of binary representation is that very small changes in
amino acid composition between sequences can be easily detected and
mapped by the ANN. In fact, this type of encoding representation has
been successfully used for other tasks, such as clustering aligned
protein sequences [116]. On the other hand, depending on sequence length
and the number of samples in the data, 20-bin representation can greatly
increase the size and complexity of the ANN model. The use of large input
layers increases the probability of overtraining, i.e. the probability
that the ANN learns or memorizes insignificant patterns of the training
data [111,117]. Meanwhile, encoding schemes based on physicochemical
properties usually require lower dimensional vectors for amino
acid type representations [115,118,119]. In addition, such schemes allow the
encoding of a variety of features, such as hydrophobicity, volume,
bulkiness, etc. Furthermore, the number of features that can be
encoded does not necessarily impact the size of the vector. Schneider
and Wrede (1998) devised an encoding scheme using eigenvalues for
each amino acid type from principal components derived by principal
component analysis (PCA) of 143 physicochemical scales [120].
Physicochemical values or any combination of features, such as those
shown in Table 1, can be assigned to any amino acid type within a
sequence. Through either binary or physicochemical encoding proce-
dures, protein sequences are translated to numerical vectors for math-
ematical processing by the neural network.
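
As a rough sketch of the physicochemical alternative, the example below encodes each residue with a single hydrophobicity value and compares the resulting input size with that of the 20-bin scheme. The Kyte-Doolittle hydropathy index is used purely as an example scale; it is an assumption of this sketch and is not necessarily one of the features listed in Table 1. The names HYDROPATHY, encode_physicochemical and input_size are hypothetical. Note that in this simple version each additional scale adds one value per residue, unlike a PCA-based scheme, which compresses many scales into a few components.

```python
# Kyte-Doolittle hydropathy index, used here only as an example scale.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def encode_physicochemical(sequence, scales):
    """Map each residue to one value per scale and flatten the result."""
    return [scale[residue] for residue in sequence for scale in scales]


def input_size(sequence_length, values_per_residue):
    """Number of ANN input values needed for a sequence of a given length."""
    return sequence_length * values_per_residue


vector = encode_physicochemical("ACD", [HYDROPATHY])
print(vector)               # [1.8, 2.5, -3.5]
print(input_size(100, 20))  # 2000 inputs with 20-bin encoding
print(input_size(100, 1))   # 100 inputs with a single scale
```

For a 100-residue sequence, the single-scale encoding needs 100 input values where the 20-bin encoding needs 2000, which illustrates the reduction in input-layer size discussed above.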