Feature representation is perhaps the most important aspect of
data pre-processing. When analyzing proteins, each letter in the
molecular string represents an amino acid. Each amino acid has
unique chemical properties associated with specific states or values.
A string of characters can be given biological meaning by substituting
the characters with their respective chemical values. Many hydrophobic [121-125],
chemical [126], physical [127-130], statistical preference [131], biological [132],
and mathematical [133] scales of amino acid attributes have been
used. Because of the large number of possible feature representations,
data pre-processing of biological sequences can be difficult and confusing.
The guide at this crucial step in the design and application of
an ANN system should be the problem definition, i.e. determination
of the desired input and output mapping for the specified task or goal.
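
As a minimal sketch of the substitution described above, the following Python
snippet replaces each residue letter with a value from a hydrophobicity scale.
The Kyte-Doolittle hydropathy scale is used here only as an illustrative
assumption; the text does not single out a particular scale. The function names
encode_with_scale and window_average and the example peptide are hypothetical.

```python
# Kyte-Doolittle hydropathy values: one published hydrophobicity scale,
# chosen here purely for illustration (an assumption, not named in the text).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def encode_with_scale(sequence, scale=KYTE_DOOLITTLE):
    """Replace each residue letter with its value on the given scale."""
    values = []
    for residue in sequence.upper():
        if residue not in scale:
            raise ValueError(f"Unknown residue: {residue!r}")
        values.append(scale[residue])
    return values


def window_average(values, window=3):
    """Smooth per-residue values with a sliding-window mean."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


encoded = encode_with_scale("MKTAY")  # hypothetical example peptide
smoothed = window_average(encoded, window=3)
print(encoded)
print(smoothed)
```

Swapping in a different dictionary (for example a chemical or statistical
preference scale) changes the information presented to the ANN without
changing the rest of the pipeline, which is why the choice of scale should
follow from the problem definition.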
The selection of an appropriate feature representation and encoding
method for a given task both limits and specifies the information presented
to the ANN. Furthermore, it establishes the structural and functional
parameters that can improve the accuracy of the ANN, and it allows the
extraction of statistical regularities or hidden features in the given
sequences. As with any statistical analysis or machine-learning
technique, the numerical vectors that result from the feature
representation and encoding of the molecular sequences must be
internally consistent: vectors of similar sequences should lie close
together, and vectors of dissimilar sequences should lie far apart. This
is important if vectors are to carry the biological information of the
sequences they represent and maintain both the biological uniqueness and
the diversity that result from amino acid composition or sequence
length. In contrast, poor feature representation and inadequate encoding
methods can produce uninformative vectors, preventing maximal extraction
of the statistical features that connect sequences with their structures
and functions.
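
The premise that vectors of similar sequences should lie close together can be
checked directly on a candidate encoding. The sketch below assumes one simple
encoding, a 20-dimensional amino acid composition vector, and Euclidean
distance as the measure of closeness; both are illustrative choices rather
than methods prescribed by the text, and the three toy sequences are
hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def composition_vector(sequence):
    """Return the fraction of each standard amino acid in the sequence."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    return [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]


def euclidean_distance(u, v):
    """Euclidean distance between two equal-length numeric vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical toy sequences, not taken from any database.
reference = "MKTAYIAKQR"
near = "MKTAYIAKQK"   # differs from the reference by one substitution
far = "GGGPPPSSSG"    # shares no residues with the reference

d_near = euclidean_distance(composition_vector(reference),
                            composition_vector(near))
d_far = euclidean_distance(composition_vector(reference),
                           composition_vector(far))

# A usable encoding should place the near sequence closer than the far one.
print(d_near < d_far)
```

Composition vectors discard residue order and, because they are normalized by
length, also discard sequence length, so this check is only a first sanity
test; an encoding meant to preserve those properties would need additional
features.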
Many sources and tools for performing the numeric transformation
of biological sequences are available. In addition to published data, an
increasing number of web-based tools can assist researchers in performing
these numerical transformations automatically. The Expert Protein
Analysis System (ExPASy) from the Swiss Institute of Bioinformatics
(http://us.expasy.org/) has several sequence analysis tools and software