learning, all possible pairs of classes are learned. Thus, with K independent
classes there are K(K − 1)/2 binary classification problems.
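The count of pairwise problems can be checked with a short sketch. The class labels and the helper name pairwise_problems are hypothetical and serve only as illustration; the sketch enumerates every unordered pair of distinct classes, one binary problem per pair.

```python
from itertools import combinations


def pairwise_problems(classes):
    """Return every unordered pair of distinct classes (one binary problem each)."""
    return list(combinations(classes, 2))


# Hypothetical class labels, chosen only for illustration.
classes = ["A", "B", "C", "D"]
pairs = pairwise_problems(classes)

K = len(classes)
# The number of pairs matches the closed form K(K - 1)/2.
assert len(pairs) == K * (K - 1) // 2
print(len(pairs), pairs)
```

The number of binary problems grows quadratically with K, which matters when the number of classes is large.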
Traditional learning algorithms used by neural networks, SVMs or decision
trees learn to discriminate instances under the assumption that classes are
independent. Given that a protein can have more than one annotated function,
the traditional learning scheme of these Machine Learning models is not
completely appropriate.
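The difference between a single-label and a multi-label target can be sketched as follows. The function names and the three class labels are hypothetical; the sketch only shows how a protein with several annotated functions needs a target vector with more than one active entry.

```python
def one_hot(label, classes):
    """Single-label target: exactly one class is active."""
    return [1 if c == label else 0 for c in classes]


def multi_hot(labels, classes):
    """Multi-label target: any subset of the classes may be active."""
    label_set = set(labels)
    return [1 if c in label_set else 0 for c in classes]


# Hypothetical functional classes, chosen only for illustration.
classes = ["kinase activity", "DNA binding", "transporter activity"]

single = one_hot("DNA binding", classes)
multi = multi_hot(["kinase activity", "DNA binding"], classes)
print(single, multi)
```

A single-label scheme would force a choice between the two functions of the second protein, whereas the multi-label target keeps both.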
In Multi-Label Learning with neural networks, the traditional sum-of-squares
error function has been modified to take into account the correlations between
several target classes [16]. Like neural networks, decision trees and SVMs
have recently been adapted to multi-label learning [8], [15].
3 The Human Proteome Dataset
The number of human proteins extracted from the UniProtKB/Swiss-Prot
database (UniProtKB release 15.13) was 20'228. An independent test set
of 44 “newly functionalised” proteins was extracted from a more recent
version of this database (UniProtKB release 2011 1).
3.1 Input Vectors
The input vectors of the human proteome dataset were generated from a
number of features pertaining to the protein sequence. Except for protein
interactions, all the input features result from predictors. Table 1 presents
the list of these features. Each protein is represented by 33'102 input
variables. More than 20'000 Boolean input variables are due to protein/protein
interactions extracted from STRING. InterPro domains account for more than
7'000 inputs, and PROSITE patterns/profiles for more than 5'000 inputs. As a
result, input vectors are essentially sparse.
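One way to store such sparse vectors is to keep only the indices of the active inputs. The sketch below is a minimal illustration, assuming purely Boolean features; the helper names and the feature indices are hypothetical, and only the vector length of 33'102 comes from the text.

```python
N_FEATURES = 33102  # input variables per protein, as stated in the text


def to_sparse(active_indices):
    """Store a Boolean vector as the set of indices whose value is 1."""
    indices = frozenset(active_indices)
    if any(i < 0 or i >= N_FEATURES for i in indices):
        raise ValueError("feature index out of range")
    return indices


def dot(a, b):
    """Dot product of two Boolean sparse vectors: number of shared active inputs."""
    return len(a & b)


def density(v):
    """Fraction of the inputs that are active."""
    return len(v) / N_FEATURES


# Hypothetical proteins with a handful of active features each.
protein_a = to_sparse([12, 507, 20431])
protein_b = to_sparse([507, 9000, 20431, 33101])

shared = dot(protein_a, protein_b)
print(shared, density(protein_a))
```

Storage and dot products then scale with the number of active inputs rather than with the full vector length.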
3.2 Output Vectors
The target vectors are defined according to the Gene Ontology (GO), which is
a major bioinformatics initiative to unify the representation of gene and gene
product attributes across all species. GO is represented as a directed acyclic
graph covering three distinct sections:
1. Molecular function (“F”)
2. Biological process (“P”)
3. Cellular component (“C”)
GO nodes at the top of the structure are the most general, whereas the most
specific properties lie at the bottom. For instance, at the top of the “F”
branch is “molecular function”, and several levels below is “hydrolase
activity, acting on ester bonds”, which represents a more specific kind of
“molecular function”.
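The general-to-specific ordering can be sketched with a small child-to-parents map. The fragment below is a hypothetical, hand-written excerpt of the “F” branch (the intermediate terms are assumed for illustration and are not taken from the text), and the function ancestors is an illustrative helper, not part of any GO library.

```python
# Hypothetical fragment of the "F" branch: each term maps to its direct parents.
# In a directed acyclic graph a term may have several parents, hence the lists.
PARENTS = {
    "molecular function": [],
    "catalytic activity": ["molecular function"],
    "hydrolase activity": ["catalytic activity"],
    "hydrolase activity, acting on ester bonds": ["hydrolase activity"],
}


def ancestors(term, parents=PARENTS):
    """Collect every more general term reachable from `term`."""
    seen = set()
    stack = list(parents[term])
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents[node])
    return seen


specific = "hydrolase activity, acting on ester bonds"
print(sorted(ancestors(specific)))
```

Under this representation, a protein annotated with a specific term can also be associated with all of that term's more general ancestors.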
 