A Preliminary Study on the Prediction of Human Protein Functions - Foundations on Natural and Artificial Computation


learning all the possible couples of two classes are learned. Thus, with K independent classes there are ((K − 1) · K) / 2 binary classification problems.
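
The count above can be checked by enumerating the class pairs directly. The following is a minimal sketch, not the authors' code; the function names and the class labels are hypothetical.

```python
from itertools import combinations


def pairwise_problems(classes):
    """List every unordered pair of classes, one binary problem per pair."""
    return list(combinations(classes, 2))


def count_pairwise_problems(k):
    """Closed-form count of binary problems for k classes: ((k - 1) * k) / 2."""
    return (k - 1) * k // 2


# Hypothetical class labels, used only for illustration.
classes = ["A", "B", "C", "D"]
pairs = pairwise_problems(classes)
print(len(pairs), count_pairwise_problems(len(classes)))
```

With four classes both expressions give six binary problems, and the number grows quadratically with K.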

Traditional learning algorithms used by neural networks, SVMs or decision trees learn to discriminate instances by assuming that classes are independent. Given that a protein can have more than one annotated function, the traditional learning scheme of these Machine Learning models is not completely appropriate.
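
The difference between the two settings is visible in the target vectors. The sketch below assumes a small, hypothetical set of three functions; it only illustrates the encoding and is not taken from the study.

```python
# Hypothetical label set, chosen only for illustration.
FUNCTIONS = ["binding", "catalytic activity", "transporter activity"]


def single_label_target(function):
    """Traditional scheme: exactly one class is active per instance."""
    return [1 if f == function else 0 for f in FUNCTIONS]


def multi_label_target(functions):
    """Multi-label scheme: any subset of classes may be active."""
    return [1 if f in functions else 0 for f in FUNCTIONS]


# A protein annotated with two functions cannot be encoded by the
# single-label scheme without discarding one of its annotations.
print(single_label_target("binding"))
print(multi_label_target({"binding", "catalytic activity"}))
```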

In Multi-Label Learning with neural networks, the traditional sum-of-squares error function has been modified to take into account the correlations between several target classes [16]. Like neural networks, decision trees and SVMs were recently adapted to multi-label learning [8], [15].
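
To make the idea of a modified error function concrete, the sketch below contrasts the sum-of-squares error with a pairwise ranking error that penalises every relevant class scored below an irrelevant one. This is one possible form of such an error; whether it matches the exact function of [16] is an assumption, and the function names are hypothetical.

```python
import math


def sum_of_squares_error(outputs, labels):
    """Traditional error: each class is treated on its own."""
    return sum((o - y) ** 2 for o, y in zip(outputs, labels))


def pairwise_ranking_error(outputs, labels):
    """Error over (relevant, irrelevant) class pairs for one instance.

    Each pair contributes exp(-(c_k - c_l)), which is small when the
    relevant class k is scored well above the irrelevant class l.
    """
    relevant = [k for k, y in enumerate(labels) if y == 1]
    irrelevant = [l for l, y in enumerate(labels) if y == 0]
    if not relevant or not irrelevant:
        return 0.0  # no pair to compare for this instance
    total = sum(
        math.exp(-(outputs[k] - outputs[l]))
        for k in relevant
        for l in irrelevant
    )
    return total / (len(relevant) * len(irrelevant))


labels = [1, 1, 0]
good = pairwise_ranking_error([1.0, 1.0, -1.0], labels)  # relevant classes ranked first
bad = pairwise_ranking_error([-1.0, -1.0, 1.0], labels)  # relevant classes ranked last
print(good < bad)
```

Because the error is defined on pairs of classes, it depends on how the outputs of different classes relate to each other, which the per-class sum of squares does not capture.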


3 The Human Proteome Dataset

The number of human proteins extracted from the UniProtKB/Swiss-Prot database (UniProtKB release 15.13) was 20'228. An independent test set of 44 “newly functionalised” proteins was extracted from a more recent version of this database (UniProtKB release 2011 1).

3.1 Input Vectors

The input vectors of the human proteome dataset were generated according to a number of features pertaining to the protein sequence. Except for protein interactions, all the input features result from predictors. Table 1 presents the list of these features. Each protein is represented by 33'102 input variables. More than 20'000 Boolean input variables are due to protein/protein interactions extracted from STRING. InterPro domains represent more than 7'000 inputs, and PROSITE patterns/profiles give more than 5'000 inputs. As a result, input vectors are essentially sparse.
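
Since most of the 33'102 inputs are zero for a given protein, a vector can be stored as the list of its active indices. The sketch below assumes Boolean features only and uses made-up indices; it is an illustration, not the storage format used in the study.

```python
N_INPUTS = 33102  # number of input variables per protein, as stated in the text


def to_sparse(dense):
    """Return the indices of the non-zero entries of a dense vector."""
    return [i for i, value in enumerate(dense) if value]


def to_dense(active, n_inputs=N_INPUTS):
    """Rebuild the dense 0/1 vector from its active indices."""
    dense = [0] * n_inputs
    for i in active:
        dense[i] = 1
    return dense


def density(active, n_inputs=N_INPUTS):
    """Fraction of inputs that are active."""
    return len(active) / n_inputs


# Hypothetical protein with three active features (indices are made up).
active = [12, 20500, 33101]
dense = to_dense(active)
print(len(dense), to_sparse(dense) == active, density(active))
```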

3.2 Output Vectors

The target vectors are defined according to the Gene Ontology (GO), which is

a major bioinformatics initiative to unify the representation of gene and gene

product attributes across all species. GO is represented as a directed acyclic

graph covering three distinct sections:

1. Molecular function (“F”)

2. Biological process (“P”)

3. Cellular component (“C”)

GO nodes at the top of the structure are the most general, whereas very specific properties lie at the bottom. For instance, in the “F” branch the top node is “Molecular function”, and several levels below it there is “Hydrolase activity acting on ester bonds”, which represents a more specific function than “Molecular function”.
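
The general-to-specific ordering can be sketched with a small parent map. The fragment below is a hand-written illustration, not the GO release used in the study; the intermediate terms and the term "binding" are included only to make the example concrete.

```python
# Each term maps to its more general parents (a DAG: a term may have several).
PARENTS = {
    "molecular function": [],
    "binding": ["molecular function"],
    "catalytic activity": ["molecular function"],
    "hydrolase activity": ["catalytic activity"],
    "hydrolase activity, acting on ester bonds": ["hydrolase activity"],
}


def ancestors(term, parents=PARENTS):
    """Collect every more general term reachable from `term`."""
    seen = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def propagate(annotations, parents=PARENTS):
    """Extend a set of annotated terms with all of their ancestors."""
    full = set(annotations)
    for term in annotations:
        full |= ancestors(term, parents)
    return full


specific = "hydrolase activity, acting on ester bonds"
print(sorted(ancestors(specific)))
```

Under this reading, a protein annotated with a specific term is also an instance of every more general term above it, so a single annotation can activate several target classes at once.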
