Prediction of Protein Function - Genomics: Essential Methods

can be downloaded and used may require these annotation data to be provided separately. If you wish to implement your own function prediction application based on GO, you will also need to obtain the scheme and annotation data files.

9.2.2 Working with multiple protein identifier systems

To predict functions for proteins, we often need to use information from more than one database. For example, to predict functions for proteins from the S. cerevisiae genome, we may use functional annotations from GO, protein sequences from the Comprehensive Yeast Genome Database (CYGD) or the SGD [39], and protein-protein interactions from the Biomolecular Interaction Network Database (BIND) [43] or the Biological General Repository for Interaction Datasets (BioGRID) [44]. These databases may refer to the same genes/proteins using different identifiers.
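To make the problem concrete, the following is a minimal sketch of combining two data sets that use different identifier systems. All identifiers, the cross-reference table `b_to_a`, and the helper names `translate_edges` and `neighbour_terms` are hypothetical and chosen only for illustration; the GO term pairings are arbitrary and do not reflect real annotations.

```python
from collections import defaultdict

# Toy data. Identifiers are invented; GO term pairings are arbitrary.
go_annotations = {            # keyed by identifier system "A"
    "A:0001": {"GO:0006412"},
    "A:0002": {"GO:0006260"},
    "A:0003": {"GO:0008150"},
}
interactions = [              # keyed by identifier system "B"
    ("B:x1", "B:x2"),
    ("B:x2", "B:x9"),
]
# Hypothetical cross-reference table; "B:x9" is deliberately missing.
b_to_a = {"B:x1": "A:0001", "B:x2": "A:0002"}


def translate_edges(edges, xref):
    """Rewrite interaction edges into the annotation identifier system.

    Returns the translated edges and the set of identifiers with no mapping.
    Edges with at least one unmapped endpoint are dropped.
    """
    translated, unmapped = [], set()
    for left, right in edges:
        missing = {p for p in (left, right) if p not in xref}
        if missing:
            unmapped |= missing
            continue
        translated.append((xref[left], xref[right]))
    return translated, unmapped


def neighbour_terms(edges, annotations):
    """For each protein, collect the GO terms annotated to its partners."""
    terms = defaultdict(set)
    for left, right in edges:
        terms[left] |= annotations.get(right, set())
        terms[right] |= annotations.get(left, set())
    return dict(terms)


edges_a, unmapped = translate_edges(interactions, b_to_a)
partner_terms = neighbour_terms(edges_a, go_annotations)
```

Reporting the unmapped identifiers, rather than silently dropping them, makes it easier to see how much data is lost when the cross-reference table is incomplete.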

The adoption of different naming conventions may stem from various reasons, such as legacy or the nature of the data being referenced (e.g. sequences versus genes). Nonetheless, this poses problems in PFP when we need to combine data from different sources. Some of these databases provide cross-referencing tables, but these are often incomplete and out of date. To address the issues of incompleteness and redundancy in cross-referencing genes and proteins, resources such as the International Protein Index (IPI) [45] and the UniProt Universal Protein Resource [46] have been developed. UniProt provides a unique identifier for every distinct protein sequence, while IPI provides a unique identifier for every distinct annotated protein. Efforts have also been made to provide services for cross-referencing genes/proteins between different databases. Here we briefly describe some of these.

MatchMiner [47] provides a set of tools that translate between the different identifiers of a gene. These include interactive lookup for one gene, batch lookup for multiple genes, and the merging of two lists of genes under different identifier systems to identify which identifiers refer to the same genes. The tools can be accessed through the web page at http://discover.nci.nih.gov/matchminer/index.jsp. A command-line version implemented in Java is also available at http://discover.nci.nih.gov/matchminer/command.jsp. MatchMiner covers only genes from the H. sapiens (human) and M. musculus (mouse) genomes.
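The idea of merging two gene lists held under different identifier systems can be sketched as follows. This is not MatchMiner's implementation. It is a minimal illustration that assumes an alias table is already available locally. The table `alias_records`, the identifiers in it, and the functions `build_alias_index` and `merge_lists` are all hypothetical.

```python
def build_alias_index(records):
    """Map every known alias (case-insensitive) to a canonical gene key."""
    index = {}
    for canonical, aliases in records.items():
        for alias in aliases | {canonical}:
            index[alias.upper()] = canonical
    return index


def merge_lists(list_a, list_b, index):
    """Pair entries of two identifier lists that resolve to the same gene."""
    resolved_a = {}
    for ident in list_a:
        key = index.get(ident.upper())
        if key is not None:
            resolved_a.setdefault(key, []).append(ident)
    matches = []
    for ident in list_b:
        key = index.get(ident.upper())
        if key in resolved_a:
            matches.extend((a, ident) for a in resolved_a[key])
    return matches


# Hypothetical alias table: canonical key -> set of aliases in other systems.
alias_records = {
    "GENE1": {"SYM1", "ACC:100"},
    "GENE2": {"SYM2", "ACC:200"},
}
index = build_alias_index(alias_records)

symbols = ["sym1", "SYM2", "SYM3"]             # list under one system
accessions = ["ACC:200", "ACC:100", "ACC:999"]  # list under another system
pairs = merge_lists(symbols, accessions, index)
```

Identifiers that resolve to no canonical key (here `SYM3` and `ACC:999`) simply produce no pair; a fuller tool would report them separately.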

AliasServer [48] provides translation services between the aliases of a protein under different identifier systems. AliasServer can be accessed through a web interface at http://cbi.labri.fr/outils/alias/ and also provides a web service that can be accessed via the Simple Object Access Protocol (SOAP). Details on how to access the web service are provided at http://cbi.labri.fr/outils/alias/API_SOAP.html, with examples using Perl. At the time of writing, AliasServer covers genes from 29 genomes.

The Protein Identifier Cross-Referencing (PICR) service [49] is another service for translating between different gene identifier systems. One distinctive feature of PICR compared with the others described in this section is that the service accepts not only gene identifiers but also protein sequences as input. PICR also does not require the user to specify the types of identifier systems to translate between, which may result in ambiguities when an identifier refers to different proteins in different systems. PICR can be accessed via http://www.ebi.ac.uk/Tools/picr/, and also provides a web service via SOAP. Details on using the PICR web service are available at http://www.ebi.ac.uk/Tools/picr/WSDLDocumentation.do, with examples using the Java API for XML Web Services (JAX-WS). At the time of writing, PICR covers genes from 47 genomes.
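The ambiguity that arises when the identifier system is left unspecified can be illustrated with a small local lookup. This sketch does not call PICR or reproduce its behaviour. The systems `sysA` and `sysB`, the identifiers, and the functions `build_index` and `resolve` are assumptions made purely for illustration.

```python
from collections import defaultdict

# Hypothetical entries: (identifier system, identifier) -> internal protein key.
ENTRIES = {
    ("sysA", "P100"): "protein-1",
    ("sysB", "P100"): "protein-2",   # same string, different protein
    ("sysA", "Q200"): "protein-3",
    ("sysB", "R300"): "protein-3",   # different strings, same protein
}


def build_index(entries):
    """Index entries by identifier string alone, keeping the source system."""
    by_identifier = defaultdict(set)
    for (system, identifier), protein in entries.items():
        by_identifier[identifier].add((system, protein))
    return by_identifier


def resolve(identifier, index, system=None):
    """Return candidate proteins for an identifier.

    If `system` is None, all systems are searched, so more than one
    candidate may be returned; the caller must treat that as ambiguous.
    """
    candidates = index.get(identifier, set())
    if system is not None:
        candidates = {(s, p) for s, p in candidates if s == system}
    return {p for _, p in candidates}


index = build_index(ENTRIES)
ambiguous = resolve("P100", index)                # system not specified
specific = resolve("P100", index, system="sysB")  # system specified
```

Here `ambiguous` holds two candidate proteins while `specific` holds one, which is why naming the identifier system, when it is known, removes the ambiguity.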
