Biomedical Engineering Reference
In-Depth Information
can be downloaded and used may require these annotation data to be provided separately.
If you wish to implement your own function prediction application based on GO, you will
also need to obtain the scheme and annotation data files.
9.2.2 Working with multiple protein identifier systems
To predict functions for proteins, we often need to utilize information from more than one
database. For example, to predict functions for proteins from the S. cerevisiae genome,
we may use functional annotations from GO, protein sequences from the Comprehensive
Yeast Genome Database (CYGD) or the SGD [39] and protein-protein interactions from
the Biomolecular Interaction Network Database (BIND) [43] or the Biological General
Repository for Interaction Datasets (BioGRID) [44]. These databases may refer to the same
genes/proteins using different identifiers.
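As a minimal sketch of the problem, consider functional annotations keyed by one identifier scheme and interaction pairs keyed by another. The identifiers below (`SYS_*`, `ACC_*`), the cross-reference table and the function name `translate_interactions` are all hypothetical and chosen only for illustration. The GO term IDs are attached to made-up proteins and carry no biological claim. The sketch assumes a simple one-to-one table and deliberately leaves one identifier unmapped to show how gaps can be reported rather than guessed.

```python
# Hypothetical identifiers: "SYS_*" stands in for one database's naming
# scheme and "ACC_*" for another's; none are real accessions.
annotations = {            # functional annotations keyed by scheme A
    "SYS_0001": {"GO:0006412"},
    "SYS_0002": {"GO:0006096"},
}
interactions = [           # interaction pairs keyed by scheme B
    ("ACC_9001", "ACC_9002"),
    ("ACC_9002", "ACC_9003"),
]
xref = {                   # cross-reference table B -> A (incomplete on purpose)
    "ACC_9001": "SYS_0001",
    "ACC_9002": "SYS_0002",
}


def translate_interactions(pairs, mapping):
    """Rewrite interaction pairs into the target identifier scheme.

    Returns (translated_pairs, unmapped_ids). Pairs with an unmapped
    partner are dropped rather than guessed.
    """
    translated, unmapped = [], set()
    for a, b in pairs:
        missing = {x for x in (a, b) if x not in mapping}
        if missing:
            unmapped |= missing
            continue
        translated.append((mapping[a], mapping[b]))
    return translated, unmapped


pairs, unmapped = translate_interactions(interactions, xref)

# Only translated pairs can be joined with the annotation data.
annotated_pairs = [(a, b, annotations[a], annotations[b]) for a, b in pairs]
```

Keeping the set of unmapped identifiers makes the incompleteness of the table visible, so that missing interactions are not silently mistaken for absent ones.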
Different naming conventions arise for various reasons, such as
legacy practice or the nature of the data being referenced (e.g. sequences versus genes). Nonetheless,
they pose problems in PFP whenever data from different sources must be combined.
Some of these databases provide cross-referencing tables, but these
are often incomplete and out of date. To address the issues of incompleteness and
redundancy in cross-referencing genes and proteins, resources such as the International
Protein Index (IPI) [45] and the UniProt Universal Protein Resource [46] have been
developed. UniProt provides a unique identifier to every distinct protein sequence, while
IPI provides a unique identifier for every distinct annotated protein. Efforts have also been
made to provide services for cross-referencing genes/proteins between different databases.
Here we briefly describe some of these.
MatchMiner [47] provides a set of tools that translate between the different identifiers
of a gene. These include interactive lookup for one gene, batch lookup for multiple
genes, and the merging of two lists of genes under different identifier systems to identify
which identifiers refer to the same genes. The tools can be accessed through the web page at
http://discover.nci.nih.gov/matchminer/index.jsp. A command-line version implemented in
Java is also available at http://discover.nci.nih.gov/matchminer/command.jsp. MatchMiner
covers only genes from the H. sapiens (human) and M. musculus (mouse) genomes.
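The idea of merging two gene lists held under different identifier systems can be sketched as follows. This is not MatchMiner's implementation or interface. It is a minimal illustration that assumes each system's identifiers can be mapped to a shared canonical key. The function name `merge_gene_lists`, the identifiers and both mapping tables are hypothetical.

```python
from collections import defaultdict


def merge_gene_lists(list_a, list_b, to_canonical_a, to_canonical_b):
    """Pair up identifiers from two lists that resolve to the same gene.

    to_canonical_a and to_canonical_b map each system's identifiers to a
    shared canonical key. Returns (matches, only_a, only_b).
    """
    # Index list B by canonical key; identifiers without a mapping are skipped.
    by_key = defaultdict(list)
    for ident in list_b:
        key = to_canonical_b.get(ident)
        if key is not None:
            by_key[key].append(ident)

    matches, only_a, matched_keys = [], [], set()
    for ident in list_a:
        key = to_canonical_a.get(ident)
        if key is not None and key in by_key:
            matches.extend((ident, other) for other in by_key[key])
            matched_keys.add(key)
        else:
            only_a.append(ident)

    # Anything in B that is unmapped or was never matched stays unpaired.
    only_b = [i for i in list_b if to_canonical_b.get(i) not in matched_keys]
    return matches, only_a, only_b


# Hypothetical identifiers and mappings, for illustration only.
symbols = ["SYM_ALPHA", "SYM_BETA", "SYM_GAMMA"]
numeric_ids = ["ID_101", "ID_102", "ID_999"]
symbol_to_gene = {"SYM_ALPHA": "G1", "SYM_BETA": "G2", "SYM_GAMMA": "G3"}
numeric_to_gene = {"ID_101": "G1", "ID_102": "G2", "ID_999": "G9"}

matches, only_a, only_b = merge_gene_lists(
    symbols, numeric_ids, symbol_to_gene, numeric_to_gene
)
```

Returning the unmatched identifiers from both lists alongside the matches lets the user see which genes could not be reconciled between the two systems.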
AliasServer [48] provides translation services between the aliases of a protein under
different identifier systems. AliasServer can be accessed through a web interface at
http://cbi.labri.fr/outils/alias/ and also provides a web service that can be accessed via the Simple
Object Access Protocol (SOAP). Details on how to access the web service are provided
at http://cbi.labri.fr/outils/alias/API_SOAP.html, with examples using Perl. At the time of
writing, AliasServer covers genes from 29 genomes.
The Protein Identifier Cross-Referencing (PICR) service [49] is another service for
translation between different gene identifier systems. One distinct feature of PICR
compared with the others described in this section is that the service can take not only
gene identifiers, but also protein sequences as input. PICR also does not require the
user to specify the types of identifier systems to translate between; this convenience may result in
ambiguities when an identifier refers to different proteins in different systems. PICR can be
accessed via http://www.ebi.ac.uk/Tools/picr/, and also provides a web service via SOAP.
Details on using the PICR web service are available at
http://www.ebi.ac.uk/Tools/picr/WSDLDocumentation.do, with examples using the Java API for XML Web Services
(JAX-WS). At the time of writing, PICR covers genes from 47 genomes.
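The ambiguity that can arise when the identifier system is left unspecified, and the option of querying by sequence, can be illustrated with a small local lookup. This sketch does not call PICR or reproduce its interface. The record table, the system names (`DB_ONE`, `DB_TWO`), the identifiers, the short sequences and the function `resolve` are all hypothetical.

```python
from collections import defaultdict

# (system, identifier, internal protein key); all values are hypothetical.
records = [
    ("DB_ONE", "P100", "PROT_1"),
    ("DB_TWO", "P100", "PROT_2"),   # same string, different protein
    ("DB_TWO", "Q200", "PROT_1"),
]
sequences = {"PROT_1": "MKTAYIAK", "PROT_2": "MSLLTEVE"}

# Index by identifier alone (system kept alongside) and by sequence.
by_identifier = defaultdict(set)
for system, ident, protein in records:
    by_identifier[ident].add((system, protein))
by_sequence = {seq: prot for prot, seq in sequences.items()}


def resolve(query, system=None):
    """Return the set of proteins that a query could refer to.

    The query is tried first as an identifier, then as a sequence.
    Passing a system restricts identifier hits to that system.
    """
    hits = by_identifier.get(query)
    if hits:
        return {p for s, p in hits if system is None or s == system}
    protein = by_sequence.get(query.strip().upper())
    return {protein} if protein else set()


ambiguous = resolve("P100")                    # two candidate proteins
specific = resolve("P100", system="DB_TWO")    # narrowed by naming the system
from_sequence = resolve("mktayiak")            # sequence used as the query
```

When more than one protein is returned, the caller needs extra context, such as the source system, to choose between the candidates.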