Biomedical Engineering Reference
[13], with access directly through JDBC or ODBC database connections,
or via web service interfaces (in our ChemBioGrid architecture). All of the
data and services resided on the same machine, making access easier. Our
initial version of Chem2Bio2RDF preserved this architecture, keeping the
data in a relational database and using the D2R tool to provide an RDF
SPARQL interface to the relational data set. However, this proved to have
limitations that severely restricted the utility of Chem2Bio2RDF: namely
(1) cross-data set searching is difficult to implement; and (2) it is hard to
embed an ontology. So after initial testing we moved to use Virtuoso
Triple Store as our basic representation, which is a true RDF-triple store,
thus allowing all the data to be treated as a graph, enabling easy cross-
data set searching, and permitting the later development of a
Chem2Bio2RDF ontology. For migration, we used D2R to generate RDF
for all the relations in PostgreSQL, and then exported them into the triple
store. We also found that Virtuoso was much more efficient than the
PostgreSQL/D2R implementation: as D2R eventually searches the data
by SQL, speed of searching is highly dependent on the structure of the
tables and associated indices. Additionally, Virtuoso provides a web
interface for data management, and provides a REST web service allowing
the data to be searched from other endpoints.
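As a rough illustration of the web service access described above, the following sketch builds a SPARQL request over HTTP for a cross-data set query. The endpoint URL, the `ex:` prefix, and the predicate names `ex:bindsTo` and `ex:participatesIn` are hypothetical placeholders, not the actual Chem2Bio2RDF vocabulary; only the `query` parameter and the results media type come from the standard SPARQL protocol.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint address (an assumption, not taken from the text).
ENDPOINT = "http://localhost:8890/sparql"

# Hypothetical vocabulary: prefix and predicates are illustrative only.
# The query follows links across two data sets in a single graph pattern,
# which is the kind of cross-data set search a triple store makes easy.
QUERY = """
PREFIX ex: <http://example.org/chem2bio2rdf/>
SELECT ?compound ?target ?pathway WHERE {
  ?compound ex:bindsTo ?target .
  ?target ex:participatesIn ?pathway .
} LIMIT 10
"""


def build_sparql_request(endpoint: str, query: str) -> Request:
    """Build (but do not send) a SPARQL protocol GET request."""
    url = endpoint + "?" + urlencode({"query": query})
    # Ask for the standard SPARQL JSON results format.
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_sparql_request(ENDPOINT, QUERY)
```

Passing `request` to `urllib.request.urlopen` would send the query to a running endpoint; the sketch stops short of that so it can be read without a server.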
Both D2R and the triple store support searching with
SPARQL, but SPARQL has domain limitations: most notably, it does not
provide cheminformatics- or bioinformatics-specific searching capabilities
such as similarity searching, substructure searching, or protein similarity
searching. We thus had to extend SPARQL to allow such queries. This
was done using the open source Jena ARQ [14] with cheminformatics
functionality from the Chemistry Development Kit (CDK) and ChemBioGrid,
and bioinformatics functionality from BioJava.
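To make concrete the kind of computation a similarity-search extension has to perform, here is a minimal sketch of a fingerprint-based similarity search. It assumes the Tanimoto coefficient over sets of fingerprint bit positions, a common choice in cheminformatics; the text does not state which measure was used, and the compound names and fingerprints below are toy data. In the system described, fingerprints would come from a toolkit such as CDK and the function would be exposed to SPARQL through Jena ARQ, neither of which is shown here.

```python
from typing import Dict, FrozenSet, List, Tuple


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto coefficient: shared bits divided by total distinct bits."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(a | b)


def similarity_search(
    query_fp: FrozenSet[int],
    library: Dict[str, FrozenSet[int]],
    threshold: float,
) -> List[Tuple[str, float]]:
    """Return (name, score) pairs scoring at least `threshold`, best first."""
    scored = [(name, tanimoto(query_fp, fp)) for name, fp in library.items()]
    hits = [pair for pair in scored if pair[1] >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Toy fingerprints (sets of "on" bit positions); purely illustrative.
LIBRARY = {
    "cmpd_a": frozenset({1, 2, 3, 4}),
    "cmpd_b": frozenset({1, 2, 5}),
    "cmpd_c": frozenset({7, 8}),
}

hits = similarity_search(frozenset({1, 2, 3}), LIBRARY, threshold=0.5)
```

A SPARQL extension function would play the role of `tanimoto` inside a query filter, so that graph patterns and chemical similarity can be combined in one search.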
18.3.2 Where data is stored
We opted to use a single medium-performance server (a Dell R510 with
four quad-core Xeon processors, 1 TB storage) to store all of the data in
our triple store and also to provide searching capabilities. Thus far, this
has provided good real-time access to the data assuming no more than
two searches are being performed at the same time. Keeping all of the
data in one location (versus federating searches out) has proved useful in
permitting query security (i.e. queries are not broadcast outside our
servers), but a concern is that if we expand Chem2Bio2RDF to new, much
larger data sources, such as those from Genome-Wide Association Studies or
 