Chem2Bio2RDF: a semantic resource for systems chemical biology and drug discovery - Open Source Software in Life Science Research

[13], with access directly through JDBC or ODBC database connections,

or via web service interfaces (in our ChemBioGrid architecture). All of the

data and services resided on the same machine, making access easier. Our

initial version of Chem2Bio2RDF preserved this architecture, keeping the

data in a relational database and using the D2R tool to provide an RDF

SPARQL interface to the relational data set. However, this proved to have

limitations that severely restricted the utility of Chem2Bio2RDF: namely

(1) cross-data set searching is difficult to implement; and (2) it is hard to

embed an ontology. After initial testing, we therefore moved to the Virtuoso

Triple Store as our basic representation; it is a true RDF triple store,

thus allowing all the data to be treated as a graph, enabling easy cross-

data set searching, and permitting the later development of a

Chem2Bio2RDF ontology. For migration, we used D2R to generate RDF

for all the relations in PostgreSQL, and then exported them into the triple

store. We also found that Virtuoso was much more efficient than the

PostgreSQL/D2R implementation: as D2R eventually searches the data

by SQL, speed of searching is highly dependent on the structure of the

tables and associated indices. Additionally, Virtuoso provides a web

interface for data management, and provides a REST web service allowing

the data to be searched from other endpoints.
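
As a rough illustration of what cross-data set searching through a REST SPARQL interface can look like, the sketch below builds (but does not send) an HTTP request that carries a query joining two data sets in one graph pattern. The endpoint URL and the `ex:` predicates are hypothetical placeholders, not the actual Chem2Bio2RDF address or schema; only the convention of passing the query text in a `query` URL parameter comes from the standard SPARQL protocol.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint URL; substitute the address of a real SPARQL service.
ENDPOINT = "http://example.org/sparql"

# Hypothetical vocabulary: the ex: predicates below are placeholders and
# are NOT the real Chem2Bio2RDF schema.
QUERY = """
PREFIX ex: <http://example.org/schema#>
SELECT ?compound ?target ?pathway
WHERE {
  ?compound ex:bindsTo ?target .
  ?target   ex:participatesIn ?pathway .
}
LIMIT 10
"""


def build_sparql_request(endpoint: str, query: str) -> Request:
    """Build a GET request that passes the query text in the 'query'
    URL parameter and asks for JSON-formatted results."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_sparql_request(ENDPOINT, QUERY)
# To execute it against a live endpoint one would call
# urllib.request.urlopen(request); that step is omitted here.
```

Because all triples sit in one graph, the two triple patterns can draw on different source data sets without any explicit join logic in the client.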

Both D2R and the triple store support searching with

SPARQL, but SPARQL has domain limitations: most notably, it does not

provide cheminformatics- or bioinformatics-specific searching capabilities

such as similarity searching, substructure searching, or protein similarity

searching. We thus had to extend SPARQL to allow such queries. This

was done using the open source Jena ARQ [14] with cheminformatics

functionality from the Chemistry Development Kit (CDK) and ChemBioGrid,

and bioinformatics functionality from BioJava.
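
To give a sense of the kind of computation a similarity-search extension adds on top of plain graph matching, here is a minimal sketch of Tanimoto similarity filtering. It is not the authors' implementation, which was built on Jena ARQ, CDK, and BioJava in Java; the fingerprints are stand-ins (sets of "on" bit positions), and the function names, compound identifiers, and threshold are assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of on-bits:
    size of the intersection divided by size of the union."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(a | b)


def similarity_search(
    query_fp: Set[int],
    library: Dict[str, Set[int]],
    threshold: float = 0.7,
) -> List[Tuple[str, float]]:
    """Return (compound id, score) pairs scoring at or above the
    threshold, ordered from most to least similar."""
    scored = [(cid, tanimoto(query_fp, fp)) for cid, fp in library.items()]
    hits = [(cid, score) for cid, score in scored if score >= threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Toy library with made-up identifiers and fingerprints.
toy_library = {"c1": {1, 2, 3}, "c2": {2, 3, 4}, "c3": {9}}
hits = similarity_search({1, 2, 3}, toy_library, threshold=0.5)
```

In an extended query language, a function of this kind would be exposed as a filter so that a single query can combine graph patterns with a similarity constraint.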

18.3.2 Where data is stored

We opted to use a single medium-performance server (a Dell R510 with

four quad-core Xeon processors, 1 TB storage) to store all of the data in

our triple store and also to provide searching capabilities. Thus far, this

has provided good real-time access to the data assuming no more than

two searches are being performed at the same time. Keeping all of the

data in one location (versus federating searches out) has proved useful in

permitting query security (i.e. queries are not broadcast outside our

servers) but a concern is that if we expand Chem2Bio2RDF to new, much

larger data sources such as from Genome Wide Association Studies or
