Biomedical Engineering Reference
In-Depth Information
Unfortunately, producing the haystacks of data has turned out to be more
straightforward than knowing how to sift through them, or even knowing
if the needles are there or how to identify them when they are found. The
2000s saw this question move beyond pharmaceutical companies into the
public arena, as huge volumes of data, particularly about chemical
compounds and their biological activities, became available in the public
domain, including such resources as PubChem, ChemSpider, and ChEMBL.
The needle-in-a-haystack analogy hides a subtler issue: that although
most public data sets are centered on chemical or biological entities
(compounds, genes, and so on), the most useful insights often lie in the
relationships between these entities. Although some sets, such as
ChEMBL, do represent a constrained set of relationships (e.g. between
compounds and targets), these are not networked to other kinds of
relationships, and so wider patterns cannot be seen. Yet it is these patterns
that must be key to understanding the systematic effects of drugs on the
body. A more appropriate analogy than haystacks is the Ishihara color
blindness test, in which to find the hidden patterns one has to look at the
whole picture with the right set of lenses.
It is with this in mind that a research project at Indiana was developed
to prototype new ways of representing publicly available entities and
relationships as large-scale integrated sets, and new ways of data-mining
them (new lenses in the analogy) to reveal the hidden patterns. Since this
research began in 2005, many new relevant technologies have come to the
fore (particularly in the area of the Semantic Web), but the problems have
remained the same: finding ways to integrate public data sets intelligently;
providing a common access and computation interface; developing tools
that can find patterns across data sets; and developing new methodologies
that make these tools applicable in real drug discovery problems.
Our initial work involved the development of ChemBioGrid [1], an
open infrastructure of web services and computational tools for drug
discovery, operating at the interface of cheminformatics and
bioinformatics. This infrastructure allowed the quick development of
new kinds of tools that integrated both cheminformatics and
bioinformatics applications. However, this did not address the problem
of data integration, which is addressed by the topic of this chapter,
Chem2Bio2RDF. Data integration is a difficult problem [2, 3], as by
definition it involves heterogeneous data sets that have often been
developed by different disciplines and groups, each with their own
terminology and ways of representing data. Traditionally, integration has
been achieved using relational databases, a tortuous manual process that
involves complex, formalized relational schemas. The new technology