Biomedical Engineering Reference
28.4 ONTOLOGIES AND COLLABORATIONS
The biomedical research community and specifically those involved in
neglected disease research are generating very large data sets facilitated
through high-throughput screening (HTS) [40-42]. Although large HTS data
sets and low-throughput screening results have become available in the public
domain (PubChem [40, 41], ChEMBL [42], Psychoactive Drug Screening
Program (PDSP) [43, 44], ChemBank [45, 46], Collaborative Drug Discovery
(CDD) [3, 47], and others), these data sets are not well standardized,
experimental metadata are poorly annotated, and much of the relevant
information is often only available as free text (particularly in PubChem).
This presents impending informatics challenges for selection of hit compounds
and follow-up studies as well as computational analysis of such data or the
development
of predictive models. The lack of established and formal standards to annotate
the publicly available screening data also limits their integration with other
structured data sources such as biological pathways, human disease, or adverse
drug effects. A related challenge is knowledge and data representation.
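To make the standardization problem concrete, the following minimal Python sketch maps free-text result labels onto a small controlled vocabulary so that records from different depositors can be grouped before analysis. It is an illustration only: the vocabulary, the label variants, and the function names (CONTROLLED_TERMS, map_label, group_by_term) are hypothetical and are not taken from PubChem or any other resource named above.

```python
import re

# Hypothetical controlled vocabulary: canonical term -> known free-text variants.
# These entries are illustrative assumptions, not an established standard.
CONTROLLED_TERMS = {
    "IC50": {"ic50", "ic 50", "ic-50"},
    "percent inhibition": {"% inhibition", "percent inhibition", "inhibition (%)"},
}


def _normalize(label):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", label.strip().lower())


# Reverse lookup table: normalized variant -> canonical term.
VARIANT_TO_TERM = {
    _normalize(variant): term
    for term, variants in CONTROLLED_TERMS.items()
    for variant in variants
}


def map_label(label):
    """Return the canonical term for a free-text label, or None if unknown."""
    return VARIANT_TO_TERM.get(_normalize(label))


def group_by_term(labels):
    """Group labels by canonical term; collect labels that cannot be mapped."""
    grouped = {}
    unmapped = []
    for label in labels:
        term = map_label(label)
        if term is None:
            unmapped.append(label)
        else:
            grouped.setdefault(term, []).append(label)
    return grouped, unmapped
```

Calling group_by_term(["ic50", "IC-50", "% Inhibition", "potency"]) groups the first two labels under "IC50", the third under "percent inhibition", and returns "potency" as unmapped. A plain synonym table like this does not scale to thousands of depositor-defined labels and carries no formal semantics, which is the gap that a shared, formally defined terminology is meant to close.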
One way to ensure that such data retain their utility and accessibility is
through an ontology [43, 44] (see also Chapters 12 and 21). An ontology is a
formal, explicit description of a subject domain (a conceptualization) as
classes, individuals, and their relationships and properties, used to
represent static knowledge [48]. Ontologies are one of the cornerstones of
Semantic Web technologies,
which have been proposed as solutions to data integration problems because
formally defined semantics and semantic knowledge representation make it
possible to track data provenance across different data sources that typically
use different descriptions and naming conventions [49]. The lack of a
standardized terminology with clear definitions can be a severe issue even
within an individual data source. In PubChem, for example, data from numerous
organizations and various experiments are deposited, and submissions typically
vary in the details that are reported for any data set (screening experiment),
the way the information is reported (how the data and the experimental details
are organized), and the type of results that are reported.
This is despite existing recommendations regarding the types of information
that should be reported for HTS assays [50]. For example, there are thousands
of unique endpoint names deposited in PubChem (as of December 2009 there
were over 12,000), many of which are redundant. Although there are two
endpoints which are required in all deposited assays (except summary assays),
activity outcome and activity score, there is no agreed-upon definition of
“active” or “inactive” or how the score is to be calculated. Instead, for each
assay submission the depositor can define a “local” meaning of activity outcome
and activity score. The lack of established standards and a semantic framework
to describe the assay experiment and the reported endpoints poses severe
limitations to computational analyses across multiple data sets and their
integration with other data sources. However, as far as PubChem is concerned,