Biomedical Engineering Reference
In-Depth Information
have been the basis of platforms to allow crowdsourced analysis, validation,
and annotation of the data. Examples from the world of astronomy are
GalaxyZoo (http://www.galaxyzoo.org/) and MoonZoo (http://www.moonzoo.org/),
while in chemistry the ChemSpider database, coincidentally established
by ChemZoo (http://www.chemspider.com) (see Chapter 22), is the preeminent
example. With regard to chemistry, the past five years have seen an explosion
in the availability of databases hosting chemical compound collections and
generally accessible via a cheminformatics platform allowing searching by
molecular structure. As a result of these efforts, chemistry information on the
Internet is becoming much more widely accessible, with numerous
chemical compound databases on the Web providing free access to molecular
structures and related data [15, 16]. However, there are multiple issues. As
previously described [17], these databases generally store chemical identifiers
in the form of chemical names (systematic and trade) and registry
numbers and, because they are assembled in a heterogeneous manner, the data can
be plagued with quality issues that affect downstream uses such as
computational modeling. We are aware of many databases that curate all
manner of information that might be of relevance to chemists involved in
biomedical research, from chemical vendor catalogs, to patents, to spectra of
various kinds. A recent article describes the public and commercial databases
of bioactive compounds [18] and concludes that the commercial efforts are
ahead of the public ones at this point in time, yet both are complementary.
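
One way to picture the identifier quality issues described above is to check whether a single name or registry number ends up attached to more than one structure once records from different depositors are merged. The following is a minimal sketch of such a check, not a description of how any of the databases mentioned here work. The function names, the record layout, and the sample records (including the placeholder structure keys and source labels) are all hypothetical and made up for illustration.

```python
from collections import defaultdict


def normalize(identifier: str) -> str:
    """Case-fold and collapse whitespace so trivially different spellings match."""
    return " ".join(identifier.split()).casefold()


def find_conflicts(records):
    """Return identifiers that are linked to more than one structure key.

    records: iterable of (identifier, structure_key, source) tuples.
    The structure key stands in for any canonical structure representation.
    """
    structures_by_identifier = defaultdict(set)
    for identifier, structure_key, _source in records:
        structures_by_identifier[normalize(identifier)].add(structure_key)
    return {
        identifier: sorted(keys)
        for identifier, keys in structures_by_identifier.items()
        if len(keys) > 1
    }


# Hypothetical, made-up records: these are not taken from any real database.
RECORDS = [
    ("Compound X", "STRUCT-A", "vendor-1"),
    ("compound  x", "STRUCT-B", "vendor-2"),  # same name, different structure
    ("Compound Y", "STRUCT-C", "vendor-1"),
    ("0000-00-0", "STRUCT-A", "vendor-3"),    # placeholder registry number
]

conflicts = find_conflicts(RECORDS)
for identifier, keys in conflicts.items():
    print(f"{identifier!r} maps to {len(keys)} structures: {', '.join(keys)}")
```

A check of this kind only flags candidates for review. Deciding which structure, if any, is correct for a flagged identifier still requires curation.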
5.4.1 PubChem
PubChem (http://pubchem.ncbi.nlm.nih.gov/), a molecule database launched in 2004
to support the “New Pathways to Discovery” component of the Roadmap for Medical
Research [19], is probably the most widely known of these databases,
yet it covers only a small fraction of the chemical universe. At present PubChem
is the informatics backbone for the Molecular Libraries and Imaging Initiative,
which is part of the NIH Roadmap [19]. PubChem presently contains almost
31 million unique structures with biological property information provided for
a fraction of the compounds. Although it is authoritative and built on an excellent
informatics platform with a well-resourced infrastructure, there are a
number of constraints and issues with PubChem. Specifically, it is a repository
of data and information and makes no special effort toward curating
the data, depending instead on the whims of the depositors to ensure the
quality and validity of the data. As a result, any errors in the data deposited
into PubChem may be, and already have been, transferred into other online
databases that treat PubChem as an authority. This in turn can impact the
research of others using computational models. The issues are not limited
to the validity of the chemical structures but extend, more generally, to the
structure-identifier relationships and the dictionaries that have been derived from
the data. As a simple example of structure-identifier errors, examination of
the list of identifiers associated with the simplest organic molecule in PubChem,