CHEMSPIDER: A PLATFORM FOR CROWDSOURCED COLLABORATION TO CURATE DATA DERIVED FROM PUBLIC COMPOUND DATABASES - Collaborative Computational Technologies for Biomedical Research

and element lookups in the formula. Other approaches include checking for

stereochemistry in the name but absence of stereochemistry in the structure,

and using name-to-structure conversion tools to convert names to structures

and look for ambiguity collisions. Although these automated approaches are

of value in assisting the validation of millions of identifiers, the most rigorous

checks, especially for trade names, come from visual inspection by

users of the ChemSpider database and the application of online curation tools.
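
One of the automated checks mentioned above, a name that carries stereochemistry while the structure does not, can be illustrated with a minimal sketch. The descriptor pattern, the use of SMILES as the structure format, and the function names are all assumptions made for this example; they are not ChemSpider's actual rules.

```python
import re

# Name fragments that usually signal stereochemistry, e.g. (R), (2R,3S),
# (E), (Z), (+), (-), cis-, trans-. Illustrative list only (assumption).
STEREO_DESCRIPTOR = re.compile(
    r"\((?:\d*[RSEZ],?)+\)"   # (R), (S), (2R,3S), (E), (Z)
    r"|\((?:\+|-)\)"          # (+), (-)
    r"|\b(?:cis|trans)-"      # cis-, trans-
)


def smiles_has_stereo(smiles: str) -> bool:
    """True if the SMILES string contains any stereo marker (@, / or \\)."""
    return any(ch in smiles for ch in "@/\\")


def flag_stereo_mismatch(name: str, smiles: str) -> bool:
    """Flag a name that implies stereochemistry when the structure has none."""
    return bool(STEREO_DESCRIPTOR.search(name)) and not smiles_has_stereo(smiles)


if __name__ == "__main__":
    print(flag_stereo_mismatch("(S)-alanine", "CC(N)C(=O)O"))      # True
    print(flag_stereo_mismatch("(S)-alanine", "C[C@H](N)C(=O)O"))  # False
```

A flag raised this way would only mark the identifier for closer inspection; it does not by itself establish that the name is wrong.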

ChemSpider users who wish to assist in curating the data are required to

register on the system so that potential vandalism of the data can be policed.

Curators use an intuitive approach to approve and remove identifiers: a

series of simple check boxes. Each such operation sends an e-mail to a

centralized master curator inbox for further checking by one or more master

curators, who can then approve or disallow the suggested validations of the

identifiers. A full tracking log of all such edits is maintained in the database.

Such curations are made to the database on a daily basis, and the quality of

the validated identifier dictionary improves incrementally as a result.
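
The two-stage workflow described above (a curator's action, review by a master curator, and a permanent log) might be modeled roughly as follows. This is a sketch under stated assumptions: the class names, fields, and status values are hypothetical and do not reflect ChemSpider's internal schema, and the e-mail notification is represented simply by the list of pending events.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class CurationEvent:
    """One curator action on one identifier (hypothetical schema)."""
    record_id: int
    identifier: str
    action: str                     # "approve" or "remove"
    curator: str
    status: str = "pending"         # becomes "accepted" or "rejected" on review
    reviewed_by: Optional[str] = None
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


class CurationLog:
    """Keeps every edit; pending events stand in for the master curator inbox."""

    def __init__(self) -> None:
        self.events: List[CurationEvent] = []

    def submit(self, record_id: int, identifier: str, action: str, curator: str) -> CurationEvent:
        if action not in ("approve", "remove"):
            raise ValueError(f"unknown action: {action!r}")
        event = CurationEvent(record_id, identifier, action, curator)
        self.events.append(event)   # nothing is ever deleted from the log
        return event

    def pending(self) -> List[CurationEvent]:
        return [e for e in self.events if e.status == "pending"]

    def review(self, event: CurationEvent, master_curator: str, accept: bool) -> None:
        event.status = "accepted" if accept else "rejected"
        event.reviewed_by = master_curator
```

Because reviewed events stay in `events`, the full history of who suggested and who confirmed each change remains available.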

As soon as names are validated, they are used afresh to query against the

integrated services associated with a chemical record so that new data will be

retrieved from PubMed, Google Patents, Google Scholar, and so on. For example,

a particular chemical record may initially have no

associated hits from PubMed, but approval of one or more identifiers

would then trigger a lookup against the appropriate Web service and immediately

retrieve a related hit list. These approaches carry a risk in that

different chemicals can have the same associated identifiers, so users should

be cautious and check the associated data. This case is particularly challenging

for abbreviations, though procedures have been instituted to limit such issues

as far as possible. The integration to search against external resources using

identifiers will be discussed in further detail later in this chapter.
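
As an illustration of how newly validated names could drive such a lookup, the sketch below builds a PubMed search URL for the public NCBI E-utilities ESearch endpoint. The text does not say which Web service ChemSpider calls, so the choice of endpoint, the function name, and the way names are combined with OR are assumptions for this example only.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities search endpoint (assumed here for illustration).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pubmed_query_url(validated_names, retmax=20):
    """Build an ESearch URL that matches any of the validated names."""
    if not validated_names:
        raise ValueError("at least one validated name is required")
    # Quote each name so that multi-word names are searched as phrases.
    term = " OR ".join(f'"{name}"' for name in validated_names)
    params = {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"}
    return f"{ESEARCH_URL}?{urlencode(params)}"


if __name__ == "__main__":
    print(pubmed_query_url(["aspirin", "acetylsalicylic acid"]))
```

Since only validated names enter the query, an ambiguous abbreviation that has not been approved never reaches the search term. An approved but ambiguous identifier would still return hits for other chemicals, which is the risk noted above.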

22.2.6.4 Physicochemical Data

Physicochemical data play a defining role

in the activity of chemical compounds through properties such as log P, log

D, and aqueous solubility, to name only a few. The pharmaceutical industry

uses such properties in its in silico screening approaches via the judicious

application of the Lipinski Rule of Five [70] and other such filters. When such

physicochemical data can be sourced as experimental data from databases,

they are captured and listed against the chemical records. Where possible, links

are retained to the original sources of the data so that they can be investigated

should there be any questions regarding the validity of the data.
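
The Rule of Five filter mentioned above can be sketched as a simple check on four properties, using the commonly cited thresholds (molecular weight 500, log P 5, 5 hydrogen-bond donors, 10 hydrogen-bond acceptors). The class and function names are hypothetical, tolerating a single violation is a common convention rather than something stated in this chapter, and the property values in the usage example are approximate and illustrative.

```python
from dataclasses import dataclass


@dataclass
class Properties:
    """Measured or predicted properties of one compound (hypothetical type)."""
    molecular_weight: float
    log_p: float
    h_bond_donors: int
    h_bond_acceptors: int


def rule_of_five_violations(p: Properties) -> list:
    """Return a label for each Rule of Five threshold the compound exceeds."""
    checks = {
        "molecular weight > 500": p.molecular_weight > 500,
        "log P > 5": p.log_p > 5,
        "H-bond donors > 5": p.h_bond_donors > 5,
        "H-bond acceptors > 10": p.h_bond_acceptors > 10,
    }
    return [label for label, violated in checks.items() if violated]


def passes_rule_of_five(p: Properties, max_violations: int = 1) -> bool:
    """Pass if the number of violations does not exceed max_violations."""
    return len(rule_of_five_violations(p)) <= max_violations


if __name__ == "__main__":
    aspirin = Properties(180.16, 1.2, 1, 4)   # approximate values
    print(rule_of_five_violations(aspirin), passes_rule_of_five(aspirin))
```

The same function works whether the inputs are experimental or predicted values, so the filter can be applied across records that lack measurements.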

For the majority of the ChemSpider database, such properties have not been

measured, and prediction algorithms are therefore used to predict them. The

list of predicted properties includes boiling point, flash point, log P, log D (at

two physiological pHs), number of rotatable bonds, number of proton donors,

number of proton acceptors, and other related properties. The ability to search

the entire database using such properties as filters has been enabled, and this

is an excellent way to narrow a particular structure set from a query when, for
