Chemistry Reference
In-Depth Information
There is some overhead in the use of indexes, constraints, triggers,
etc. as discussed here. The overhead is incurred when rows are inserted
or updated in the table. However, the value of this approach is that the
data in the table are well validated and can be searched more reliably
and efficiently. Direct lookups of canonical or stereo SMILES is simple
and quick because of the index on these columns. Using the fingerprint
column speeds up substructure search. Tautomers can be readily selected
using the column of simple graphs.
It might be helpful to delay the creation of the indexes when the
schema is first created and its tables populated. This is especially true if
millions of compounds are to be entered at one time. However, if there are
duplicate structures and the table contains even two rows with the same
isosmi , it will not be possible to create a unique index on the isosmi
column until only a unique set of isosmi values exists. The creation of
a unique index does not fix nonunique values. It simply prevents non-
unique values. In order to find duplicate structures in a table, the follow-
ing SQL can be used.
Select isosmiles, count(isosmiles) from structure
group by isosmiles having count(isosmiles) > 1;
The use of the group by clause causes all identical values of isosmiles
to be grouped together and processed by the aggregate function count .
For duplicate values of isosmiles , the count will be greater than 1. The
above SQL statement will select these isosmiles .
13.3 Experimental Chemical and
Biological Data Integration
The first section of this chapter showed how a schema of tables could be
used to create a compound registry. Using that schema, this section will
show how experimental data can be integrated with compound data. A
separate schema will be used to store the experimental data. In fact, several
schemas will be created in order to segregate data tables for separate assays
or projects. This is not essential, but is handy for browsing data tables in a
large database. Schemas are analogous to folders in a file system.
Each set of experimental data will likely have to be considered separately.
Some measurements may require several columns, while other will require
only one. Some measurements may have measured or estimated uncertain-
ties, while others may be more indefinite, such as “active” or “inactive.” All
measurements will have a date associated with them and this should be
stored. Many experimental values will be noninteger numbers. These may
be stored using the numeric data type or float or double, depending on the
accuracy needed and whether computations on these values will be carried
Search WWH ::




Custom Search