Applications - Design and Use of Relational Databases in Chemistry

Chemistry Reference

In-Depth Information

There is some overhead in the use of indexes, constraints, triggers,

etc. as discussed here. The overhead is incurred when rows are inserted

or updated in the table. However, the value of this approach is that the

data in the table are well validated and can be searched more reliably

and efficiently. Direct lookups of canonical or stereo SMILES is simple

and quick because of the index on these columns. Using the fingerprint

column speeds up substructure search. Tautomers can be readily selected

using the column of simple graphs.

It might be helpful to delay the creation of the indexes when the

schema is first created and its tables populated. This is especially true if

millions of compounds are to be entered at one time. However, if there are

duplicate structures and the table contains even two rows with the same

isosmi , it will not be possible to create a unique index on the isosmi

column until only a unique set of isosmi values exists. The creation of

a unique index does not fix nonunique values. It simply prevents non-

unique values. In order to find duplicate structures in a table, the follow-

ing SQL can be used.

Select isosmiles, count(isosmiles) from structure

group by isosmiles having count(isosmiles) > 1;

The use of the group by clause causes all identical values of isosmiles

to be grouped together and processed by the aggregate function count .

For duplicate values of isosmiles , the count will be greater than 1. The

above SQL statement will select these isosmiles .

13.3 Experimental Chemical and

Biological Data Integration

The first section of this chapter showed how a schema of tables could be

used to create a compound registry. Using that schema, this section will

show how experimental data can be integrated with compound data. A

separate schema will be used to store the experimental data. In fact, several

schemas will be created in order to segregate data tables for separate assays

or projects. This is not essential, but is handy for browsing data tables in a

large database. Schemas are analogous to folders in a file system.

Each set of experimental data will likely have to be considered separately.

Some measurements may require several columns, while other will require

only one. Some measurements may have measured or estimated uncertain-

ties, while others may be more indefinite, such as “active” or “inactive.” All

measurements will have a date associated with them and this should be

stored. Many experimental values will be noninteger numbers. These may

be stored using the numeric data type or float or double, depending on the

accuracy needed and whether computations on these values will be carried

Search WWH ::

Custom Search

Home