Chemistry Reference
In-Depth Information
relationship between pubchem.substance and pubchem.compounds
handled?
The column pubchem.substance.cid _ associations is taken
directly from the sdf files supplied by PubChem. It has all the neces-
sary information, but it is not in a proper form for a relation between
pubchem.substance and pubchem.compounds . This is because too
much information has been crammed into this column. For example, the
cid _ associations for substance _ id 22 contains the data “449653
1 449655 2 6540406 2”. This means that there are three compound ids asso-
ciated with this substance id. In other words, there is a many-to-many
relationship between compounds and substances. While it would be pos-
sible to parse the cid _ associations column when the compound id is
needed, it is better to have a clear relationship between substance ids and
compounds ids. It is better because it enforces and preserves the relational
integrity (or referential integrity) between these data. It also makes select-
ing data from all three data sources quicker and easier.
Another complexity is that the compound associations are classified
using a small number, for example the 1 and 2 in the cid _ associations
quoted above. These classifications might be called primary when the
number is 1, secondary when 2, etc. Trying to apply a parsing rule for data
encoded in a column is prone to error. There is no easy way to enforce the
format of data in this column.
There are several approaches to creating the relation between com-
pound and substance. One is to create an integer column, say, pubchem.
substance.cid that would contain only the primary compound id from
the column cid _ associations . This column becomes a foreign key
related to the pubchem.compound.cid column. This would form a proper
relation between the tables, but would neglect the secondary cid _ asso-
ciations . If those are of no interest, this approach is an excellent choice.
Another approach is to create multiple columns: one for the primary
compound id and others for the secondary, tertiary, etc. compound ids.
Each of these integer columns could serve as a foreign key and form a
proper relation to the pubchem.compound.cid column. This approach is
not recommended because the maximum number of compound ids in the
cid _ associations column is not known and could increase as more
data is added. In addition, the type of association, primary, secondary,
etc. would have to be neglected, stored in another column, or somehow
encoded in the new column names. This approach has too many draw-
backs to be acceptable.
The proper way to create a relation between pubchem.substance.
substance _ id and pubchem.compound.cid is to create a new table
that acts as an intermediary. This is a typical approach to handling many-
to-many relationships. This table must include a column for the com-
pound id and a column for the substance id. There can be as many rows
Search WWH ::




Custom Search