Chemistry Reference
In-Depth Information
By fragmenting 250 251 compounds from the NCI database, they found 65 612 fragments
of the three different types of ring systems, side-chains and linkers. This already yielded
useful information, for instance which ring systems occur and which do not, i.e. finding an
N6-ring to be nonexistent may complement some chemical common sense. In total, 13 509
ring systems were found, 18 015 side-chains, 9675 linkers with two ring systems, 2531
linkers with three ring systems and 2280 linkers with four or more ring systems (up to 18
ring systems). In general, larger ring systems or branches occurred less frequently. Almost
70% of the three types of fragments occurred only once in the database. Branches with a
higher number of attachment points seemed to have lower abundance. An exception to this
rule was formed by linkers with six, or multiples of six, attachment points. These linkers
occurred much more frequently than their neighbors did. Inspection revealed that these
linkers were symmetrical.
The co-occurrence of fragments was also analyzed, to see whether the occurrence of one
fragment in a molecule is related to the occurrence of another. This type of analysis can
be compared to studying the contents of a shopping basket in a supermarket, a so-called
Market Basket Analysis. Wine and olives may be frequently bought together, as are beer
and potato chips, whereas beer and olives might be rarely observed together. Market Basket
Analysis is a data-mining tool for finding regularities in the shopping behavior of customers
of supermarkets, online shops, etc. A stochastic experiment was conducted first, since for
frequently occurring fragments the chance is higher that a relationship is found, even if
there is none. A new 'NCI' database was simulated using fragments that occurred in 20 or
more molecules. Each fragment was used as many times as it occurred in molecules of the
real NCI. Fragments were randomly divided over virtual molecules in the new database
and each combination was counted. This process was repeated 1000 times, after which
the expected occurrence of each fragment pair was calculated, together with the standard
deviation of the occurrence. The expected occurrences were compared with actual co-
occurrences in the NCI. A significant difference between the simulated/expected and the
real co-occurrence implies that the fragments are correlated. The z-values were calculated
and compared to detect that correlation.
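
The randomization experiment described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the text does not say how large the virtual molecules were, so the sketch assumes each virtual molecule keeps the fragment count of a real one, and it assumes the z-value is the usual (observed - expected) / standard deviation. The function names (frequent_only, simulate_once, z_values) and the input layout (one list of fragment names per molecule) are hypothetical.

```python
import random
from collections import Counter
from itertools import combinations
from statistics import mean, pstdev


def frequent_only(molecules, min_count=20):
    """Keep only fragments that occur in at least `min_count` molecules."""
    freq = Counter(f for frags in molecules for f in set(frags))
    return [sorted(f for f in set(frags) if freq[f] >= min_count)
            for frags in molecules]


def count_pairs(molecules):
    """For each unordered fragment pair, count the molecules containing both."""
    counts = Counter()
    for frags in molecules:
        for pair in combinations(sorted(set(frags)), 2):
            counts[pair] += 1
    return counts


def simulate_once(molecules, rng):
    """Deal every fragment occurrence randomly over virtual molecules.

    Assumption (not stated in the text): virtual molecules keep the
    sizes of the real ones. Each fragment is used exactly as often as
    it occurs in the real data.
    """
    pool = [f for frags in molecules for f in frags]
    rng.shuffle(pool)
    virtual, start = [], 0
    for frags in molecules:
        virtual.append(pool[start:start + len(frags)])
        start += len(frags)
    return virtual


def z_values(molecules, n_rounds=1000, min_count=20, seed=0):
    """Return {pair: (observed, expected, z)} for all frequent fragment pairs."""
    rng = random.Random(seed)
    molecules = frequent_only(molecules, min_count)
    observed = count_pairs(molecules)
    simulated = [count_pairs(simulate_once(molecules, rng))
                 for _ in range(n_rounds)]
    fragments = sorted({f for frags in molecules for f in frags})
    result = {}
    for pair in combinations(fragments, 2):
        samples = [sim[pair] for sim in simulated]
        expected, sd = mean(samples), pstdev(samples)
        if sd > 0:  # skip pairs whose simulated count never varies
            z = (observed[pair] - expected) / sd
            result[pair] = (observed[pair], expected, z)
    return result
```

Calling z_values on a list of per-molecule fragment lists gives, for each pair, the observed count, the simulated expectation and the z-value. A large positive z marks a pair that co-occurs more often than chance would suggest, a large negative z a pair that 'avoids' each other. Enumerating all pairs is quadratic in the number of fragments, so a real database would need a sparser bookkeeping scheme.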
Table 8.2 presents some examples of fragment pairs that occur in the same molecule much
more or much less frequently than expected. In the first row of Table 8.2, tetrahydrofuran
and a CH2OH group are paired; they were expected to occur together 122 times, but
do so much more frequently, in 2292 molecules. This is 19 (2292/122) times more than
expected and very significantly different (z-value of 206) from the simulated database. The
explanation is that the combination is found in (substituted) nucleosides that have been
tested for anti-tumor activity. The second row presents another example of frequently
co-occurring fragments that represent a single structure class, viz. dihydrocholesterol analogues.
Interestingly, the situation is the opposite for the combination of a tetrahydrofuran and
a phenyl group, which was expected to occur in 2653 molecules. However, in the NCI there are
only 270 such instances, a factor of 0.10 (270/2653). Apparently, this combination
is underrepresented. A possible explanation for this effect might be that the 'avoiding'
fragments belong to different compound classes with little overlap. Typical members from one
class will be abundant in that class and scarce in others, adding to an overall reduction
in co-occurrence frequency. Similarly, typical members from the same class are prone to
be found together. Tetrahydrofuran-containing compounds generally differ in origin from