offers finding fragments with carbon chains of varying length. This can be useful for the
exploration of biochemical reactions where this length is less important. [ 16 ]
Interestingly, the four fragment miners mentioned above have been made available as
a single package named ParMol (Parallel Molecular Mining). [ 17 ] In addition to uniform
access to MoFa, gSpan, FFSM and Gaston, the authors included a 2D viewer for molecular
structures, parallel (multiprocessor) search, support for several file formats such as
SMILES and SDF, and a number of options to customize mining.
Other algorithms for frequent fragment mining that are more database-centric include
Molfea [ 18 ] and Warmr. [ 19 ] Molfea (Molecular Feature Miner) [ 18 ] is in essence an inductive
database framework. It finds patterns based on first-order logic. Molecules are encoded as
basic facts and queries result in a combination of facts. The fragments that can be searched
for or result from queries are linear sequences of non-hydrogen atoms and bonds. The fact
that Molfea only finds chains of atoms limits its usefulness since almost all molecules have
rings or branching points. Warmr [ 19 ] is a general-purpose Inductive Logic Programming
(ILP) data-mining tool for finding frequently occurring patterns in relational data. [ 20 ] [ILP
is a machine learning technique used for knowledge discovery. The purpose of ILP is
hypothesis generation, given some background knowledge and a set of positive and negative
examples. Examples and background knowledge are encoded as facts and rules
in a relational database. From this, possible hypotheses are generated through inductive
learning. Logic programming is used to represent examples, background knowledge and
hypotheses, in a uniform way.] ILP has been successfully applied to chemical data, for
instance to find frequent substructures in carcinogenic compounds. First, molecules are
described in a relational language. Atoms are related to molecules and to other atoms
through bonds. Algorithms such as Warmr perform multi-relational data mining, which
means they are capable of finding patterns that span across multiple relations. Warmr
searches the available patterns in a breadth-first manner, starting from the most general
relations and gradually increasing the level of complexity, to find patterns that are more
specific. Candidates for the next, more specific level are generated from the frequent patterns
only; nonfrequent patterns are pruned. Several meaningful relationships were reported when ILP
was applied to toxicity data. Although Warmr should be able to produce results identical to
those of the fragment miners, it inherits some of the drawbacks of ILP. First, a high
level of expertise is required to encode the molecules, i.e. their graphs and properties,
into relations that can be mined. Second, the complexity of the relations queried places high
demands on computing resources. [ 19 ]
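The level-wise idea described above can be illustrated with a minimal Python sketch. It is not the Warmr or Molfea implementation: it does not use first-order logic, and it is restricted to linear chains of non-hydrogen atoms and bonds (the kind of fragment Molfea handles). The dictionary encoding of atom and bond facts, the toy molecules and the function names are all assumptions made for illustration. Starting from single atoms, each level extends only those chains that were frequent at the previous level, so nonfrequent patterns are pruned before more specific candidates are generated.

```python
from collections import defaultdict

# Hypothetical toy molecules (hydrogens omitted), written as relational facts:
# "atoms" maps an atom id to its element, "bonds" lists (atom id, atom id, bond order).
ETHANOL = {"atoms": {1: "C", 2: "C", 3: "O"},
           "bonds": [(1, 2, "-"), (2, 3, "-")]}
ACETIC_ACID = {"atoms": {1: "C", 2: "C", 3: "O", 4: "O"},
               "bonds": [(1, 2, "-"), (2, 3, "="), (2, 4, "-")]}
PROPANOL = {"atoms": {1: "C", 2: "C", 3: "C", 4: "O"},
            "bonds": [(1, 2, "-"), (2, 3, "-"), (3, 4, "-")]}


def neighbours(mol):
    """Build an adjacency list {atom id: [(neighbour id, bond order), ...]}."""
    adj = defaultdict(list)
    for a, b, order in mol["bonds"]:
        adj[a].append((b, order))
        adj[b].append((a, order))
    return adj


def label(mol, path, orders):
    """Direction-independent label of a chain, e.g. 'C-C-O' for O-C-C."""
    parts = [mol["atoms"][path[0]]]
    for atom, order in zip(path[1:], orders):
        parts += [order, mol["atoms"][atom]]
    return min("".join(parts), "".join(reversed(parts)))


def frequent_chains(molecules, min_support, max_atoms=4):
    """Return {chain label: number of molecules containing it} for frequent chains."""
    adjacency = [neighbours(m) for m in molecules]
    # Level 1: every single atom is an embedding (path of atom ids, bond orders).
    embeddings = [[((a,), ()) for a in m["atoms"]] for m in molecules]
    result = {}
    for _ in range(max_atoms):
        # Count in how many molecules each chain label occurs at this level.
        support = defaultdict(set)
        for i, (mol, embs) in enumerate(zip(molecules, embeddings)):
            for path, orders in embs:
                support[label(mol, path, orders)].add(i)
        frequent = {lab: len(ids) for lab, ids in support.items()
                    if len(ids) >= min_support}
        if not frequent:
            break
        result.update(frequent)
        # Prune nonfrequent chains, then extend the survivors by one bonded atom.
        grown_all = []
        for mol, adj, embs in zip(molecules, adjacency, embeddings):
            grown = []
            for path, orders in embs:
                if label(mol, path, orders) not in frequent:
                    continue
                for nxt, order in adj[path[-1]]:
                    if nxt not in path:  # keep chains simple (no revisited atoms)
                        grown.append((path + (nxt,), orders + (order,)))
            grown_all.append(grown)
        embeddings = grown_all
    return result


if __name__ == "__main__":
    mols = [ETHANOL, ACETIC_ACID, PROPANOL]
    for chain, count in sorted(frequent_chains(mols, min_support=3).items()):
        print(chain, count)
```

Because every sub-chain of a frequent chain is itself frequent, extending only the surviving chains does not lose any frequent pattern. A real ILP system searches a much richer space of relational queries, which is where the computational cost mentioned above comes from.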
Common substructures. Fragments are also derived by comparing molecular structures.
For a pair of molecules, a number of substructures/fragments may exist that occur in both
structures. A 'common substructure' is a set of atoms that two molecules have in common.
Corresponding atoms should have the same atom type and the same topological distance to
other common atoms, in both molecules. The topological distance is the number of bonds
that form the shortest path between two atoms. The 'maximum common substructure'
(MCS) is a continuously bonded substructure that has the highest number of common
atoms. [ 21 ] Note that there may be multiple MCSs for a pair of molecules. Figure 8.5 shows an
example of the MCS of two molecules, the larger of which is the molecule from Figure 8.1.
The 'highest scoring common substructure' (HSCS) [ 21 ] is similar to the MCS, but also allows
discontinuous common substructures. Scores are based on the number of common atoms