Chemistry Reference
In-Depth Information
steps is exponentially proportional to graph size, which contributes to the inefficiency of
an algorithm. Therefore, most algorithms seek ways to avoid graph/subgraph isomorphism
tests as much as possible.
Starting from an empty fragment, all possible fragment extensions (refinements) are
generated, a process that will be explained below for the simple amino acid alanine. This
is done by recursively adding edges and nodes to already generated fragments. In case
of a ring closure, only an edge is added. Generated fragments are compared against the
graphs in the database to check whether they occur. New refinements can only appear
in those graphs that already hold the original fragment. Accordingly, the algorithm keeps
appearance lists to restrict isomorphism testing to the graphs in the lists only. The support
for a fragment is the proportion, or percentage, of graphs in the database in which it
occurs. Obviously, found fragments are more relevant if they occur in at least a given
minimum number/fraction of molecules. This minimum is called the minimum support
value. Fragments are discarded if they occur in fewer molecules than allowed for the
minimum support value, which is related to the significance of the found fragments. In
general, lower minimum support values will yield higher numbers of fragments. Choosing
a sufficiently high minimum support value will result in a comprehensible number of
fragments while mining is completed within a reasonable time-scale. By definition, the
support value of a fragment never exceeds the support values of the fragments it contains.
This restricts refinement generation further, starting only from fragments with sufficient
support (cf. A priori rule [ 6 ] ). To focus isomorphism testing, fragment-mining algorithms
may keep a mapping of the nodes and edges of a fragment to the corresponding nodes and
edges of the graph in which it occurs. This is known as an embedding.
To illustrate the process, let us consider a graphmining experiment on amolecule database
with alanine (Figure 8.3). For a single molecule in the database, such as alanine, a search
tree can be constructed of all possible fragments. Figure 8.4 shows all these fragments for
alanine with hydrogen atoms omitted as discussed before. On top is an empty fragment
and each following fragment is a substructure of its descendants below. Fragments on the
same level (six in total) have the same number of bonds (edges). For instance, the first level
contains the elements N, O and C, since these are the constituents of the molecule. The C-C
fragment on the second level forms the common core for the C-C-N, C-C-C, C-C
O and
C-C-O fragment on the third level. The arrows indicate the paths leading from an empty
fragment to the complete structure, yielding one extension at a time.
=
OH
H 2 N
O
Figure 8.3 Chemical structure of alanine. Implicit hydrogens are omitted.
There are two ways to travel the subgraph lattice as the entire scheme in Figure 8.4 is
called: breadth-first and depth-first. A breadth-first search considers all refinements at the
same level before advancing to the next. For Figure 8.4, this means stepping through the
lattice one row of fragments at a time. Storage requirements are proportional to the maximal
Search WWH ::




Custom Search