Full Perfect Extension Pruning for Frequent Subgraph Mining - Mining Complex Data

Information Technology Reference

In-Depth Information

requirement is not a necessary condition. As a closer inspection easily reveals,

extensions closing a ring (that is, extensions by an edge leading to a node that is

already in the base fragment) are also safe and thus can be allowed as candidates

for perfect extensions: since the destination node is already in the fragment, there

cannot be a problem with multiple ways of reaching it. Hence we can slightly

relax the constraints.

Note, however, that these relaxed constraints are still only su cient, but not

necessary. There are other situations in which an extension can be considered

perfect, even though it does not meet the abovementioned requirements. For

example, if an edge leads to a new node and is part of rings of the same size and

composition in all supporting graphs, it is harmless and thus can be considered

a perfect extension. Even the extension by a bond from the nitrogen atom to the

oxygen atom in Figure 11.5 can be considered a safe perfect extension, despite

the fact that the rings have different size. Unfortunately, checking necessary and

sucient conditions for (safe) perfect extensions is complicated and costly and

thus we confine ourselves to the rule that an extension edge must either be a

bridge in all database graphs or must close a ring (cycle) in all database graphs

in order to be considered perfect.

11.4

Canonical Codes for Graphs

As we already mentioned in the introduction, perfect extension pruning, as it

was described in the previous section, is not a problem unless canonical forms are

used to identify redundant fragments. However, since canonical forms are a lot

more elegant than, for example, a repository of already processed fragments, need

less memory and make it easier to parallelize the search, it is desirable to be able

to use perfect extension pruning together with canonical form pruning. In this

section we briefly review some fundamentals of canonical forms for (attributed)

graphs, which are necessary to know in order to understand the code word

reorganization we describe in Section 11.6.

The core idea of canonical forms of graphs is to describe an (attributed) graph

by a code word , which uniquely identifies it up to automorphisms, and from which

the graph can be reconstructed. The letters of such a code word describe the

edges of the graph and which nodes they connect as well as the node and edge

attributes (or labels). In order to capture the connection structure, the nodes

are numbered (or, more generally, endowed with unique labels), since the node

attributes are not enough to identify them uniquely: the same attribute may be

assigned to several nodes in a graph. Of course, there are several possible ways

of numbering the nodes, each of which gives rise to a different code word. In

principle, all of these code words are taken into account and the lexicographically

smallest (or greatest) code word is then defined to be the canonical code word .

Note, however, that due to the way in which code words are used in the search

(see below), the possible node numberings (and thus the possible code words)

one has to consider can actually be restricted to those compatible with traversals

Mining Complex Data

Search WWH ::

Custom Search

Home