Database Reference
In-Depth Information
4.3
Subgraphs
Discovering frequent subgraphs is a well-studied problem; however, there have been
very few works which specifically seek frequent or important subgraphs from a
stream of graphs.
In [ 11 ] Bifet et al. present what is believed to be the first algorithms for finding
frequent subgraphs in a data stream of graphs. The objective is to find the closed
subgraphs which occur at least as frequently as the minimum support threshold.
To reduce the memory and computational requirements, they process the stream
in batches and compute an approximate but compact coreset representation for the
set of frequent closed graphs. Rather than recording the support of each pattern,
they record the relative support, which is the difference in support between a closed
graph and an its nearest subgraph. They offer three algorithms, INCGRAPHMINER,
WINGRAPHMINER, and ADAGRAPHMINER. ADAGRAPHMINER is the best solution
when there is concept drift in the data stream. It maintains a sliding window which
adapts its width in response to the rate of concept drift, that is, whether the average
value within the window has changed by more than a threshold parameter. Using
the Hoeffdinger inequality, they guarantee that the number of false positive and false
negative errors is bounded. Using Galois lattice theory, Bifet develops a more general
methodology for identifying closed patterns in data streams [ 9 ].
In [ 5 ], the problem at hand is defined differently. Given a stream of graphs
G 1 , G 2 , ... which draw from a massive set of possible vertices, they detect the
frequent and significant patterns. In particular, since the graphs will be sparse, their
definition of significance draws upon two measures of density: node affinity and edge
density. Given a vertex set P , let its node affinity A ( P ) be the ratio of the number of
graphs which contain all the vertices of P compared to the number which contain
at least one vertex from P . The edge density D ( P ) is defined as follows: among
those graphs which fully contain P , count the average number of edges joining ver-
tices in P and divide by the | P 2 possible set of edges joining P . A subgraph P is
( θ , γ )-significant if A ( P )
γ .
The two-phase algorithm first computes patterns which have θ -affinity and then the
subset of those with γ -density. Both phases of the algorithm use a min-hash approach
which effectively summarizes the graph stream, estimating the probability that all
members of a set have a property if some subset of them have that property. This
transforms the node-affinity mining problem to a support-based mining problem.
The overall algorithm maintains a set of min-hash statistics which are updated as
new graphs stream in. The authors prove that their method enforces bounds on false
positive and false negative errors.
θ and D ( P )
5
Concluding Remarks
In this chapter, we have provided an overview of algorithms for frequent pattern min-
ing over data streams. The streaming and unbounded nature of the data restricts us to
a single pass over the data, which further complicates the already challenging task of
Search WWH ::




Custom Search