Mining Long Patterns - Frequent Pattern Mining

Database Reference

In-Depth Information

1. Stage I: Mining Spiders

Mine all r -spiders from the input graph G . By the end of this stage, all the frequent

patterns up to a diameter 2 r with all their embeddings in G are obtained.

2. Stage II: Large Pattern Identification

Randomly pick M spiders from all the spiders obtained in Stage I as the initial set

of frequent subgraphs. The next step consists of

D max

2 r iterations. In each iteration,

use SpiderGrow() to grow each of the M subgraphs by extending its boundary

with selected spiders such that the radius of the subgraph is increased by r . In each

iteration, if it is detected that two frequent subgraphs, whose embeddings are all

previously disjoint, begin to overlap on some of their embeddings as a result of

growth in this iteration, the algorithm would merge them if the resulting merged

subgraph is frequent. Note that one can avoid pair-wise checking for potential

merging because all patterns grow with spiders as units and one only has to

monitor the same spiders being used by different patterns to detect overlapping.

At the end of the

D max

2 r iterations, keep only those frequent subgraphs which are

generated as a result of merging at some iteration. Let the set kept be S . The

frequent subgraphs in S are believed to be subgraphs of long patterns with high

probability.

3. Stage III: Large Pattern Recovery

With high probability, each one of the top- K long patterns now has some portion

of it as a pattern in S . To recover the full patterns, grow each subgraph in S

by SpiderGrow() until no more frequent patterns can be found. All the patterns

discovered so far are maintained in a list sorted by their size. Return the top- K

patterns.

A key question here is that, in Stage II of SpiderMine , how to choose M , the

number of initial seed spiders, to achieve the discovery of top- K largest patterns

with guaranteed probability. If more than one spider within a pattern P are chosen

in the random drawing process, it is said that P is successfully identified. Denote as

P success the probability that all the top- K largest patterns are successfully identified.

In [ 33 ], the authors show the following lemma with detailed proof sketch.

Lemma 4.11

Given a network G and a user-specified K, we have P success

≥

) M K

V min

| V ( G ) |

V min is the minimum number of vertices in a long pattern required by users,

usually an easy lower bound that a user can specify. Now to compute M , we just

need to set 1

−

( M

1)(1

−

) M K

V min

| V ( G )

and solve for M . It follows that,

once the user specifies K and , we could compute M accordingly, and then if we

pick M spiders initially in the random drawing process, we are able to return the

top- K largest patterns with probability at least 1

−

( M

1)(1

−

. For example, with

0 . 1,

| V ( G )

10, and V min =

, we get M

85, which means to return top 10 largest

patterns(each of size at least | V ( G ) |

10 if any) with probability at least 90 %, we need to

randomly draw 85 spiders initially. With the analysis above, it is not hard to prove

the following theorem.

Frequent Pattern Mining

Search WWH ::

Custom Search

Home