Mining Long Patterns - Frequent Pattern Mining

Database Reference

In-Depth Information

from program structure data would also reveal software backbones which are critical

in analyzing large software packages and understanding legacy systems [ 15 , 17 ].

It should be noted that long patterns are often sought after with another important

constraint of frequency, which contributes to much of the mining complexity. In the

following, we would denote most of this chapter to large frequent patterns, which

are referred to simply as long patterns unless otherwise specified.

Despite its significance in many applications, long patterns are not easy to find.

The fact that most existing algorithms discover frequent patterns with increasing

sizes means that, before larger patterns can be identified, smaller ones would have

to be explored, which typically come in exponentially large numbers. This poses

serious challenges to mining algorithms as the swamp of smaller patterns that they

have to examine could easily prohibit them from ever reaching the large ones in a

reasonable amount of time. While the task is already hard in the case of item sets

and sequences, the extra dimension of structural complexity in graph pattern mining

exacerbates the situation.

In this chapter, we will first give the preliminaries of large frequent pattern mining,

and introduce a pattern lattice model to explain the various algorithms we later

present. We then categorise algorithms for mining long patterns by their mining

paradigm into three categories: mining by pattern enumeration, mining by pattern

merging and mining by pattern traversal with neighborhood adjacency.

Preliminaries

Large frequent pattern mining has been primarily studied in three data settings: item

sets, sequences and graphs, which we define accordingly as follows.

Item Sets Item sets presents the simplest setting of frequent pattern mining. Let

be a set of items

{

o 1 , o 2 , ... , o d }

. A nonempty subset of

is called an itemset .A

transaction dataset D is a collection of itemsets, D

t 1 , ... , t n }

, where t i

⊆ I

For any itemset α , we denote the set of transactions that contain α as D α ={

⊆

t i and t i

∈

}

. Define the cardinality of an itemset α as the number of items it

contains, i.e.,

|=|{

o i |

o i ∈

Definition 4.1 (Frequent Itemset) For a transaction dataset D, an itemset α is

frequent in D if | D α |

σ , where | D α |

| D |

is called the support of α in D, written s ( α ) ,

and σ is the minimum support threshold , 0

| D | ≥

1 .

Mining long patterns in item set setting is simply to find all frequent item sets

with cardinality greater than a user-specified threshold.

≤

Sequences The setting of sequences includes two related yet different cases: fre-

quent substrings and frequent subsequences, the latter being computationally much

more challenging than the former. Given a string S

s 1 , ...s n

of length n , another

z 1 ...z m

≤

string Z

, m

n is a subsequence of S if there exists a sequence of

indices I

i 1 , ... , i m

, i j <i j + 1 ,1

≤

j<m such that such that z j

s i j for all

≤ j

≤ m . We call Z a subsequence of S , denoted as S ⊆ Z . If, in particular, we

Frequent Pattern Mining

Search WWH ::

Custom Search

Home