Database Reference
In-Depth Information
from program structure data would also reveal software backbones which are critical
in analyzing large software packages and understanding legacy systems [ 15 , 17 ].
It should be noted that long patterns are often sought after with another important
constraint of frequency, which contributes to much of the mining complexity. In the
following, we would denote most of this chapter to large frequent patterns, which
are referred to simply as long patterns unless otherwise specified.
Despite its significance in many applications, long patterns are not easy to find.
The fact that most existing algorithms discover frequent patterns with increasing
sizes means that, before larger patterns can be identified, smaller ones would have
to be explored, which typically come in exponentially large numbers. This poses
serious challenges to mining algorithms as the swamp of smaller patterns that they
have to examine could easily prohibit them from ever reaching the large ones in a
reasonable amount of time. While the task is already hard in the case of item sets
and sequences, the extra dimension of structural complexity in graph pattern mining
exacerbates the situation.
In this chapter, we will first give the preliminaries of large frequent pattern mining,
and introduce a pattern lattice model to explain the various algorithms we later
present. We then categorise algorithms for mining long patterns by their mining
paradigm into three categories: mining by pattern enumeration, mining by pattern
merging and mining by pattern traversal with neighborhood adjacency.
2
Preliminaries
Large frequent pattern mining has been primarily studied in three data settings: item
sets, sequences and graphs, which we define accordingly as follows.
Item Sets Item sets presents the simplest setting of frequent pattern mining. Let
I
be a set of items
{
o 1 , o 2 , ... , o d }
. A nonempty subset of
I
is called an itemset .A
transaction dataset D is a collection of itemsets, D
={
t 1 , ... , t n }
, where t i
I
.
For any itemset α , we denote the set of transactions that contain α as D α ={
i
|
α
t i and t i
D
}
. Define the cardinality of an itemset α as the number of items it
contains, i.e.,
|
α
|=|{
o i |
o i
α
}|
.
Definition 4.1 (Frequent Itemset) For a transaction dataset D, an itemset α is
frequent in D if | D α |
σ , where | D α |
| D |
is called the support of α in D, written s ( α ) ,
and σ is the minimum support threshold , 0
| D |
1 .
Mining long patterns in item set setting is simply to find all frequent item sets
with cardinality greater than a user-specified threshold.
σ
Sequences The setting of sequences includes two related yet different cases: fre-
quent substrings and frequent subsequences, the latter being computationally much
more challenging than the former. Given a string S
=
s 1 , ...s n
of length n , another
=
z 1 ...z m
string Z
, m
n is a subsequence of S if there exists a sequence of
indices I
=
i 1 , ... , i m
, i j <i j + 1 ,1
j<m such that such that z j
=
s i j for all
1
j
m . We call Z a subsequence of S , denoted as S Z . If, in particular, we
Search WWH ::




Custom Search