Database Reference
In-Depth Information
order of items i C , C I . We assume sequence elements are ordered in non-
decreasing order based on their timestamps. A sequence s
= C 1 , C 2 , ... , C m
,
m l is a sub-sequence of s if there exist integers i 1 , i 2 , ... , i m s.t. 1
i 1 i 2
1, 2, ... , m . In words, itemsets in s are subsets of
those in s and follow the same list order as in s .If s is a sub-sequence of s , we write
that s
l and C j
...
i m
C i j , j
=
s and say that s contains s and s is a super-sequence of s . Similar to the
itemset support, the support of s is defined as the proportion of sequences in
D
that
s |
s D
s }|
contain s , i. e., φ ( s )
=|{
, s
/
| D |
. A sequence is said to be a frequent
sequence if it has a support greater than σ .
A
similar
model
extension
has
been
proposed
for
mining
structures,
or
graphs/networks. We are given a set of graphs
typically have
labelled edges and vertices, though this is not required. V ( G ) and E ( G ) represent
the vertex and edge sets of a graph G , respectively. The graph G
G
of size
| G |
. Graphs in
G
=
( V ( G ), E ( G ))
is said to be a subgraph of another graph H
( V ( H ), E ( H )) if there is a bi-
jection from E ( G ) to a subset of E ( H ). The relation is noted as G
=
H . The
support of G is the proportion of graphs in
G
that have G as a subgraph, i.e.,
φ ( G )
=|{
H
|
H
G
, G
H
}|
/
| G |
. A graph is said to be a frequent graph if it has
a support greater than σ .
The problem of frequent pattern mining (FPM) is formally defined as follows.
Its specialization for the frequent itemset mining (FIM), frequent sequence mining
(FSM), and frequent graph mining (FGM) is straight-forward.
Definition 1
Given a pattern container
P
and a user-specified parameter σ (0
σ
1) , find all sub-patterns each of which is supported by at least
σ | P |
patterns
in
.
At times, we may wish to restrict the search to only maximal or closed patterns. A
maximal pattern m is not a sub-pattern of any other frequent pattern in the database,
whereas a closed pattern c has no proper super-pattern in the database with the same
support.
A number of variations of the frequent sequence and frequent graph problems
have been proposed. In some domains, the elements in a sequence are symbols
from an alphabet
P
. We call
these sequences symbol sequences . The symbol sequence model is equivalent to the
general itemset sequence model where
A
, e.g.,
A ={
A , C , G , T
}
and s
=
T GGT GAGT
. Another
interesting problem, sequence motif mining , looks to find frequent sub-sequences
within one (or a few) very long sequences. In this case, the support threshold is
given as a support count, the minimum number of occurrences of the sub-sequence,
rather than a value 0
|
C
|=
1 for all C
s , s
D
1, and additional constraints may be specified, such as
minimum/maximum sub-sequence length. A similar problem is defined for graphs,
unfortunately also called frequent graph mining in the literature, where the support
of G is the number of edge-disjoint subgraphs in a large graph
σ
that are isomorphic
to G . Two subgraphs are edge-disjoint if they do not share any edges. We call each
appearance of G in
G
G
an embedding . Two graphs G and H are isomorphic if there
exists a bijection between their vertex sets, f : V ( G )
V ( H ), s.t. any two vertices
u , v
V ( G ) are adjacent in G if and only if f ( u ) and f ( v ) are adjacent in H .
Search WWH ::




Custom Search