Big Data Frequent Pattern Mining - Frequent Pattern Mining

Database Reference

In-Depth Information

order of items i ∈ C , C ⊆ I . We assume sequence elements are ordered in non-

decreasing order based on their timestamps. A sequence s

= C 1 , C 2 , ... , C m

m ≤ l is a sub-sequence of s if there exist integers i 1 , i 2 , ... , i m s.t. 1

≤ i 1 ≤ i 2 ≤

1, 2, ... , m . In words, itemsets in s are subsets of

those in s and follow the same list order as in s .If s is a sub-sequence of s , we write

that s ⊆

l and C j

...

≤

i m ≤

⊆

C i j , j

s and say that s contains s and s is a super-sequence of s . Similar to the

itemset support, the support of s is defined as the proportion of sequences in

that

s |

s ∈ D

s }|

contain s , i. e., φ ( s )

=|{

, s

⊆

| D |

. A sequence is said to be a frequent

sequence if it has a support greater than σ .

similar

model

extension

has

been

proposed

for

mining

structures,

graphs/networks. We are given a set of graphs

typically have

labelled edges and vertices, though this is not required. V ( G ) and E ( G ) represent

the vertex and edge sets of a graph G , respectively. The graph G

of size

| G |

. Graphs in

( V ( G ), E ( G ))

is said to be a subgraph of another graph H

( V ( H ), E ( H )) if there is a bi-

jection from E ( G ) to a subset of E ( H ). The relation is noted as G

⊆

H . The

support of G is the proportion of graphs in

that have G as a subgraph, i.e.,

φ ( G )

=|{

∈ G

, G

⊆

| G |

. A graph is said to be a frequent graph if it has

a support greater than σ .

The problem of frequent pattern mining (FPM) is formally defined as follows.

Its specialization for the frequent itemset mining (FIM), frequent sequence mining

(FSM), and frequent graph mining (FGM) is straight-forward.

Definition 1

Given a pattern container

and a user-specified parameter σ (0

≤

1) , find all sub-patterns each of which is supported by at least

σ | P |

patterns

At times, we may wish to restrict the search to only maximal or closed patterns. A

maximal pattern m is not a sub-pattern of any other frequent pattern in the database,

whereas a closed pattern c has no proper super-pattern in the database with the same

support.

A number of variations of the frequent sequence and frequent graph problems

have been proposed. In some domains, the elements in a sequence are symbols

from an alphabet

. We call

these sequences symbol sequences . The symbol sequence model is equivalent to the

general itemset sequence model where

, e.g.,

A ={

A , C , G , T

}

and s

T GGT GAGT

. Another

interesting problem, sequence motif mining , looks to find frequent sub-sequences

within one (or a few) very long sequences. In this case, the support threshold is

given as a support count, the minimum number of occurrences of the sub-sequence,

rather than a value 0

1 for all C

∈

s , s

∈ D

1, and additional constraints may be specified, such as

minimum/maximum sub-sequence length. A similar problem is defined for graphs,

unfortunately also called frequent graph mining in the literature, where the support

of G is the number of edge-disjoint subgraphs in a large graph

≤

that are isomorphic

to G . Two subgraphs are edge-disjoint if they do not share any edges. We call each

appearance of G in

an embedding . Two graphs G and H are isomorphic if there

exists a bijection between their vertex sets, f : V ( G )

→

V ( H ), s.t. any two vertices

u , v

∈

V ( G ) are adjacent in G if and only if f ( u ) and f ( v ) are adjacent in H .

Frequent Pattern Mining

Search WWH ::

Custom Search

Home