Mining Graph Patterns - Frequent Pattern Mining

Database Reference

In-Depth Information

2 L

arg

min

t ∈ F , y ∈{± 1 }

−

y i h t , y ( x i )),

(13.1)

F = i = 1 {

where

is a set of candidate graphs or a feature set ( i.e. ,

⊆

x i }

) and

I (

) is the indicator function. The gain function for a rule

t , y

is defined as

gain (

t , y

)

y i h t , y ( x i ) .

(13.2)

Using the gain, the search problem in Eq. ( 13.1 ) becomes equivalent to the

problem:

t ,

arg max t ∈ F , y ∈{± 1 } gain (

t , y

). Then the gain function is used

instead of error rate.

gboost applies AdaBoost [ 17 ] by repeatedly calling the decision stumps and

finally produces a hypothesis f , which is a linear combination of K hypotheses pro-

duced by the decision stumps f ( x )

sgn k = 1 α k h t k , y k ( x ) . In the k th iteration,

a decision stump is built with weights d ( k )

d ( k )

, ... , d ( k L on the training data,

where i = 1 d ( k )

1, d ( k )

0. The weights are calculated to concentrate more

on hard examples than easy ones. In the boosting framework, the gain function is

redefined as

≥

gain (

t , y

)

y i d i h t , y ( x i ) .

(13.3)

i = 1

A Branch-and-Bound Search Approach According to the gain function in

Eq. ( 13.3 ), the problem of finding the optimal rule

t ,

from the training dataset is

defined as follows.

Problem 1 [Find Optimal Rule] Let T

x 1 , y 1 , d 1

, ... ,

x L , y L , d L }

be a training

data set where x i is a labeled graph, y i ∈{±

}

is a class label associated with x i and

d i ( i = 1 d i

1, d i

≥

0) is a normalized weight assigned to x i .Given T , find the

t ,

optimal rule

that maximizes the gain, i.e. ,

arg max t ∈ F , y ∈{± 1 } y i d i h

t , y

F = i = 1 { t | t ⊆

where

A naive method is to enumerate all subgraphs

x i }

and then calculate the gains for

all subgraphs. However, this method is impractical since the number of subgraphs is

exponential to their size. To avoid such exhaustive enumeration, the method to find

the optimal rule is modeled as a branch-and-bound algorithm based on the upper

bound of the gain function which is defined as follows.

Lemma 13.10 (Upper bound of the gain) For any t ⊇

t and y

∈{±

}

, the gain of

t , y

is bounded by μ ( t ) (i.e., gain (

)

≤

μ ( t ) ), where μ ( t ) is given by

⎛

⎝ 2

{ i | y i =+ 1, t ⊆ x i }

⎞

d i ,2

{ i | y i =− 1, t ⊆ x i }

⎠ .

μ ( t )

max

d i −

y i ·

d i +

y i ·

d i

i = 1

(13.4)

Frequent Pattern Mining

Search WWH ::

Custom Search

Home