80 CHAPTER 8. THEORY AND OUTLOOK
algorithm actually has true error larger than some $\epsilon$. For each random draw of $D \sim P$, this event $\{e(f_D) > \epsilon\}$ either happens or does not happen. Therefore, we can talk about the probability of this event

$$\Pr_{D \sim P}\left(\{e(f_D) > \epsilon\}\right). \qquad (8.3)$$
Our goal will be to show that this probability is small. The learning algorithm picked $f_D$ because $\hat{e}(f_D) = 0$. There could be other functions in $F$ with zero training error on that $D$. Consider the union of the events $\bigcup_{f \in F : \hat{e}(f) = 0} \{e(f) > \epsilon\}$. This union contains the event $\{e(f_D) > \epsilon\}$. Therefore,
$$\begin{aligned}
\Pr_{D \sim P}\left(\{e(f_D) > \epsilon\}\right)
&\le \Pr_{D \sim P}\left(\bigcup_{f \in F : \hat{e}(f) = 0} \{e(f) > \epsilon\}\right) \\
&= \Pr_{D \sim P}\left(\bigcup_{f \in F} \{\hat{e}(f) = 0,\, e(f) > \epsilon\}\right) \\
&= \Pr_{D \sim P}\left(\bigcup_{f \in F : e(f) > \epsilon} \{\hat{e}(f) = 0\}\right), \qquad (8.4)
\end{aligned}$$
where the second and the third lines are different ways to represent the same union of events. Now by the union bound $\Pr(A \cup B) \le \Pr(A) + \Pr(B)$,

$$\Pr_{D \sim P}\left(\bigcup_{f \in F : e(f) > \epsilon} \{\hat{e}(f) = 0\}\right) \le \sum_{f \in F : e(f) > \epsilon} \Pr_{D \sim P}\left(\{\hat{e}(f) = 0\}\right). \qquad (8.5)$$
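The union bound step in (8.5) can be sanity-checked numerically. The sketch below is a toy Monte Carlo setup of my own (the error rates, sample size, and the independence of the survival events are all illustrative assumptions, not from the text): it estimates the probability that at least one bad function attains zero training error and compares it to the sum of the individual probabilities.

```python
import random

# Hypothetical toy family: three "bad" functions with true errors e(f),
# each judged on l examples; a function "survives" if it makes no errors.
rng = random.Random(42)
error_rates = [0.2, 0.3, 0.5]  # assumed true errors e(f)
l, trials = 5, 200_000

union_hits = 0
individual_hits = [0] * len(error_rates)
for _ in range(trials):
    # survival of each function: all l examples classified correctly
    survived = [all(rng.random() >= e for _ in range(l)) for e in error_rates]
    union_hits += any(survived)
    for i, s in enumerate(survived):
        individual_hits[i] += s

union_prob = union_hits / trials
sum_of_probs = sum(h / trials for h in individual_hits)
# Union bound: Pr(at least one survives) <= sum of individual probabilities
```

In the derivation the survival events share the same draw of $D$ and so are correlated, but the union bound holds regardless of any dependence among the events.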
The true error $e(f)$ for a given $f$ can be thought of as a biased coin with heads probability $e(f)$. Because $D$ is drawn from $P$, the training sample error $\hat{e}(f)$ is the fraction of heads in $l$ independent coin flips. If the heads probability $e(f) > \epsilon$, the probability of producing $l$ tails in a row is bounded by $(1 - \epsilon)^l$, the product of $l$ independent Bernoulli trials. This is precisely the probability of the event $\{\hat{e}(f) = 0\}$. Thus,

$$\Pr_{D \sim P}\left(\{\hat{e}(f) = 0\}\right) \le (1 - \epsilon)^l. \qquad (8.6)$$
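This single-function bound is easy to check by simulation. The sketch below (the error rate $0.15$, the sample size, and the trial count are illustrative choices of mine) estimates the probability that a function with true error above $\epsilon$ nevertheless shows zero training error, and compares it to $(1 - \epsilon)^l$:

```python
import random

def prob_zero_training_error(true_error, l, trials=100_000, seed=0):
    """Estimate Pr(training error = 0) for a fixed function whose true
    error rate is `true_error`, evaluated on l i.i.d. examples."""
    rng = random.Random(seed)
    hits = sum(
        all(rng.random() >= true_error for _ in range(l))
        for _ in range(trials)
    )
    return hits / trials

eps, l = 0.1, 30
estimate = prob_zero_training_error(true_error=0.15, l=l)  # e(f) = 0.15 > eps
bound = (1 - eps) ** l  # the (8.6) bound, about 0.042
```

The estimate lands near $(1 - 0.15)^{30} \approx 0.008$, comfortably below the bound, as (8.6) predicts for any function whose true error exceeds $\epsilon$.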
Finally, assuming that the function family is finite in size,

$$\sum_{f \in F : e(f) > \epsilon} (1 - \epsilon)^l \le \sum_{f \in F} (1 - \epsilon)^l = |F| (1 - \epsilon)^l \le |F| e^{-\epsilon l}, \qquad (8.7)$$

where the last inequality follows from $1 - x \le e^{-x}$. Therefore, by connecting the preceding four equations, we arrive at the following inequality:

$$\Pr_{D \sim P}\left(\{e(f_D) > \epsilon\}\right) \le |F| e^{-\epsilon l}. \qquad (8.8)$$
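Inequality (8.8) also yields a sample-size recipe: to guarantee that the bad event has probability at most some $\delta$, it suffices that $|F| e^{-\epsilon l} \le \delta$, i.e. $l \ge \ln(|F|/\delta)/\epsilon$. A minimal sketch (the function name and the example numbers are my own, not from the text):

```python
import math

def sufficient_sample_size(size_of_F, eps, delta):
    """Smallest integer l with |F| * exp(-eps * l) <= delta,
    obtained by solving (8.8) for l: l >= ln(|F| / delta) / eps."""
    return math.ceil(math.log(size_of_F / delta) / eps)

# Example: |F| = 2**20 candidate functions, eps = 0.05, delta = 0.01
l = sufficient_sample_size(2 ** 20, eps=0.05, delta=0.01)
# With this l, the bound |F| * exp(-eps * l) indeed drops below delta.
```

Note that $l$ grows only logarithmically in $|F|$, which is what makes the bound useful even for large finite families.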
To see its relevance to learning, consider the complement event:

$$\Pr_{D \sim P}\left(\{e(f_D) \le \epsilon\}\right) \ge 1 - |F| e^{-\epsilon l}. \qquad (8.9)$$
This says that probably (i.e., on at least a $1 - |F| e^{-\epsilon l}$ fraction of random draws of the training sample), the function $f_D$, picked by the supervised learning algorithm because $\hat{e}(f_D) = 0$, is approximately