80 CHAPTER 8. THEORY AND OUTLOOK
algorithm actually has true error larger than some ε. For each random draw of D ∼ P, this event {e(f_D) > ε} either happens or does not happen. Therefore, we can talk about the probability of this event

    Pr_{D∼P}({e(f_D) > ε}).    (8.3)
Our goal will be to show that this probability is small. The learning algorithm picked f_D because ê(f_D) = 0. There could be other functions in F with zero training error on that D. Consider the union of the events ∪_{f∈F: ê(f)=0} {e(f) > ε}. This union contains the event {e(f_D) > ε}. Therefore,

    Pr_{D∼P}({e(f_D) > ε}) ≤ Pr_{D∼P}( ∪_{f∈F: ê(f)=0} {e(f) > ε} )
                           = Pr_{D∼P}( ∪_{f∈F} {ê(f) = 0, e(f) > ε} )
                           = Pr_{D∼P}( ∪_{f∈F: e(f)>ε} {ê(f) = 0} ),    (8.4)
where the second and the third lines are different ways to represent the same union of events. Now by the union bound Pr(A ∪ B) ≤ Pr(A) + Pr(B),

    Pr_{D∼P}( ∪_{f∈F: e(f)>ε} {ê(f) = 0} ) ≤ ∑_{f∈F: e(f)>ε} Pr_{D∼P}({ê(f) = 0}).    (8.5)
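As a quick numerical sanity check of the union bound used above (not part of the original derivation), the sketch below estimates Pr(A ∪ B) and Pr(A) + Pr(B) for two overlapping events by Monte Carlo; the event definitions and probabilities are illustrative assumptions.

```python
import random

random.seed(0)
trials = 100_000
count_a = count_b = count_union = 0
for _ in range(trials):
    x = random.random()
    # Two overlapping illustrative events on a uniform draw from [0, 1):
    a = x < 0.3            # Pr(A) = 0.3
    b = 0.2 < x < 0.5      # Pr(B) = 0.3, overlapping A on (0.2, 0.3)
    count_a += a
    count_b += b
    count_union += (a or b)

p_union = count_union / trials           # estimate of Pr(A ∪ B), exactly 0.5
p_sum = (count_a + count_b) / trials     # estimate of Pr(A) + Pr(B), exactly 0.6
assert p_union <= p_sum                  # the union bound holds
```

The bound is loose exactly by the overlap Pr(A ∩ B); it becomes an equality when the events are disjoint.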
The true error e(f) for a given f can be thought of as a biased coin with heads probability e(f). Because D is drawn from P, the training sample error ê(f) is the fraction of heads in l independent coin flips. If the heads probability e(f) > ε, the probability of producing l tails in a row is bounded by (1 − ε)^l, the product of l independent Bernoulli trials. This is precisely the probability of the event {ê(f) = 0}. Thus,

    Pr_{D∼P}({ê(f) = 0}) ≤ (1 − ε)^l.    (8.6)
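Inequality (8.6) can also be checked by simulation. In the sketch below, the values of e(f), ε, and l are illustrative assumptions: a function with true error e(f) = 0.15 > ε = 0.1 is tested on l = 20 i.i.d. examples, and the estimated probability of zero training error is compared against the (1 − ε)^l bound.

```python
import random

random.seed(1)
eps = 0.1           # the error threshold ε (assumed value)
true_error = 0.15   # e(f) > ε: each example is misclassified with this probability
l = 20              # training sample size
trials = 200_000

zero_train_error = 0
for _ in range(trials):
    # D ~ P: the function errs on each of the l examples independently
    errors = sum(random.random() < true_error for _ in range(l))
    zero_train_error += (errors == 0)

p_hat = zero_train_error / trials   # estimate of Pr(ê(f) = 0), exactly (1 - 0.15)^20 ≈ 0.039
bound = (1 - eps) ** l              # the (1 - ε)^l bound from (8.6), ≈ 0.122
assert p_hat <= bound
```

The gap between the two numbers reflects that (8.6) replaces the unknown e(f) with the worst case e(f) arbitrarily close to ε.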
Finally, assuming that the function family F is finite in size,

    ∑_{f∈F: e(f)>ε} (1 − ε)^l ≤ ∑_{f∈F} (1 − ε)^l = |F| (1 − ε)^l ≤ |F| e^{−εl},    (8.7)
where the last inequality follows from 1 − x ≤ e^{−x}. Therefore, by connecting the preceding four equations, we arrive at the following inequality:

    Pr_{D∼P}({e(f_D) > ε}) ≤ |F| e^{−εl}.    (8.8)
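To get a feel for (8.8), one can ask how large l must be before the bound |F| e^{−εl} drops below a target failure probability δ; solving |F| e^{−εl} ≤ δ for l gives l ≥ (ln |F| + ln(1/δ)) / ε. The sketch below computes this, with illustrative assumed values for |F|, ε, and δ.

```python
import math

def sample_size(num_functions: int, eps: float, delta: float) -> int:
    """Smallest integer l with |F| * exp(-eps * l) <= delta, from bound (8.8)."""
    return math.ceil((math.log(num_functions) + math.log(1 / delta)) / eps)

# Illustrative: |F| = 2^20 functions, eps = 0.05, delta = 0.01
l = sample_size(2 ** 20, 0.05, 0.01)
assert (2 ** 20) * math.exp(-0.05 * l) <= 0.01  # the bound is indeed below delta
print(l)  # prints 370
```

Note that the required sample size grows only logarithmically in |F| and in 1/δ, but linearly in 1/ε.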
To see its relevance to learning, consider the complement event:

    Pr_{D∼P}({e(f_D) ≤ ε}) ≥ 1 − |F| e^{−εl}.    (8.9)

This says that probably (i.e., on at least a 1 − |F| e^{−εl} fraction of random draws of the training sample), the function f_D, picked by the supervised learning algorithm because ê(f_D) = 0, is approximately