80 CHAPTER 8. THEORY AND OUTLOOK
algorithm actually has true error larger than some $\epsilon$. For each random draw of $D \sim P$, this event $\{e(f_D) > \epsilon\}$ either happens or does not happen. Therefore, we can talk about the probability of this event

$$\Pr_{D \sim P}\left(\{e(f_D) > \epsilon\}\right). \qquad (8.3)$$
Our goal will be to show that this probability is small. The learning algorithm picked $f_D$ because $\hat{e}(f_D) = 0$. There could be other functions in $F$ with zero training error on that $D$. Consider the union of the events $\bigcup_{f \in F : \hat{e}(f) = 0} \{e(f) > \epsilon\}$. This union contains the event $\{e(f_D) > \epsilon\}$. Therefore,
$$\begin{aligned}
\Pr_{D \sim P}\left(\{e(f_D) > \epsilon\}\right)
&\le \Pr_{D \sim P}\left(\bigcup_{f \in F : \hat{e}(f) = 0} \{e(f) > \epsilon\}\right) \\
&= \Pr_{D \sim P}\left(\bigcup_{f \in F} \{\hat{e}(f) = 0,\, e(f) > \epsilon\}\right) \\
&= \Pr_{D \sim P}\left(\bigcup_{f \in F : e(f) > \epsilon} \{\hat{e}(f) = 0\}\right), \qquad (8.4)
\end{aligned}$$
where the second and the third lines are different ways to represent the same union of events. Now by the union bound $\Pr(A \cup B) \le \Pr(A) + \Pr(B)$,

$$\Pr_{D \sim P}\left(\bigcup_{f \in F : e(f) > \epsilon} \{\hat{e}(f) = 0\}\right) \le \sum_{f \in F : e(f) > \epsilon} \Pr_{D \sim P}\left(\{\hat{e}(f) = 0\}\right). \qquad (8.5)$$
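The union bound step in (8.5) can be sanity-checked numerically. The sketch below is a toy Monte Carlo setup of my own (the error rates, sample size, and the independence of the survival events are all illustrative assumptions, not from the text): it estimates the probability that at least one bad function attains zero training error and compares it to the sum of the individual probabilities.

```python
import random

# Hypothetical toy family: three "bad" functions with true errors e(f),
# each judged on l examples; a function "survives" if it makes no errors.
rng = random.Random(42)
error_rates = [0.2, 0.3, 0.5]  # assumed true errors e(f)
l, trials = 5, 200_000

union_hits = 0
individual_hits = [0] * len(error_rates)
for _ in range(trials):
    # survival of each function: all l examples classified correctly
    survived = [all(rng.random() >= e for _ in range(l)) for e in error_rates]
    union_hits += any(survived)
    for i, s in enumerate(survived):
        individual_hits[i] += s

union_prob = union_hits / trials
sum_of_probs = sum(h / trials for h in individual_hits)
# Union bound: Pr(at least one survives) <= sum of individual probabilities
```

In the derivation the survival events share the same draw of $D$ and so are correlated, but the union bound holds regardless of any dependence among the events.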
The true error $e(f)$ for a given $f$ can be thought of as a biased coin with heads probability $e(f)$. Because $D$ is drawn from $P$, the training sample error $\hat{e}(f)$ is the fraction of heads in $l$ independent coin flips. If the heads probability $e(f) > \epsilon$, the probability of producing $l$ tails in a row is bounded by $(1 - \epsilon)^l$, the product of $l$ independent Bernoulli trials. This is precisely the probability of the event $\{\hat{e}(f) = 0\}$. Thus,

$$\Pr_{D \sim P}\left(\{\hat{e}(f) = 0\}\right) \le (1 - \epsilon)^l. \qquad (8.6)$$
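This single-function bound is easy to check by simulation. The sketch below (the error rate $0.15$, the sample size, and the trial count are illustrative choices of mine) estimates the probability that a function with true error above $\epsilon$ nevertheless shows zero training error, and compares it to $(1 - \epsilon)^l$:

```python
import random

def prob_zero_training_error(true_error, l, trials=100_000, seed=0):
    """Estimate Pr(training error = 0) for a fixed function whose true
    error rate is `true_error`, evaluated on l i.i.d. examples."""
    rng = random.Random(seed)
    hits = sum(
        all(rng.random() >= true_error for _ in range(l))
        for _ in range(trials)
    )
    return hits / trials

eps, l = 0.1, 30
estimate = prob_zero_training_error(true_error=0.15, l=l)  # e(f) = 0.15 > eps
bound = (1 - eps) ** l  # the (8.6) bound, about 0.042
```

The estimate lands near $(1 - 0.15)^{30} \approx 0.008$, comfortably below the bound, as (8.6) predicts for any function whose true error exceeds $\epsilon$.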
Finally, assuming that the function family is finite in size,

$$\sum_{f \in F : e(f) > \epsilon} (1 - \epsilon)^l \le \sum_{f \in F} (1 - \epsilon)^l = |F| (1 - \epsilon)^l \le |F| e^{-\epsilon l}, \qquad (8.7)$$

where the last inequality follows from $1 - x \le e^{-x}$. Therefore, by connecting the preceding four equations, we arrive at the following inequality:

$$\Pr_{D \sim P}\left(\{e(f_D) > \epsilon\}\right) \le |F| e^{-\epsilon l}. \qquad (8.8)$$
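Inequality (8.8) also yields a sample-size recipe: to guarantee that the bad event has probability at most some $\delta$, it suffices that $|F| e^{-\epsilon l} \le \delta$, i.e. $l \ge \ln(|F|/\delta)/\epsilon$. A minimal sketch (the function name and the example numbers are my own, not from the text):

```python
import math

def sufficient_sample_size(size_of_F, eps, delta):
    """Smallest integer l with |F| * exp(-eps * l) <= delta,
    obtained by solving (8.8) for l: l >= ln(|F| / delta) / eps."""
    return math.ceil(math.log(size_of_F / delta) / eps)

# Example: |F| = 2**20 candidate functions, eps = 0.05, delta = 0.01
l = sufficient_sample_size(2 ** 20, eps=0.05, delta=0.01)
# With this l, the bound |F| * exp(-eps * l) indeed drops below delta.
```

Note that $l$ grows only logarithmically in $|F|$, which is what makes the bound useful even for large finite families.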
To see its relevance to learning, consider the complement event:

$$\Pr_{D \sim P}\left(\{e(f_D) \le \epsilon\}\right) \ge 1 - |F| e^{-\epsilon l}. \qquad (8.9)$$
This says that probably (i.e., on at least a $1 - |F| e^{-\epsilon l}$ fraction of random draws of the training sample), the function $f_D$, picked by the supervised learning algorithm because $\hat{e}(f_D) = 0$, is approximately