to be overfitting, and can be expected to be a good predictor. In this section we discuss some other
algorithms which explicitly enforce hypothesis agreement, without requiring explicit feature splits
or the iterative mutual-teaching procedure. To understand these algorithms, we need to introduce
the regularized risk minimization framework for machine learning.
Recall that, in general, we can define a loss function to specify the cost of mistakes in prediction:
Definition 4.3. Loss Function. Let x ∈ X be an instance, y ∈ Y its true label, and f(x) our prediction. A loss function c(x, y, f(x)) ∈ [0, ∞) measures the amount of loss, or cost, of this prediction.
For example, in regression we can define the squared loss c(x, y, f(x)) = (y − f(x))². In classification we can define the 0/1 loss as c(x, y, f(x)) = 1 if y ≠ f(x), and 0 otherwise.
The loss function can be different for different types of misclassification. In medical diagnosis we might use c(x, y = healthy, f(x) = diseased) = 1 and c(x, y = diseased, f(x) = healthy) = 100. The loss function can also depend on the instance x: the same amount of medical prediction error might incur a higher loss on an infant than on an adult.
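To make these definitions concrete, here is a minimal Python sketch of the losses just described; the cost table encodes the hypothetical diagnosis costs above, and the function names are ours, not the book's:

```python
def squared_loss(y, y_hat):
    """Squared loss for regression: c(x, y, f(x)) = (y - f(x))^2."""
    return (y - y_hat) ** 2

def zero_one_loss(y, y_hat):
    """0/1 loss for classification: 1 if the prediction is wrong, else 0."""
    return 1.0 if y != y_hat else 0.0

# Asymmetric diagnosis loss: missing a disease costs 100, a false alarm costs 1.
DIAGNOSIS_COST = {
    ("healthy", "healthy"): 0.0,
    ("healthy", "diseased"): 1.0,    # keys are (true label, prediction)
    ("diseased", "diseased"): 0.0,
    ("diseased", "healthy"): 100.0,
}

def diagnosis_loss(y, y_hat):
    return DIAGNOSIS_COST[(y, y_hat)]
```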
Definition 4.4. Empirical Risk. The empirical risk of f is the average loss incurred by f on a labeled training sample:

R(f) = \frac{1}{l} \sum_{i=1}^{l} c(x_i, y_i, f(x_i)).
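In code, the empirical risk is just an average of per-instance losses; a minimal sketch, where the predictor f, the sample format, and the loss signature are our assumptions:

```python
def empirical_risk(f, sample, loss):
    """Average loss of predictor f on a labeled sample [(x_1, y_1), ..., (x_l, y_l)]."""
    return sum(loss(y, f(x)) for x, y in sample) / len(sample)

# e.g., with the squared loss from the earlier sketch:
# empirical_risk(lambda x: 2 * x, [(1.0, 2.1), (2.0, 3.8)], squared_loss)
```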
Applying the principle of empirical risk minimization (ERM), i.e., finding the f that minimizes the empirical risk, may seem like a natural thing to do:

f_{ERM} = \arg\min_{f \in F} R(f),    (4.2)

where F is the set of all hypotheses we consider. For classification with 0/1 loss, ERM amounts to minimizing the training sample error.
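For a finite hypothesis set, Eq. (4.2) can be carried out by exhaustive search; a toy self-contained sketch, with a hypothetical F of one-dimensional threshold classifiers and made-up data:

```python
# Hypothetical finite hypothesis set F: threshold classifiers on a scalar x.
thresholds = [0.0, 0.5, 1.0, 1.5, 2.0]
F = [lambda x, t=t: 1 if x > t else 0 for t in thresholds]

# Toy labeled sample (made up for illustration).
sample = [(0.2, 0), (0.8, 0), (1.2, 1), (1.9, 1)]

def risk(f):
    # Empirical risk under 0/1 loss: fraction of training mistakes.
    return sum(1 for x, y in sample if f(x) != y) / len(sample)

# ERM (Eq. 4.2): pick the hypothesis with the smallest empirical risk.
f_erm = min(F, key=risk)
```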
However, f_{ERM} can overfit the particular training sample. As a consequence, f_{ERM} is not necessarily the classifier in F with the smallest true risk on future data. One remedy is to regularize the empirical risk by a regularizer Ω(f). The regularizer Ω(f) is a non-negative functional, i.e., it takes a function f as input and outputs a non-negative real value. The value is such that if f is "smooth" or "simple" in some sense, Ω(f) will be close to zero; if f is too zigzagged (i.e., it overfits and attempts to pass through all labeled training instances), Ω(f) is large.
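For linear predictors f(x) = w · x, one common regularizer (our choice of example, not the only one) is the squared weight norm, which is near zero for flat, simple functions and large for steep ones:

```python
import numpy as np

def omega(w):
    """Squared-norm regularizer Omega(f) = ||w||^2 for a linear f(x) = w . x."""
    return float(np.dot(w, w))

# omega(np.array([0.1, 0.0])) is tiny, while omega(np.array([50.0, -30.0])) is
# large, penalizing steep, "zigzagged" linear fits.
```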
Definition 4.5. Regularized Risk. The regularized risk is the weighted sum of the empirical risk R(f) and the regularizer Ω(f), with weight λ ≥ 0: R(f) + λΩ(f). The principle of regularized risk minimization is to find the f that minimizes the regularized risk:

f^* = \arg\min_{f \in F} R(f) + \lambda \Omega(f).    (4.3)
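A familiar instance of Eq. (4.3) is ridge regression, where R(f) is the mean squared loss of a linear f(x) = w · x and Ω(f) = ||w||²; a minimal numpy sketch with made-up data:

```python
import numpy as np

# Toy data: l = 4 instances, d = 1 feature (made up for illustration).
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.1, 0.9, 2.1, 2.9])
lam = 0.1  # regularization weight lambda >= 0

# Minimize (1/l) * ||y - X w||^2 + lam * ||w||^2 over w.
# Setting the gradient to zero gives the closed form
#   w* = (X^T X + l * lam * I)^{-1} X^T y.
l, d = X.shape
w_star = np.linalg.solve(X.T @ X + l * lam * np.eye(d), X.T @ y)
```

As λ grows, the penalty shrinks the weights toward zero, trading a higher empirical risk for a "simpler" f.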