Depending on the domain of the label $y$, supervised learning problems are further divided into classification and regression:
Definition 1.9. Classification. Classification is the supervised learning problem with discrete classes $\mathcal{Y}$. The function $f$ is called a classifier.
Definition 1.10. Regression. Regression is the supervised learning problem with continuous $\mathcal{Y}$. The function $f$ is called a regression function.
What exactly is a good $f$? The best $f$ is by definition
$$f^* = \operatorname*{argmin}_{f \in \mathcal{F}} \; \mathbb{E}_{(\mathbf{x},y) \sim P}\left[ c(\mathbf{x}, y, f(\mathbf{x})) \right], \tag{1.3}$$
where argmin means "finding the $f$ that minimizes the following quantity". $\mathbb{E}_{(\mathbf{x},y) \sim P}[\cdot]$ is the expectation over random test data drawn from $P$. Readers not familiar with this notation may wish to consult Appendix A. $c(\cdot)$ is a loss function that determines the cost or impact of making a prediction $f(\mathbf{x})$ that is different from the true label $y$. Some typical loss functions will be discussed shortly. Note that we limit our attention to some function family $\mathcal{F}$, mostly for computational reasons. If we remove this limitation and consider all possible functions, the resulting $f^*$ is the Bayes optimal predictor, the best one can hope for on average. For the distribution $P$, this function will incur the lowest possible loss when making predictions. The quantity $\mathbb{E}_{(\mathbf{x},y) \sim P}\left[ c(\mathbf{x}, y, f^*(\mathbf{x})) \right]$ is known as the Bayes error.
However, the Bayes optimal predictor may not be in $\mathcal{F}$ in general. Our goal is to find the $f \in \mathcal{F}$ that is as close to the Bayes optimal predictor as possible.
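
To make the Bayes optimal predictor and the Bayes error concrete, the following minimal sketch computes both for a tiny discrete distribution that is fully known. The distribution `P`, the input names `"a"` and `"b"`, and all function names are hypothetical and chosen only for illustration. The loss is assumed to be the 0-1 loss (cost 1 for a wrong prediction, 0 for a correct one). In practice $P$ is not available, so this calculation is not possible on real data.

```python
# Hypothetical joint distribution P(x, y) over two inputs and two labels.
# The probabilities are made up for illustration and sum to 1.
P = {
    ("a", 0): 0.4, ("a", 1): 0.1,
    ("b", 0): 0.2, ("b", 1): 0.3,
}


def zero_one_loss(x, y, prediction):
    """Assumed loss c: cost 1 for a wrong prediction, 0 for a correct one."""
    return 0 if prediction == y else 1


def expected_loss(f, P, loss):
    """E_{(x,y)~P}[c(x, y, f(x))] for a finite distribution P."""
    return sum(p * loss(x, y, f(x)) for (x, y), p in P.items())


def bayes_optimal_predictor(P, loss):
    """Search over all functions by choosing, for each x separately,
    the prediction with the smallest expected loss."""
    inputs = sorted({x for x, _ in P})
    labels = sorted({y for _, y in P})
    table = {}
    for x in inputs:
        # Minimizing sum_y P(x, y) * loss is equivalent to minimizing
        # the conditional expected loss given x.
        table[x] = min(
            labels,
            key=lambda yhat: sum(
                P.get((x, y), 0.0) * loss(x, y, yhat) for y in labels
            ),
        )
    return lambda x: table[x]


f_star = bayes_optimal_predictor(P, zero_one_loss)
bayes_error = expected_loss(f_star, P, zero_one_loss)

print(f_star("a"), f_star("b"))  # most probable label for each input
print(round(bayes_error, 6))     # expected loss of the best predictor
```

Even the best predictor has nonzero expected loss here, because each input is paired with both labels with positive probability. Restricting the search to a smaller family, for example constant predictors only, can only give an expected loss at least as large.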
It is worth noting that the underlying distribution $P(\mathbf{x}, y)$ is unknown to us. Therefore, it is not possible to directly find $f^*$, or even to measure any predictor $f$'s performance, for that matter. Here lies the fundamental difficulty of statistical machine learning: one has to generalize the prediction from a finite training sample to any unseen test data. This is known as induction.
To proceed, a seemingly reasonable approximation is to gauge $f$'s performance using training sample error. That is, to replace the unknown expectation by the average over the training sample:
Definition 1.11. Training sample error. Given a training sample $\{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$, the training sample error is
$$\frac{1}{n} \sum_{i=1}^{n} c(\mathbf{x}_i, y_i, f(\mathbf{x}_i)). \tag{1.4}$$
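
The average in Definition 1.11 can be sketched in a few lines of Python. This is an illustrative sketch only: the function names, the toy sample, and the predictor `f` are hypothetical, and the absolute-error loss is an assumed choice of $c$ for a regression setting, not one prescribed by the text.

```python
def training_sample_error(f, sample, loss):
    """Average of loss(x_i, y_i, f(x_i)) over the n training pairs."""
    if not sample:
        raise ValueError("training sample must be non-empty")
    return sum(loss(x, y, f(x)) for x, y in sample) / len(sample)


def absolute_loss(x, y, prediction):
    """Assumed loss c for regression: distance between prediction and label."""
    return abs(prediction - y)


# Hypothetical training sample of (x_i, y_i) pairs.
sample = [(1.0, 2.0), (2.0, 4.5), (3.0, 5.0)]


def f(x):
    """Hypothetical regression function."""
    return 2.0 * x


error = training_sample_error(f, sample, absolute_loss)
print(error)  # (0 + 0.5 + 1.0) / 3
```

Passing the loss as an argument keeps the averaging step separate from the choice of $c$, which mirrors how the definition is stated.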
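For classification, one commonly used loss function is the 0-1 loss $c(\mathbf{x}, y, f(\mathbf{x})) \equiv (f(\mathbf{x}) \neq y)$, for which the training sample error is:
$$\frac{1}{n} \sum_{i=1}^{n} \left( f(\mathbf{x}_i) \neq y_i \right), \tag{1.5}$$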
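
Under the 0-1 loss, the training sample error is the fraction of training points that the classifier labels incorrectly. The sketch below illustrates this on a made-up sample; the data, the threshold classifier `f`, and the function name are all hypothetical.

```python
def zero_one_training_error(f, sample):
    """Fraction of training pairs (x_i, y_i) with f(x_i) != y_i."""
    if not sample:
        raise ValueError("training sample must be non-empty")
    mistakes = sum(1 for x, y in sample if f(x) != y)
    return mistakes / len(sample)


# Hypothetical training sample: one numeric feature and a class label.
sample = [(0.2, "neg"), (0.7, "pos"), (0.4, "pos"), (0.9, "pos")]


def f(x):
    """Hypothetical classifier: threshold the feature at 0.5."""
    return "pos" if x >= 0.5 else "neg"


error_rate = zero_one_training_error(f, sample)
print(error_rate)  # one of the four points (0.4, "pos") is misclassified
```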