might not be the best choice, since it would lead to poor performance on unseen
instances. To tackle this problem, many learning algorithms consider the generalization
of the model from training data to test data. For this purpose, one should choose
models that fit the data well but at the same time are as simple as possible
(to avoid overfitting the training data). An immediate question is then how to
measure the simplicity (or its counterpart, complexity) of a model. This is one of the
central problems in statistical learning theory.
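To make the trade-off concrete, the following sketch (a hypothetical illustration; the target function $\sin(2\pi x)$, the noise level, the sample sizes, and the polynomial degrees are all invented for this example) fits polynomials of increasing degree to noisy data drawn from one distribution. The more complex model achieves lower training error but higher test error, which is exactly the overfitting phenomenon described above.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    """Draw n i.i.d. pairs (x, y) with y = sin(2*pi*x) + Gaussian noise."""
    x = rng.uniform(0.0, 1.0, n)
    y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, n)
    return x, y

x_train, y_train = sample(15)    # small training set
x_test, y_test = sample(1000)    # large test set from the same distribution

for degree in (1, 3, 9):
    # Least-squares polynomial fit of the given degree.
    coeffs = np.polyfit(x_train, y_train, degree)
    mse_train = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    mse_test = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {mse_train:.3f}, test MSE {mse_test:.3f}")
```

The degree-9 fit typically drives the training error close to zero while its test error exceeds that of the degree-3 fit, illustrating why model selection must weigh fit against complexity.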
Regarding this question, one needs to address an important issue: the assumption
on the data. Here one should be aware of the famous No Free Lunch theorem,
which basically states that if no assumption is made on the connection between
training and test data, prediction is impossible. Furthermore, if there are no
a priori assumptions on the possible test data, it is impossible to generalize,
and thus no algorithm is uniformly better than the others (any algorithm will be
beaten by some other algorithm on certain kinds of test data). Therefore, one needs
to make a reasonable assumption on the data before deriving learning algorithms
and performing theoretical analysis on them.
In statistical learning theory, the assumption on the data is that both training and
test data are sampled independently from the same distribution. The independence
assumption means that each new example yields maximum information, and the
identical-distribution assumption means that the new examples carry information
about the same underlying probability distribution.
In the rest of the section, we will take binary classification as an example to show
some basic concepts of statistical learning theory.
22.3.1 Formalization
In this subsection, we formalize the basic problems in statistical learning theory.
For simplicity, we consider binary classification, where the input space is $\mathcal{X}$ and
the output space is $\mathcal{Y} = \{+1, -1\}$. We assume that the pairs $(X, Y) \in \mathcal{X} \times \mathcal{Y}$ are
random variables distributed according to an unknown distribution $P$, and the goal
is to construct a function $g : \mathcal{X} \to \mathcal{Y}$ which predicts $Y$ from $X$.
For randomly chosen test data, the error made by the function $g$ can be represented as
$$R(g) = P\bigl(g(X) \neq Y\bigr) = \mathbb{E}\bigl[\mathbb{I}_{\{g(X) \neq Y\}}\bigr].$$
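In practice $P$ is unknown, so $R(g)$ is estimated by the sample average of the indicator loss over i.i.d. data, which is exactly the assumption stated above. A minimal Python sketch follows; the distribution ($Y$ uniform on $\{+1, -1\}$, $X \sim N(Y, 1)$) and the classifier $g$ are invented here purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_pairs(n):
    """Draw n i.i.d. pairs (X, Y): Y uniform on {+1, -1}, X ~ N(Y, 1)."""
    y = rng.choice([-1, 1], size=n)
    x = y + rng.normal(0.0, 1.0, n)
    return x, y

def g(x):
    """A candidate classifier: predict the sign of x."""
    return np.where(x >= 0, 1, -1)

x, y = sample_pairs(100_000)
# Empirical risk: the fraction of misclassified points, i.e. the sample
# average of I{g(X) != Y}, which converges to R(g) = P(g(X) != Y)
# by the law of large numbers.
print(f"estimated R(g) = {np.mean(g(x) != y):.4f}")
```

For this particular distribution the true risk of $g$ is $\Phi(-1) \approx 0.159$, which the printed estimate should approach as the sample size grows.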
It is clear that if we introduce the regression function
$$\eta(x) = \mathbb{E}[Y \mid X = x] = 2P(Y = 1 \mid X = x) - 1$$
and the target function (or Bayes classifier) $t(x) = \operatorname{sgn}(\eta(x))$, then this target
function would achieve the minimum risk over all the possible measurable functions:
$$R(t) = \inf_{g} R(g).$$
We denote the value of $R(t)$ as $R^{*}$, and refer to it as the Bayes risk.
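Continuing the invented distribution from the previous sketch, $\eta(x)$ has the closed form $\tanh(x)$, so the Bayes classifier is $t(x) = \operatorname{sgn}(x)$ and the Bayes risk is $R^{*} = \Phi(-1) \approx 0.159$. The following sketch checks numerically that $t$ beats a competitor with a shifted decision threshold (the competitor is, again, a hypothetical choice):

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_pairs(n):
    """Same synthetic distribution as before: Y uniform on {+1, -1}, X ~ N(Y, 1)."""
    y = rng.choice([-1, 1], size=n)
    x = y + rng.normal(0.0, 1.0, n)
    return x, y

def eta(x):
    """Regression function eta(x) = E[Y | X = x] = 2 P(Y = 1 | X = x) - 1.
    For this distribution P(Y = 1 | X = x) = 1 / (1 + exp(-2x)), so eta = tanh."""
    return np.tanh(x)

def bayes(x):
    """Bayes classifier t(x) = sgn(eta(x)); here eta(x) >= 0 iff x >= 0."""
    return np.where(eta(x) >= 0, 1, -1)

def shifted(x):
    """A suboptimal competitor: the same rule with the threshold moved to 0.5."""
    return np.where(x >= 0.5, 1, -1)

x, y = sample_pairs(200_000)
for name, clf in (("Bayes t", bayes), ("shifted g", shifted)):
    print(f"{name}: empirical risk {np.mean(clf(x) != y):.4f}")
```

The first estimate should be close to $0.159$ and the second close to $\tfrac{1}{2}(\Phi(-0.5) + \Phi(-1.5)) \approx 0.188$, consistent with $R(t) = \inf_g R(g)$: no measurable classifier can improve on predicting the more probable label at each $x$.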