Our goal is to find this target function t(x). However, since P is unknown, we cannot directly know the value of t at the data points. We can only use the following empirical risk to estimate it:
$$\hat{R}(g) = \frac{1}{n}\sum_{i=1}^{n} I_{\{g(X_i) \neq Y_i\}}.$$
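As a minimal illustration (not from the original text), the empirical 0-1 risk can be computed directly from a sample; the function and variable names below are hypothetical:

```python
import numpy as np

def empirical_risk(g, X, Y):
    """Empirical 0-1 risk: the fraction of points where g(X_i) != Y_i."""
    predictions = np.array([g(x) for x in X])
    return np.mean(predictions != Y)

# Toy example: a threshold classifier on 1-D inputs.
X = np.array([0.2, 0.5, 0.8, 1.1])
Y = np.array([0, 0, 1, 1])
g = lambda x: int(x > 0.6)
print(empirical_risk(g, X, Y))  # -> 0.0, since g labels every point correctly
```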
However, directly minimizing the empirical risk over the entire function class may not be a good choice, due to the risk of overfitting. There are usually two ways of tackling this challenge: the first is to restrict the class of functions over which the minimization is performed, and the second is to modify the criterion to be minimized (e.g., by adding a penalty for complex functions). These two ways can also be combined in some algorithms.
Considering the aforementioned ways of avoiding overfitting, researchers usually design their learning algorithms according to the following paradigms.
Empirical Risk Minimization. Algorithms belonging to this category are the most straightforward yet usually very effective. The idea is to restrict the function class to $\mathcal{G}$ when minimizing the empirical risk. That is,

$$g_n = \arg\min_{g \in \mathcal{G}} \hat{R}(g).$$
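For concreteness, here is a sketch (not from the original text) of empirical risk minimization over a deliberately small class $\mathcal{G}$ of one-dimensional threshold classifiers; all names and the grid of thresholds are illustrative:

```python
import numpy as np

def empirical_risk(g, X, Y):
    return np.mean(np.array([g(x) for x in X]) != Y)

# Restrict G to threshold classifiers g_t(x) = I{x > t} on a fixed grid.
G = [(t, (lambda x, t=t: int(x > t))) for t in np.linspace(0.0, 1.0, 21)]

X = np.array([0.12, 0.33, 0.48, 0.61, 0.77, 0.93])
Y = np.array([0, 0, 0, 1, 1, 1])

# ERM: pick the member of G with the lowest empirical risk.
t_star, g_n = min(G, key=lambda tg: empirical_risk(tg[1], X, Y))
print(t_star, empirical_risk(g_n, X, Y))  # -> 0.5 0.0
```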
Structural Risk Minimization. Algorithms belonging to this category introduce a penalty on the size of the function class. Suppose there is an infinite sequence $\{\mathcal{G}_d : d = 1, 2, \ldots\}$ of function classes with increasing size; the learning algorithm then minimizes the following objective function:

$$g_n = \arg\min_{g \in \mathcal{G}_d,\; d = 1, 2, \ldots} \hat{R}(g) + \mathrm{Pen}(d, n),$$

where the penalty $\mathrm{Pen}(d, n)$ gives preference to models from a smaller function class.
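The sketch below (again illustrative, with a hypothetical penalty $\sqrt{d/n}$) instantiates this paradigm with nested classes $\mathcal{G}_d$ of threshold classifiers whose threshold grids grow with d:

```python
import numpy as np

def empirical_risk(g, X, Y):
    return np.mean(np.array([g(x) for x in X]) != Y)

def make_class(d):
    """G_d: threshold classifiers with thresholds on a grid of 2**d points."""
    return [(lambda x, t=t: int(x > t)) for t in np.linspace(0.0, 1.0, 2 ** d)]

def penalty(d, n):
    return np.sqrt(d / n)  # hypothetical Pen(d, n): grows with d, shrinks with n

X = np.array([0.12, 0.33, 0.48, 0.61, 0.77, 0.93])
Y = np.array([0, 0, 0, 1, 1, 1])
n = len(X)

# Structural risk minimization: search jointly over d and g in G_d.
obj, d_star, g_n = min(
    ((empirical_risk(g, X, Y) + penalty(d, n), d, g)
     for d in range(1, 8) for g in make_class(d)),
    key=lambda item: item[0],
)
print(d_star, obj)  # the smallest d whose class already fits the data well wins
```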
Sometimes, an alternative and easier-to-implement approach to structural risk minimization is to add a regularization term (typically the norm of g) to the objective function, as follows:

$$g_n = \arg\min_{g \in \mathcal{G}} \hat{R}(g) + \lambda \|g\|^2.$$
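A minimal sketch of this paradigm (not from the original text) for a linear scoring function $g_w(x) = \langle w, x \rangle$: since the 0-1 loss is not differentiable, the sketch substitutes the logistic surrogate loss so that gradient descent applies; all names and constants are illustrative:

```python
import numpy as np

def objective(w, X, Y, lam):
    margins = Y * (X @ w)                       # labels Y in {-1, +1}
    return np.mean(np.log1p(np.exp(-margins))) + lam * np.dot(w, w)

def gradient(w, X, Y, lam):
    margins = Y * (X @ w)
    coef = -Y / (1.0 + np.exp(margins))         # derivative of the logistic loss
    return (X.T @ coef) / len(Y) + 2 * lam * w  # surrogate grad + regularizer grad

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
Y = np.sign(X[:, 0] + 0.3 * rng.normal(size=100))

w, lam, lr = np.zeros(2), 0.1, 0.5
for _ in range(200):                            # plain gradient descent
    w -= lr * gradient(w, X, Y, lam)
print(w, objective(w, X, Y, lam))               # lam penalizes large ||w||^2
```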
As can be seen above, the eventual goal is to minimize the expected risk R(g), while what the learning algorithms actually minimize is the empirical risk $\hat{R}(g)$ (or the structural risk). Given the learned function $g_n \in \mathcal{G}$, in order to evaluate whether it is good or not, one still needs to look at $R(g_n)$. However, $R(g_n)$ is a random variable and cannot be computed from the data. Therefore, one can usually only resort to probabilistic bounds when analyzing learning algorithms. In the following paragraphs, we will introduce the basic forms of such bounds.
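As a preview, one standard illustrative result (not taken from this text): for a finite class $\mathcal{G}$, Hoeffding's inequality combined with a union bound yields, with probability at least $1 - \delta$ over the draw of the n samples,

$$\forall g \in \mathcal{G}: \quad R(g) \le \hat{R}(g) + \sqrt{\frac{\log|\mathcal{G}| + \log(1/\delta)}{2n}}.$$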
Here we introduce the best function $g^* \in \mathcal{G}$, with $R(g^*) = \inf_{g \in \mathcal{G}} R(g)$. We then have the decomposition

$$R(g_n) - R^* = \bigl(R(g^*) - R^*\bigr) + \bigl(R(g_n) - R(g^*)\bigr),$$

where $R^*$ denotes the minimal (Bayes) risk, attained by the target function t. The first term, the approximation error, measures how well functions in $\mathcal{G}$ can approach the target; the second term, the estimation error, measures how close $g_n$ comes to the best function within $\mathcal{G}$.