Relation to Ridge Regression
It is easy to show that the solution $w_N$ to minimising
$$\| X_N w - y_N \|^2_{M_N} + \lambda \| w \|^2, \qquad (5.36)$$
($\lambda$ is the positive scalar ridge complexity) with respect to $w$ requires
$$(X_N^T M_N X_N + \lambda I) \, w_N = X_N^T M_N y_N$$
to hold. The above is similar to (5.30) with the additional term $\lambda I$. Hence,
(5.31) still holds when initialised with $\Lambda_0 = \lambda I$, and consequently so does (5.34).
Therefore, initialising $\Lambda_0^{-1} = \delta I$ and applying (5.35), which operates on $\Lambda^{-1}$
rather than $\Lambda$, is equivalent to minimising (5.36) with $\lambda = \delta^{-1}$.
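For completeness, the condition on $w_N$ can be obtained by setting the gradient of (5.36) to zero. The following is a short sketch, assuming that $M_N$ is the symmetric (diagonal) matching matrix with non-negative entries; the name $J$ for the objective is introduced here only for illustration.

```latex
\begin{align*}
J(w) &= (X_N w - y_N)^T M_N (X_N w - y_N) + \lambda \, w^T w, \\
\nabla_w J(w) &= 2 X_N^T M_N (X_N w - y_N) + 2 \lambda w = 0 \\
\Rightarrow \quad (X_N^T M_N X_N + \lambda I) \, w_N &= X_N^T M_N y_N .
\end{align*}
```

Under these assumptions $X_N^T M_N X_N$ is positive semi-definite, so adding $\lambda I$ with $\lambda > 0$ makes the matrix on the left positive definite and the solution unique.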
In addition to the matching-weighted squared error, (5.36) penalises the size
of $w$. This approach is known as ridge regression and was originally introduced
to work around the problem of a singular $X_N^T M_N X_N$ for small $N$, which
prohibits the solution of (5.30). However, minimising (5.36) rather than (5.7) is
also advantageous if the input vectors suffer from a high noise variance, resulting
in large $w$ and a poor model of the real data-generating process. Essentially, ridge
regression assumes that the size of $w$ is small and hence computes better model
parameters for noisy data, given that the inputs are normalised [102, Chap. 3].
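The singularity problem for small $N$ can be illustrated numerically. The sketch below is not taken from the text: the dimensions, the random data, the matching values (all set to 1) and the value of the ridge complexity are arbitrary assumptions chosen only to show that the ridge term restores full rank.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed, arbitrary choice

D, N = 3, 2                      # more input dimensions than observations
X = rng.normal(size=(N, D))      # rows are the input vectors x_n^T
y = rng.normal(size=N)           # outputs y_n
m = np.ones(N)                   # matching values m(x_n), assumed 1 here
M = np.diag(m)                   # diagonal matching matrix M_N

A = X.T @ M @ X                  # X_N^T M_N X_N has rank at most N < D
rank_plain = np.linalg.matrix_rank(A)

lam = 0.1                        # ridge complexity lambda (assumed value)
A_ridge = A + lam * np.eye(D)    # X_N^T M_N X_N + lambda I
rank_ridge = np.linalg.matrix_rank(A_ridge)

# With the ridge term the linear system has a unique solution.
w_ridge = np.linalg.solve(A_ridge, X.T @ M @ y)
```

Without the ridge term the matrix cannot be inverted in this setting, whereas `A_ridge` has full rank `D` for any positive `lam`.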
To summarise, using the RLS algorithm (5.34) and (5.35) with $\Lambda_0^{-1} = \delta I$, a
classifier performs ridge regression with ridge complexity $\lambda = \delta^{-1}$. Since, by (5.36),
the contribution of the penalty term $\lambda \| w \|^2$ is independent of the number of observations $N$, its
influence decreases exponentially with $N$.
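The equivalence can be checked numerically. The update equations (5.34) and (5.35) are not reproduced in this section, so the sketch below assumes they take the standard matching-weighted RLS form with a Sherman-Morrison update of $\Lambda^{-1}$, and it assumes the weight vector starts at zero. The function names `rls_fit` and `ridge_fit` and the test data are hypothetical.

```python
import numpy as np

def rls_fit(X, y, m, delta):
    """Matching-weighted RLS, started from w_0 = 0 and Lambda_0^{-1} = delta * I.

    Assumed form of the updates (not quoted from the text).
    """
    D = X.shape[1]
    w = np.zeros(D)
    Lam_inv = delta * np.eye(D)
    for x_n, y_n, m_n in zip(X, y, m):
        # Sherman-Morrison step for Lambda <- Lambda + m_n * x_n x_n^T
        Lx = Lam_inv @ x_n
        Lam_inv = Lam_inv - m_n * np.outer(Lx, Lx) / (1.0 + m_n * (x_n @ Lx))
        # Weight update, using the already updated Lambda^{-1}
        w = w + m_n * (Lam_inv @ x_n) * (y_n - w @ x_n)
    return w

def ridge_fit(X, y, m, lam):
    """Batch solution of (X^T M X + lam * I) w = X^T M y with M = diag(m)."""
    D = X.shape[1]
    XtM = X.T * m                      # broadcasting gives X^T M for diagonal M
    return np.linalg.solve(XtM @ X + lam * np.eye(D), XtM @ y)

rng = np.random.default_rng(1)         # arbitrary seed
N, D = 50, 3
X = rng.normal(size=(N, D))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=N)
m = rng.uniform(0.0, 1.0, size=N)      # matching values in [0, 1]

delta = 10.0
w_rls = rls_fit(X, y, m, delta)
w_ridge = ridge_fit(X, y, m, lam=1.0 / delta)
```

Under these assumptions `w_rls` and `w_ridge` agree up to floating-point error, which mirrors the statement that initialising with $\delta I$ corresponds to a ridge complexity of $\delta^{-1}$.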
A Recency-Weighted Variant
While the RLS algorithm provides a recursive solution such that (5.16) holds,
it weights all observations equally. Nonetheless, we might sometimes require
recency-weighting, such as when using LCS in combination with reinforcement
learning. Hence, let us derive a variant of RLS that applies a scalar decay factor
$0 \le \lambda \le 1$ to past observations.
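More formally, after $N$ observations, we aim at minimising
$$\sum_{n=1}^{N} m(x_n) \, \lambda^{\sum_{j=n+1}^{N} m(x_j)} \, (w^T x_n - y_n)^2 = \| X_N w - y_N \|^2_{M_N^\lambda}$$
with respect to $w$, where the $\lambda$-augmented diagonal matching matrix $M_N^\lambda$ is
given by
$$M_N^\lambda = \begin{pmatrix}
m(x_1) \, \lambda^{\sum_{j=2}^{N} m(x_j)} & & & 0 \\
& m(x_2) \, \lambda^{\sum_{j=3}^{N} m(x_j)} & & \\
& & \ddots & \\
0 & & & m(x_N)
\end{pmatrix}$$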
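To make the structure of this matrix concrete, the following sketch computes its diagonal for a given sequence of matching values. The function name `recency_weights` and the example inputs are assumptions made for illustration; the formula for each entry is the one given above.

```python
import numpy as np

def recency_weights(m, lam):
    """Diagonal of the lambda-augmented matching matrix.

    Entry n is m(x_n) * lam ** (sum of m(x_j) over all later observations j > n).
    """
    m = np.asarray(m, dtype=float)
    # tail[n] = m(x_{n+1}) + ... + m(x_N); zero for the most recent observation
    tail = np.cumsum(m[::-1])[::-1] - m
    return m * lam ** tail

# All three observations fully matched: weights decay with age.
weights_all = recency_weights([1.0, 1.0, 1.0], lam=0.5)   # [0.25, 0.5, 1.0]

# The second observation is not matched: it gets weight 0 and
# does not contribute to the decay of the first observation.
weights_gap = recency_weights([1.0, 0.0, 1.0], lam=0.5)   # [0.5, 0.0, 1.0]
```

The second call shows that the decay is driven by the matching values of later observations rather than by their count, so observations the classifier does not match leave the weights of earlier observations unchanged.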