Relation to Ridge Regression
It is easy to show that the solution $\mathbf{w}_N$ to minimising
\[
\|\mathbf{X}_N \mathbf{w} - \mathbf{y}_N\|^2_{\mathbf{M}_N} + \lambda \|\mathbf{w}\|^2
\tag{5.36}
\]
($\lambda$ is the positive scalar ridge complexity) with respect to $\mathbf{w}$ requires
\[
\left(\mathbf{X}_N^T \mathbf{M}_N \mathbf{X}_N + \lambda \mathbf{I}\right) \mathbf{w}_N = \mathbf{X}_N^T \mathbf{M}_N \mathbf{y}_N
\tag{5.37}
\]
to hold. The above is similar to (5.30) with the additional term $\lambda \mathbf{I}$. Hence, (5.31) still holds when initialised with $\mathbf{\Lambda}_0 = \lambda \mathbf{I}$, and consequently so does (5.34). Therefore, initialising $\mathbf{\Lambda}_0^{-1} = \delta \mathbf{I}$ to apply (5.35) and operate on $\mathbf{\Lambda}^{-1}$ rather than $\mathbf{\Lambda}$ is equivalent to minimising (5.36) with $\lambda = \delta^{-1}$.
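The equivalence between (5.36) and the normal equations (5.37) can be checked numerically. The following sketch (assuming NumPy, with randomly drawn inputs, outputs and matching values standing in for a concrete classifier) solves (5.37) directly and compares the result against an ordinary least-squares solve of the $\lambda$-augmented system, which minimises exactly the objective (5.36):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 20, 3
X = rng.normal(size=(N, D))        # input matrix X_N
y = rng.normal(size=N)             # output vector y_N
m = rng.uniform(0.1, 1.0, size=N)  # matching values m(x_n), diagonal of M_N
lam = 0.5                          # ridge complexity lambda

# Solve the normal equations (5.37): (X^T M X + lam I) w = X^T M y
M = np.diag(m)
w = np.linalg.solve(X.T @ M @ X + lam * np.eye(D), X.T @ M @ y)

# Cross-check: (5.36) is an ordinary least-squares problem on the
# augmented system [sqrt(M) X; sqrt(lam) I] w = [sqrt(M) y; 0]
A = np.vstack([np.sqrt(m)[:, None] * X, np.sqrt(lam) * np.eye(D)])
b = np.concatenate([np.sqrt(m) * y, np.zeros(D)])
w_ls, *_ = np.linalg.lstsq(A, b, rcond=None)

assert np.allclose(w, w_ls)
```

The augmented-system view also explains why (5.36) always has a unique minimiser: the stacked matrix has full column rank for any $\lambda > 0$.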
In addition to the matching-weighted squared error, (5.36) penalises the size of $\mathbf{w}$. This approach is known as ridge regression and was initially introduced to work around the problem of an initially singular $\mathbf{X}_N^T \mathbf{M}_N \mathbf{X}_N$ for small $N$, which prohibited the solution of (5.30). However, minimising (5.36) rather than (5.7) is also advantageous if the input vectors suffer from a high noise variance, which results in a large $\mathbf{w}$ and a bad model for the real data-generating process. Essentially, ridge regression assumes that the size of $\mathbf{w}$ is small and hence computes better model parameters for noisy data, given that the inputs are normalised [102, Chap. 3].
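The singularity problem is easy to reproduce: with fewer observations than input dimensions, $\mathbf{X}_N^T \mathbf{M}_N \mathbf{X}_N$ has rank at most $N$ and cannot be inverted, whereas adding $\lambda \mathbf{I}$ restores full rank for any $\lambda > 0$. A minimal sketch, assuming NumPy and that all inputs are matched:

```python
import numpy as np

rng = np.random.default_rng(1)
N, D = 2, 5                   # fewer observations than input dimensions
X = rng.normal(size=(N, D))
M = np.eye(N)                 # all inputs matched: m(x_n) = 1

A = X.T @ M @ X               # rank at most N < D, hence singular
assert np.linalg.matrix_rank(A) < D

lam = 1e-3                    # any positive ridge complexity suffices
assert np.linalg.matrix_rank(A + lam * np.eye(D)) == D

# With the ridge term, the system (5.37) has a unique solution
y = rng.normal(size=N)
w = np.linalg.solve(A + lam * np.eye(D), X.T @ M @ y)
```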
To summarise, using the RLS algorithm (5.34) and (5.35) with $\mathbf{\Lambda}_0^{-1} = \delta \mathbf{I}$, a classifier performs ridge regression with ridge complexity $\lambda = \delta^{-1}$. As, by (5.36), the contribution of $\lambda \|\mathbf{w}\|^2$ is independent of the number of observations $N$, its influence decreases exponentially with $N$.
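This correspondence can be verified numerically. The following sketch (assuming NumPy, with the RLS updates written in the standard matching-weighted form of (5.34) and (5.35), and random data standing in for real observations) runs the recursive algorithm initialised with $\mathbf{\Lambda}_0^{-1} = \delta \mathbf{I}$, $\mathbf{w}_0 = \mathbf{0}$, and compares the result with the batch ridge solution of (5.37) for $\lambda = \delta^{-1}$:

```python
import numpy as np

rng = np.random.default_rng(2)
N, D = 30, 4
X = rng.normal(size=(N, D))
y = rng.normal(size=N)
m = rng.uniform(0.1, 1.0, size=N)   # matching values m(x_n)

delta = 100.0
P = delta * np.eye(D)               # P tracks Lambda^{-1}, initialised delta * I
w = np.zeros(D)

# Recursive updates: covariance form (5.35), then weight update (5.34)
for n in range(N):
    x = X[n]
    P = P - m[n] * np.outer(P @ x, x @ P) / (1.0 + m[n] * x @ P @ x)
    w = w + m[n] * (P @ x) * (y[n] - w @ x)

# Batch ridge solution of (5.37) with lambda = 1 / delta
lam = 1.0 / delta
M = np.diag(m)
w_ridge = np.linalg.solve(X.T @ M @ X + lam * np.eye(D), X.T @ M @ y)

assert np.allclose(w, w_ridge)
```

The check works because the recursion maintains $\mathbf{\Lambda}_N = \delta^{-1} \mathbf{I} + \mathbf{X}_N^T \mathbf{M}_N \mathbf{X}_N$, which is exactly the left-hand side matrix of (5.37).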
A Recency-Weighted Variant
While the RLS algorithm provides a recursive solution such that (5.16) holds,
it weights all observations equally. Nonetheless, we might sometimes require
recency-weighting, such as when using LCS in combination with reinforcement
learning. Hence, let us derive a variant of RLS that applies a scalar decay factor $0 \le \lambda \le 1$ to past observations.
More formally, after $N$ observations, we aim at minimising
\[
\sum_{n=1}^{N} m(\mathbf{x}_n) \, \lambda^{\sum_{j=n+1}^{N} m(\mathbf{x}_j)} \left(\mathbf{w}^T \mathbf{x}_n - y_n\right)^2 = \|\mathbf{X}_N \mathbf{w} - \mathbf{y}_N\|^2_{\mathbf{M}_N}
\tag{5.38}
\]
with respect to $\mathbf{w}$, where the $\lambda$-augmented diagonal matching matrix $\mathbf{M}_N$ is given by
\[
\mathbf{M}_N =
\begin{pmatrix}
m(\mathbf{x}_1)\,\lambda^{\sum_{j=2}^{N} m(\mathbf{x}_j)} & & & 0 \\
& m(\mathbf{x}_2)\,\lambda^{\sum_{j=3}^{N} m(\mathbf{x}_j)} & & \\
& & \ddots & \\
0 & & & m(\mathbf{x}_N)
\end{pmatrix}
\tag{5.39}
\]
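The structure of (5.39) can be made concrete in a short sketch (assuming NumPy and, for illustration, binary matching values): it builds the $\lambda$-augmented diagonal of $\mathbf{M}_N$ and confirms that the term-by-term sum on the left of (5.38) equals the matching-weighted squared norm on the right:

```python
import numpy as np

rng = np.random.default_rng(3)
N, D = 10, 3
X = rng.normal(size=(N, D))
y = rng.normal(size=N)
m = rng.integers(0, 2, size=N).astype(float)   # binary matching m(x_n)
lam = 0.9                                       # decay factor, 0 <= lam <= 1
w = rng.normal(size=D)                          # arbitrary weight vector

# Diagonal of (5.39): n-th entry is m(x_n) * lam ** sum_{j=n+1}^{N} m(x_j),
# so the most recent matched observation carries full weight
diag = np.array([m[n] * lam ** m[n + 1:].sum() for n in range(N)])
M = np.diag(diag)

# Left-hand side of (5.38), summed observation by observation
lhs = sum(m[n] * lam ** m[n + 1:].sum() * (w @ X[n] - y[n]) ** 2
          for n in range(N))

# Right-hand side: ||X w - y||^2 weighted by the augmented M_N
r = X @ w - y
rhs = r @ M @ r

assert np.isclose(lhs, rhs)
```

Note that the decay exponent counts only *matched* past observations, so unmatched inputs do not age a classifier's data.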