where $\mathbf{H}$ is the Hessian matrix whose elements comprise the second derivatives of $E(\mathbf{V})$, and $\nabla E(\mathbf{V})$ is the gradient vector of $E(\mathbf{V})$ with respect to $\mathbf{V}$. Even though not immediately obvious, its name derives from a reformulation of the update procedure that reveals that, at each update step, the algorithm solves a weighted least squares problem whose weights change from step to step [19].
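To make this reformulation concrete, consider the simpler two-class case with a single weight vector $\mathbf{v}$ and logistic gating (the multi-class case below is analogous, with block matrices). Following the standard derivation, and writing $\boldsymbol{\Phi}$ for the matrix whose rows are the $\boldsymbol{\phi}(\mathbf{x}_n)^{\mathsf{T}}$, $\mathbf{g}$ and $\mathbf{r}$ for the vectors of gating outputs and targets, and $R = \operatorname{diag}\big(g_n(1 - g_n)\big)$, the Newton update can be rewritten as

$$\mathbf{v}_{\text{new}} = \mathbf{v} - \big(\boldsymbol{\Phi}^{\mathsf{T}} R \boldsymbol{\Phi}\big)^{-1} \boldsymbol{\Phi}^{\mathsf{T}} (\mathbf{g} - \mathbf{r}) = \big(\boldsymbol{\Phi}^{\mathsf{T}} R \boldsymbol{\Phi}\big)^{-1} \boldsymbol{\Phi}^{\mathsf{T}} R \mathbf{z}, \qquad \mathbf{z} = \boldsymbol{\Phi} \mathbf{v} - R^{-1} (\mathbf{g} - \mathbf{r}),$$

that is, as the normal-equations solution of a least squares problem with weights $R$ and targets $\mathbf{z}$, both of which depend on the current $\mathbf{v}$ and therefore change at every step.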
As we want to maximise (6.3), our function to minimise is the cross-entropy
$$E(\mathbf{V}) = -\sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \ln g_k(\mathbf{x}_n). \qquad (6.6)$$
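As a concrete illustration (not code from this text), the following NumPy sketch evaluates (6.6), assuming that $g_k$ is the generalised softmax $g_k(\mathbf{x}) \propto \exp(\mathbf{v}_k^{\mathsf{T}} \boldsymbol{\phi}(\mathbf{x}))$ introduced earlier in this chapter; all function and variable names are illustrative:

```python
import numpy as np

def gating(V, Phi):
    """Generalised softmax g_k(x_n) for all n and k.

    V   : (K, D_V) mixing weight vectors v_k as rows.
    Phi : (N, D_V) feature vectors phi(x_n) as rows.
    Returns G of shape (N, K), each row summing to 1.
    """
    A = Phi @ V.T                      # activations v_k^T phi(x_n), shape (N, K)
    A -= A.max(axis=1, keepdims=True)  # subtract row maximum for numerical stability
    G = np.exp(A)
    return G / G.sum(axis=1, keepdims=True)

def cross_entropy(V, Phi, R):
    """Error E(V) of (6.6); R is the (N, K) matrix of responsibilities r_nk."""
    G = gating(V, Phi)
    return -np.sum(R * np.log(np.clip(G, 1e-12, None)))
```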
The gradient of $g_k$ with respect to $\mathbf{v}_j$ is

$$\nabla_{\mathbf{v}_j} g_k(\mathbf{x}) = g_k(\mathbf{x}) \left( I_{kj} - g_j(\mathbf{x}) \right) \boldsymbol{\phi}(\mathbf{x}), \qquad (6.7)$$
and, thus, the gradient of $E(\mathbf{V})$ evaluates to

$$\nabla_{\mathbf{V}} E(\mathbf{V}) = \begin{pmatrix} \nabla_{\mathbf{v}_1} E(\mathbf{V}) \\ \vdots \\ \nabla_{\mathbf{v}_K} E(\mathbf{V}) \end{pmatrix}, \qquad \nabla_{\mathbf{v}_j} E(\mathbf{V}) = \sum_{n=1}^{N} \left( g_j(\mathbf{x}_n) - r_{nj} \right) \boldsymbol{\phi}(\mathbf{x}_n), \qquad (6.8)$$
where we have used $\sum_k g_k(\mathbf{x}) = 1$.
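Continuing the sketch above (again with illustrative names), the stacked gradient (6.8) reduces to a single matrix product:

```python
def gradient(V, Phi, R):
    """Gradient (6.8): row j is nabla_{v_j} E(V) = sum_n (g_j(x_n) - r_nj) phi(x_n).

    Returns a (K, D_V) array; ravel it to obtain the stacked K*D_V vector.
    """
    G = gating(V, Phi)    # gating() as sketched after (6.6)
    return (G - R).T @ Phi
```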
The Hessian matrix

$$\mathbf{H} = \begin{pmatrix} \mathbf{H}_{11} & \cdots & \mathbf{H}_{1K} \\ \vdots & \ddots & \vdots \\ \mathbf{H}_{K1} & \cdots & \mathbf{H}_{KK} \end{pmatrix} \qquad (6.9)$$
is constructed by evaluating its $D_V \times D_V$ blocks
$$\mathbf{H}_{kj} = \mathbf{H}_{jk} = \sum_{n=1}^{N} g_k(\mathbf{x}_n) \left( I_{kj} - g_j(\mathbf{x}_n) \right) \boldsymbol{\phi}(\mathbf{x}_n) \boldsymbol{\phi}(\mathbf{x}_n)^{\mathsf{T}}, \qquad (6.10)$$
that result from $\mathbf{H}_{kj} = \nabla_{\mathbf{v}_k} \nabla_{\mathbf{v}_j} E(\mathbf{V})$.
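A matching sketch for (6.9) and (6.10), assembling the full $K D_V \times K D_V$ matrix from its $D_V \times D_V$ blocks (names again illustrative):

```python
def hessian(V, Phi):
    """Block Hessian H of (6.9), with blocks H_kj computed as in (6.10)."""
    G = gating(V, Phi)
    D_V = Phi.shape[1]
    K = V.shape[0]
    H = np.zeros((K * D_V, K * D_V))
    for k in range(K):
        for j in range(k, K):          # H_kj = H_jk, so fill the upper triangle
            w = G[:, k] * ((k == j) - G[:, j])   # per-observation weights g_k (I_kj - g_j)
            block = Phi.T @ (w[:, None] * Phi)   # sum_n w_n phi(x_n) phi(x_n)^T
            H[k*D_V:(k+1)*D_V, j*D_V:(j+1)*D_V] = block
            H[j*D_V:(j+1)*D_V, k*D_V:(k+1)*D_V] = block.T
    return H
```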
To summarise the IRLS algorithm: given $N$ observations $\mathcal{D} = \{\mathbf{X}, \mathbf{Y}\}$ and knowledge of the classifier parameters $\{\boldsymbol{\theta}_1, \dots, \boldsymbol{\theta}_K\}$ to evaluate $p(y \,|\, \mathbf{x}, \boldsymbol{\theta}_k)$, we can incrementally improve the estimate $\mathbf{V}$ by repeatedly performing (6.5), starting with arbitrary initial values for $\mathbf{V}$. As the Hessian matrix $\mathbf{H}$ given by (6.9) is positive definite [19], the error function $E(\mathbf{V})$ is convex, and the IRLS algorithm will approach its unique minimum, although not monotonically [119]. Thus, $E(\mathbf{V})$ as given by (6.6) decreases over the iterations and can be used to monitor the convergence of the algorithm.
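Putting the pieces together, here is a hedged sketch of the complete procedure, assuming that (6.5) is the Newton-Raphson update $\mathbf{V} \leftarrow \mathbf{V} - \mathbf{H}^{-1} \nabla_{\mathbf{V}} E(\mathbf{V})$ and using (6.6) to monitor convergence, as suggested above:

```python
def irls(Phi, R, max_iter=100, tol=1e-6):
    """Fit the mixing weights V by repeating the update (6.5).

    Phi : (N, D_V) feature matrix, R : (N, K) responsibilities r_nk.
    """
    D_V = Phi.shape[1]
    K = R.shape[1]
    V = np.zeros((K, D_V))             # arbitrary initial values for V
    E_old = cross_entropy(V, Phi, R)
    for _ in range(max_iter):
        grad = gradient(V, Phi, R).ravel()     # stacked gradient of size K * D_V
        H = hessian(V, Phi)
        # Solve H step = grad rather than inverting H explicitly; the tiny
        # ridge is a numerical safeguard, not part of the original update.
        step = np.linalg.solve(H + 1e-8 * np.eye(K * D_V), grad)
        V = V - step.reshape(K, D_V)
        E_new = cross_entropy(V, Phi, R)
        if abs(E_old - E_new) < tol:           # E(V) as convergence monitor
            break
        E_old = E_new
    return V
```

Solving the linear system instead of forming $\mathbf{H}^{-1}$ is the standard numerical choice, but it is still cubic in $K D_V$, which is exactly the cost highlighted in the following paragraph.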
Note, however, that by (6.5), a single step of the algorithm requires computing the gradient $\nabla_{\mathbf{V}} E(\mathbf{V})$ of size $K D_V$ and the $K D_V \times K D_V$ Hessian matrix $\mathbf{H}$, as well as inverting the latter. Due to this inversion, a single iteration of the algorithm scales cubically in $K D_V$.