where H is the Hessian matrix whose elements comprise the second derivatives of E(V), and \nabla E(V) is the gradient vector of E(V) with respect to V. Even though not immediately obvious, its name derives from a reformulation of the update procedure that reveals that, at each update step, the algorithm solves a weighted least squares problem where the weights change at each step [19].
As we want to maximise (6.3), our function to minimise is the cross-entropy

E(V) = -\sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \ln g_k(x_n).    (6.6)
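As a concrete illustration, the following sketch (not from the original text) evaluates the mixing weights and the cross-entropy (6.6) numerically. It assumes the generalised softmax form g_k(x) = exp(v_k^T φ(x)) / Σ_j exp(v_j^T φ(x)) for the mixing model; the array shapes and helper names are illustrative assumptions.

```python
# Minimal sketch of the cross-entropy (6.6), assuming the generalised softmax
# mixing model g_k(x) = exp(v_k^T phi(x)) / sum_j exp(v_j^T phi(x)).
# Shapes: Phi is (N, D_V) with rows phi(x_n)^T, V is (K, D_V), R is (N, K).
import numpy as np

def mixing_weights(V, Phi):
    """Return the (N, K) matrix of g_k(x_n) values."""
    A = Phi @ V.T                        # activations v_k^T phi(x_n)
    A -= A.max(axis=1, keepdims=True)    # stabilise the exponentials
    G = np.exp(A)
    return G / G.sum(axis=1, keepdims=True)

def cross_entropy(V, Phi, R):
    """E(V) = -sum_n sum_k r_nk ln g_k(x_n), Eq. (6.6)."""
    G = mixing_weights(V, Phi)
    return -np.sum(R * np.log(np.clip(G, 1e-12, None)))
```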
The gradient of g_k with respect to v_j is

\nabla_{v_j} g_k(x) = g_k(x) \left( I_{kj} - g_j(x) \right) \phi(x),    (6.7)
and, thus, the gradient of E(V) evaluates to

\nabla_V E(V) = \begin{pmatrix} \nabla_{v_1} E(V) \\ \vdots \\ \nabla_{v_K} E(V) \end{pmatrix}, \qquad \nabla_{v_j} E(V) = \sum_{n=1}^{N} \left( g_j(x_n) - r_{nj} \right) \phi(x_n),    (6.8)

where we have used \sum_k g_k(x) = 1.
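Under the same assumptions, the stacked gradient (6.8) can be computed in a few lines; the sketch below reuses the hypothetical mixing_weights helper from the previous sketch.

```python
# Sketch of the stacked gradient (6.8); Phi is (N, D_V), V is (K, D_V),
# R is (N, K). Reuses the mixing_weights helper defined above.
import numpy as np

def gradient(V, Phi, R):
    """Return nabla_V E(V) as a vector of length K * D_V."""
    G = mixing_weights(V, Phi)           # g_j(x_n) for all n, j
    # nabla_{v_j} E(V) = sum_n (g_j(x_n) - r_nj) phi(x_n)
    return ((G - R).T @ Phi).reshape(-1)
```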
The Hessian matrix

H = \begin{pmatrix} H_{11} & \cdots & H_{1K} \\ \vdots & \ddots & \vdots \\ H_{K1} & \cdots & H_{KK} \end{pmatrix},    (6.9)
is constructed by evaluating its D_V \times D_V blocks

H_{kj} = H_{jk} = \sum_{n=1}^{N} g_k(x_n) \left( I_{kj} - g_j(x_n) \right) \phi(x_n) \phi(x_n)^T,    (6.10)

that result from H_{kj} = \nabla_{v_k} \nabla_{v_j} E(V).
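The block structure of (6.9) and (6.10) translates directly into code. The sketch below assembles the full KD_V × KD_V matrix block by block, writing the Kronecker delta I_kj as (k == j); as before, mixing_weights is the hypothetical helper from the first sketch.

```python
# Sketch of the Hessian (6.9), built from its D_V x D_V blocks (6.10).
import numpy as np

def hessian(V, Phi):
    """Return the (K * D_V, K * D_V) Hessian of E(V)."""
    G = mixing_weights(V, Phi)
    N, D_V = Phi.shape
    K = G.shape[1]
    H = np.zeros((K * D_V, K * D_V))
    for k in range(K):
        for j in range(K):
            # H_kj = sum_n g_k(x_n) (I_kj - g_j(x_n)) phi(x_n) phi(x_n)^T
            w = G[:, k] * ((k == j) - G[:, j])
            H[k * D_V:(k + 1) * D_V, j * D_V:(j + 1) * D_V] = (Phi * w[:, None]).T @ Phi
    return H
```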
To summarise the IRLS algorithm, given N observations D = {X, Y}, and knowledge of the classifier parameters {θ_1, ..., θ_K} to evaluate p(y | x, θ_k), we can incrementally improve the estimate of V by repeatedly performing (6.5), starting with arbitrary initial values for V. As the Hessian matrix H given by (6.9) is positive definite [19], the error function E(V) is convex, and the IRLS algorithm will approach its unique minimum, although not monotonically [119]. Thus, E(V) as given by (6.6) will decrease, and can be used to monitor convergence of the algorithm.
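To make this summary concrete, here is a sketch of the resulting iteration, under the assumption that (6.5) is the Newton-Raphson update V ← V − H^{-1} ∇_V E(V); the helpers are the hypothetical ones from the earlier sketches, and the small ridge term added before solving is a numerical safeguard, not part of the algorithm as described.

```python
# Sketch of the IRLS loop summarised above, assuming (6.5) is the
# Newton-Raphson update V <- V - H^{-1} nabla_V E(V). Reuses mixing_weights,
# cross_entropy, gradient and hessian from the earlier sketches; E(V) is
# monitored to detect convergence.
import numpy as np

def irls(Phi, R, K, max_iter=100, tol=1e-8):
    N, D_V = Phi.shape
    V = np.zeros((K, D_V))               # arbitrary initial values for V
    E_prev = cross_entropy(V, Phi, R)
    for _ in range(max_iter):
        g = gradient(V, Phi, R)          # length K * D_V
        H = hessian(V, Phi)              # (K * D_V) x (K * D_V)
        # Solve H delta = g rather than inverting H explicitly; the tiny
        # ridge term guards against numerical singularity (an assumption,
        # not part of the text).
        delta = np.linalg.solve(H + 1e-10 * np.eye(H.shape[0]), g)
        V = V - delta.reshape(K, D_V)
        E = cross_entropy(V, Phi, R)
        if abs(E_prev - E) < tol:        # E(V) decreases towards the minimum
            break
        E_prev = E
    return V
```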
Note, however, that by (6.5), a single step of the algorithm requires computation of the gradient \nabla_V E(V) of size KD_V, the KD_V \times KD_V Hessian matrix H, and the inversion of the latter. Due to this inversion, a single iteration of the
 