each of these problems operates on an input space of dimensionality D_V, and
hence, using the least squares methods introduced in the previous chapter,
solving them has either complexity O(NKD_V) for the batch solution, or
O(KD_V) for each step of the incremental solution. Given that we usually have
D_V = 1 in LCS, this is certainly an appealing property.
When minimising (6.15) it is essential to consider that the values for r_nk,
given by (6.4), depend on the current v_k of all classifiers. Consequently,
when performing batch learning, it is not sufficient to solve all K least
squares problems only once, as the corresponding targets change with the
updated values of V. Thus, one again needs to repeatedly update the estimate
of V until the cross-entropy (6.6) converges. A sketch of this batch procedure
is given below.
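The following Python sketch makes the iteration concrete. Since the exact
forms of (6.4), (6.6) and (6.15) are not reproduced on this page, the
generalised softmax mixing, the responsibilities R, the targets
ln(r_nk / m_k(x_n)), and all function and variable names are illustrative
assumptions rather than the book's own procedure.

import numpy as np

def batch_mixing_ls(Phi, M, P, max_iter=50, tol=1e-6):
    # Phi: (N, D_V) rows phi(x_n); M: (N, K) matching values m_k(x_n);
    # P:   (N, K) classifier densities p(y_n | x_n, theta_k), taken as given.
    D_V = Phi.shape[1]
    K = M.shape[1]
    V = np.zeros((K, D_V))
    prev_xent = np.inf
    for _ in range(max_iter):
        # generalised softmax mixing weights g_k(x_n) (assumed form)
        A = M * np.exp(Phi @ V.T)
        G = A / A.sum(axis=1, keepdims=True)
        # responsibilities r_nk computed from the current V (cf. (6.4))
        R = G * P
        R /= R.sum(axis=1, keepdims=True)
        # K weighted least-squares problems as in (6.15): weights m_k(x_n),
        # targets ln(r_nk / m_k(x_n)); rows with m_k(x_n) = 0 contribute nothing
        for k in range(K):
            w = M[:, k]
            t = np.log(np.clip(R[:, k], 1e-12, None) / np.clip(M[:, k], 1e-12, None))
            Pw = Phi * w[:, None]
            V[k] = np.linalg.solve(Pw.T @ Phi + 1e-10 * np.eye(D_V), Pw.T @ t)
        # repeat until the cross-entropy (6.6) stops improving
        xent = -np.sum(R * np.log(np.clip(G, 1e-12, None)))
        if abs(prev_xent - xent) < tol:
            break
        prev_xent = xent
    return V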
On the other hand, when using recursive least squares to provide an
incremental approximation of V, we need to honour the non-stationarity of the
target values by using the recency-weighted RLS variant. Hence, according to
Sect. 5.3.5, the update equations take the form
v_{k,N+1} = \lambda^{m_k(x_n)} v_{kN}
    + m_k(x_{N+1}) \Lambda_{k,N+1}^{-1} \phi(x_{N+1})
      \left( \ln \frac{r_{nk}}{m_k(x_n)} - v_{kN}^T \phi(x_{N+1}) \right),                  (6.16)

\Lambda_{k,N+1}^{-1} = \lambda^{-m_k(x_{N+1})} \Lambda_{kN}^{-1}
    - m_k(x_{N+1}) \lambda^{-m_k(x_{N+1})}
      \frac{\Lambda_{kN}^{-1} \phi(x_{N+1}) \phi(x_{N+1})^T \Lambda_{kN}^{-1}}
           {\lambda^{m_k(x_n)} + m_k(x_{N+1}) \phi(x_{N+1})^T \Lambda_{kN}^{-1} \phi(x_{N+1})},    (6.17)
where the v_k's and \Lambda_k's are initialised to v_{k0} = 0 and
\Lambda_{k0}^{-1} = \delta I for all k, with \delta being a large scalar. In
[121], Jordan and Jacobs initially set \lambda = 0.99 and increased it by a
fixed fraction (0.6) of its remaining distance to 1.0 every 1000 updates. This
seems a sensible approach to start with, but further empirical experience is
required to make definite recommendations.
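The per-classifier update can be sketched as follows; it follows (6.16) and
(6.17) as reconstructed above, together with the initialisation and the
\lambda-schedule just described. The class name, the default \delta = 1000,
and the way the target ln(r_k / m_k(x)) is supplied by the caller are
illustrative assumptions, not taken from the book.

import numpy as np

class MixingRLS:
    # Recency-weighted RLS for the mixing vector v_k of a single classifier k,
    # following (6.16) and (6.17). Names and defaults are illustrative.

    def __init__(self, d_v, delta=1000.0, lam=0.99):
        self.v = np.zeros(d_v)                # v_k0 = 0
        self.Lam_inv = delta * np.eye(d_v)    # Lambda_k0^{-1} = delta * I, delta large
        self.lam = lam                        # forgetting factor lambda

    def update(self, phi, m, r):
        # phi: phi(x_{N+1}); m: m_k(x_{N+1}); r: responsibility for this input
        if m == 0.0:
            return                            # non-matching inputs leave v_k unchanged
        target = np.log(max(r, 1e-12) / m)    # least-squares target ln(r / m_k(x))
        lam_m = self.lam ** m
        Li_phi = self.Lam_inv @ phi
        denom = lam_m + m * (phi @ Li_phi)
        # (6.17): recency-weighted update of Lambda_k^{-1}
        self.Lam_inv = (self.Lam_inv - m * np.outer(Li_phi, Li_phi) / denom) / lam_m
        # (6.16): update of v_k, using the already updated Lambda_k^{-1}
        self.v = lam_m * self.v + m * (self.Lam_inv @ phi) * (target - self.v @ phi)

    def anneal(self, fraction=0.6):
        # Jordan & Jacobs [121]: every 1000 updates move lambda a fixed
        # fraction of its remaining distance towards 1.0
        self.lam += fraction * (1.0 - self.lam)

In use, one such object would be kept per classifier, with anneal() called
every 1000 updates, following the schedule described above.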
As pointed out by Jordan and Jacobs [121], approximating the values of V by
least squares does not result in the same parameter estimates as when using the
IRLS algorithm, due to the use of least squares rather than maximum likelihood.
In fact, the least squares approach can be seen as an approximation to the
maximum likelihood solution under the assumption that the residual in (6.15) is
small, which is equivalent to assuming that the LCS model can fit the underlying
regression surface and that the noise is small. Nonetheless, they demonstrate
empirically that the least squares approach provides good results even when the
residual is large in the early stages of training [121]. In any case, in terms of
complexity it is a very appealing alternative to the IRLS algorithm.
6.2 Heuristic-Based Mixing Models
While the IRLS algorithm minimises (6.6), it does not scale well with the number
of classifiers. The least squares approximation, on the other hand, scales well,
but minimises (6.15) instead of (6.6), which does not always give good results,