model but it is less clear how to separate the global model into local classifier
models. Maximising the likelihood for such a model results in the least-squares
problem (6.35) with $f(\mathbf{x}; \theta) = \sum_k g_k(\mathbf{x}) \mathbf{w}_k^T \mathbf{x}$, the solution
to which has been discussed in the previous chapter.
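As a rough illustration of this global least-squares view, the following Python sketch (shapes, names, and the `mixing` callable are assumed for illustration, not taken from the text) computes the combined prediction $f(\mathbf{x}; \theta) = \sum_k g_k(\mathbf{x}) \mathbf{w}_k^T \mathbf{x}$ and the squared-error objective that the likelihood maximisation reduces to.

```python
import numpy as np

# Minimal sketch (assumed shapes and names): the global model predicts
# f(x; theta) = sum_k g_k(x) w_k^T x, and maximising the likelihood reduces
# to minimising the squared error of this combined prediction.

def global_prediction(x, W, G):
    """x: (D,) input; W: (K, D) classifier weight vectors; G: (K,) mixing weights g_k(x)."""
    return float(G @ (W @ x))  # sum_k g_k(x) * (w_k^T x)

def least_squares_objective(X, y, W, mixing):
    """Sum of squared residuals of the global model; mixing(x) returns the (K,) vector g(x)."""
    preds = np.array([global_prediction(x, W, mixing(x)) for x in X])
    return float(np.sum((np.asarray(y) - preds) ** 2))
```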
To the other extreme, one could from the start assume that the classifiers
are trained independently, such that each of them provides the model $c_k$ with
predictive density $p(y \,|\, \mathbf{x}, c_k)$. The global model is formed by marginalising over
the local models,
$$
p(y \,|\, \mathbf{x}) = \sum_{k=1}^{K} p(y \,|\, \mathbf{x}, c_k)\, p(c_k \,|\, \mathbf{x}),
\qquad (6.37)
$$
where $p(c_k \,|\, \mathbf{x})$ is the probability of the model of classifier $k$ being the "true"
model, given a certain input $\mathbf{x}$. This term can be used to introduce matching by
setting $p(c_k \,|\, \mathbf{x}) = 0$ if $m_k(\mathbf{x}) = 0$. Averaging over models by their probability
is known as Bayesian Model Averaging [107], which might initially appear to result
in the same formulation as the model derived from the generalised MoE model. The
essential difference, however, is that $p(y \,|\, \mathbf{x}, c_k)$ is independent of the model
parameters $\theta_k$, as it marginalises over them,
$$
p(y \,|\, \mathbf{x}, c_k) = \int p(y \,|\, \mathbf{x}, \theta_k, c_k)\, p(\theta_k \,|\, c_k)\, \mathrm{d}\theta_k.
\qquad (6.38)
$$
Therefore, it cannot be directly compared to the mixing models introduced in
this chapter, and should be treated as a different LCS model, closely related to
ensemble learning. Further research is required to see if such an approach leads
to viable LCS formulations.
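To make the distinction concrete, a minimal sketch of the Bayesian Model Averaging view in (6.37) and (6.38) could look as follows, assuming each independently trained classifier supplies a predictive density $p(y \,|\, \mathbf{x}, c_k)$ that is already marginalised over its parameters $\theta_k$; the function and argument names (`predictives`, `matching`, `model_probs`) are illustrative placeholders, not taken from the text.

```python
import numpy as np

# Sketch of the Bayesian Model Averaging view in (6.37)/(6.38), assuming each
# independently trained classifier k supplies a predictive density p(y | x, c_k)
# that has already been marginalised over its parameters theta_k.

def bma_predictive_density(y, x, predictives, matching, model_probs):
    """p(y | x) = sum_k p(y | x, c_k) p(c_k | x), with p(c_k | x) = 0 if m_k(x) = 0.

    predictives: list of callables p_k(y, x) returning the marginal density p(y | x, c_k)
    matching:    list of callables m_k(x) returning 0 or 1
    model_probs: prior model probabilities, renormalised over the matching classifiers
    """
    m = np.array([m_k(x) for m_k in matching], dtype=float)
    post = m * np.asarray(model_probs, dtype=float)  # zero weight for non-matching models
    if post.sum() == 0.0:
        return 0.0                                   # no classifier matches x
    post /= post.sum()                               # p(c_k | x), summing to one
    return float(sum(p_k(y, x) * w for p_k, w in zip(predictives, post)))
```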
6.5 Summary and Outlook
This chapter dealt with an essential LCS component that emerges directly from
the introduced LCS model and is largely ignored by LCS research: how to
combine the set of localised models provided by the classifiers into a global
prediction. The aim of this "mixing problem" was defined as maximising the
data likelihood (6.1) of the previously introduced LCS model.
As was shown, the IRLS algorithm is a possible approach to finding the globally
optimal mixing parameters V of the generalised softmax mixing model, but it
suffers from high complexity and can therefore act as nothing more than a
benchmark against which other approaches are compared. The least squares
approximation, on the other hand, scales well but lacks the desired performance,
as shown in experiments.
As an alternative, heuristics that are inspired by formal properties of mixing
by weighted average have been introduced. Not only do they scale well with the
number of classifiers, as they do not have any adjustable parameters other than
the classifier parameters, but they also perform better than mixing by the least
squares approximation. In particular, mixing by inverse variance makes the fewest
assumptions of the introduced heuristics, and is also the best-performing one.
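For concreteness, a minimal sketch of such an inverse-variance heuristic, under the assumption that each classifier $k$ reports a matching function $m_k$ and an estimate of its noise variance, could assign each matching classifier a mixing weight proportional to the inverse of that variance; the interface below is hypothetical.

```python
import numpy as np

def inverse_variance_mixing(x, matching, variances):
    """Mixing weights g_k(x) proportional to m_k(x) / sigma_k^2, normalised to sum to one.

    matching:  list of callables m_k(x) returning 0 or 1
    variances: per-classifier noise variance estimates sigma_k^2
    """
    m = np.array([m_k(x) for m_k in matching], dtype=float)
    g = m / np.asarray(variances, dtype=float)
    total = g.sum()
    return g / total if total > 0.0 else g  # all-zero weights if no classifier matches
```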
 