as will be demonstrated in Sect. 6.3. As an alternative, this section introduces
some heuristic mixing models that scale linearly with the number of classifiers,
just like the least squares approximation, and feature better performance.
Before discussing different heuristics, let us define the requirements on g_k: to
preserve their probabilistic interpretation, we require g_k(x) ≥ 0 for all k and x,
and Σ_k g_k(x) = 1 for all x. In addition, we need to honour matching, which
means that if m_k(x) = 0, we need to have g_k(x) = 0. These requirements are
met if we define

g_k(x) = \frac{m_k(x)\,\gamma_k(x)}{\sum_{j=1}^K m_j(x)\,\gamma_j(x)} ,    (6.18)
where {γ_k : X → R+} is a set of K functions returning positive scalars that
implicitly rely on the mixing model parameters V. Thus, the mixing model
defines a weighted average, where the weights are specified on one hand by the
matching functions, and on the other hand by the functions γ_k. The heuristics
differ from one another only in how they define the γ_k's.
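As an illustration of how (6.18) turns matching functions and the γ_k's into mixing weights, the following sketch computes g_k(x) for a batch of inputs. It is not part of the original text; the function name mixing_weights and the array layout are assumptions made for this example.

```python
import numpy as np

def mixing_weights(m, gamma):
    """Compute g_k(x) = m_k(x) * gamma_k(x) / sum_j m_j(x) * gamma_j(x), as in (6.18).

    m     : array of shape (K, N) with matching values m_k(x_n)
    gamma : array of shape (K, N) with positive scalars gamma_k(x_n)
    Returns an array of shape (K, N); each column is non-negative and sums to 1
    wherever at least one classifier matches.
    """
    weighted = m * gamma
    norm = weighted.sum(axis=0, keepdims=True)
    # Where no classifier matches, the weights are undefined; return zeros there.
    return np.divide(weighted, norm, out=np.zeros_like(weighted), where=norm > 0)
```

Any positive choice of γ_k plugged into this scheme yields weights that are non-negative, sum to one wherever some classifier matches, and vanish for non-matching classifiers, which is exactly the set of requirements stated above.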
Note that the generalised softmax function (6.2) also performs mixing by
weighted average, as it conforms to (6.18) with γ_k(x) = exp(v_k^T x) and mixing
model parameters V = {v_k}. The weights it assigns to each classifier are deter-
mined by the log-linear model exp(v_k^T x), which needs to be trained separately,
depending on the responsibilities that express the goodness-of-fit of the classifier
models for the different inputs. In contrast, all heuristic models that are introdu-
ced here rely on measures that are part of the classifiers' linear regression models
and do not need to be fitted separately. As they do not have any adjustable pa-
rameters, they all have V = ∅. The heuristics assume classifiers to use regression
rather than classification models. For the classification case, similar heuristics
are easily found by using the observations of the following section, which are valid
for any form of classifier model, to guide the design of these heuristics.
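For comparison, here is a minimal sketch of the generalised softmax case, where γ_k(x) = exp(v_k^T x); it only evaluates the mixing weights, with the separate training of V not shown. The function name softmax_mixing and the array shapes are assumptions for this example.

```python
import numpy as np

def softmax_mixing(m, V, X):
    """Generalised softmax mixing: plug gamma_k(x) = exp(v_k^T x) into (6.18).

    m : (K, N) matching values m_k(x_n)
    V : (K, D) mixing model parameters, one vector v_k per classifier
    X : (N, D) inputs
    """
    logits = V @ X.T                              # (K, N) entries v_k^T x_n
    logits -= logits.max(axis=0, keepdims=True)   # numerical stability; common factor cancels
    gamma = np.exp(logits)
    weighted = m * gamma
    norm = weighted.sum(axis=0, keepdims=True)
    return np.divide(weighted, norm, out=np.zeros_like(weighted), where=norm > 0)
```

Subtracting the per-input maximum from the logits does not change the result, since the common factor cancels between numerator and denominator of (6.18), but it avoids overflow for large v_k^T x.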
6.2.1 Properties of Weighted Averaging Mixing
Let f̂_k : X → R be given by f̂_k(x) = E(y | x, θ_k), that is, the estimator of
classifier k defined by the mean of the conditional distribution of the output given
the input and the classifier parameters. Equally, let f̂ : X → R be the global
model estimator, given by f̂(x) = E(y | x). As by (4.8) we have p(y | x) =
Σ_k g_k(x) p(y | x, θ_k), the global estimator is related to the local estimators by

\hat{f}(x) = \int_{\mathcal{Y}} y \sum_k g_k(x)\, p(y \,|\, x, \theta_k)\, \mathrm{d}y = \sum_k g_k(x)\, \hat{f}_k(x) ,    (6.19)
and, thus, is also a weighted average of the local estimators. From this it follows
that f̂ is bounded from below and above by the lowest and highest estimate of
the local models, respectively, that is,

\min_k \hat{f}_k(x) \;\leq\; \hat{f}(x) \;\leq\; \max_k \hat{f}_k(x) , \qquad \forall x \in \mathcal{X} .    (6.20)
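The short numerical check below illustrates (6.19) and (6.20); it is an illustrative sketch with made-up example values, not part of the original text. The global estimate is formed as the g_k-weighted average of local estimates and always lies between their minimum and maximum.

```python
import numpy as np

rng = np.random.default_rng(0)
K, N = 4, 10
f_local = rng.normal(size=(K, N))                   # example local estimates f_k(x_n)
m = rng.integers(0, 2, size=(K, N)).astype(float)   # example binary matching m_k(x_n)
m[:, m.sum(axis=0) == 0] = 1.0                      # ensure every input is matched by someone
gamma = rng.uniform(0.1, 1.0, size=(K, N))          # positive weighting functions gamma_k(x_n)

g = m * gamma / (m * gamma).sum(axis=0, keepdims=True)   # mixing weights (6.18)
f_global = (g * f_local).sum(axis=0)                     # global estimate (6.19)

# Bound (6.20): the global estimate lies between the lowest and highest local estimates.
assert np.all(f_global >= f_local.min(axis=0) - 1e-12)
assert np.all(f_global <= f_local.max(axis=0) + 1e-12)
```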
In general, we aim at minimising the deviation of the global estimator f̂ from
the target function f that describes the data-generating process. If we measure
 