input and the output and the constant noise variance. Let us have a closer look
at how λ = 0 influences ω₀: as λ → 0 causes (λτ)⁻¹ → ∞, one can interpret the
prior ω₀ to be the multivariate Gaussian N(0, ∞I) (ignoring the problems that
come with the use of ∞). As a Gaussian with increasing variance approaches
the uniform distribution, the elements of the weight vectors are now equally
likely to take any value on the real line. Even though such a prior seems
unbiased at first, let us not forget that the uniform density puts most of its
weight on large values due to its uniform tails [70]. Thus, as linear least squares
is equivalent to ridge regression with λ = 0, its implicit prior assumption about
the values of the weight vector elements is that they are uncorrelated but most
likely take very large values. Large weight vector values, however, are usually a
sign of non-smooth functions. Hence, linear least squares implicitly assumes that
the function it models is not smooth.
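To make this interpretation concrete, the following minimal sketch (the toy data, the noise precision τ, and all variable names are assumptions for illustration only) checks numerically that the ridge regression solution w = (XᵀX + λI)⁻¹Xᵀy coincides with the MAP weight estimate under the Gaussian prior N(0, (λτ)⁻¹I), and that setting λ = 0 recovers plain linear least squares:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 3))                      # inputs
    y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=50)
    tau = 4.0                                         # assumed known noise precision
    lam = 0.5                                         # ridge coefficient lambda

    def ridge(X, y, lam):
        # ridge solution w = (X'X + lam*I)^-1 X'y; lam = 0 is plain least squares
        return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

    def map_weights(X, y, lam, tau):
        # posterior mode under prior w ~ N(0, (lam*tau)^-1 I) and Gaussian
        # noise with precision tau
        A = tau * X.T @ X + lam * tau * np.eye(X.shape[1])
        return np.linalg.solve(A, tau * X.T @ y)

    print(np.allclose(ridge(X, y, lam), map_weights(X, y, lam, tau)))       # True
    print(np.allclose(ridge(X, y, 0.0),
                      np.linalg.lstsq(X, y, rcond=None)[0]))                # True

Note that the noise precision τ cancels in the MAP expression, which is why the ridge solution itself does not depend on the noise variance once the prior is parametrised by λτ.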
As discussed in Sect. 3.1.1, a smooth function is a prerequisite for generalisation.
Thus, we do actually assume smoothness of the function, and therefore
ridge regression with λ > 0 is more appropriate than plain linear least squares.
The prior that is associated with ridge regression is known as a shrinkage prior
[102], as it causes the weight vector elements to be smaller than without using
this prior. Ridge regression itself is part of a family of regularisation methods
that add the assumption of function smoothness to guide parameter learning in
otherwise ill-defined circumstances [213].
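As a small illustration of this shrinkage effect (a sketch with an assumed toy problem, not an example from the text), fitting a high-degree polynomial to noisy data with λ = 0 yields very large weights and a correspondingly non-smooth fit, whereas even a moderate λ > 0 shrinks the weights substantially:

    import numpy as np

    rng = np.random.default_rng(1)
    x = np.linspace(0, 1, 20)
    y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)
    X = np.vander(x, N=9, increasing=True)            # degree-8 polynomial basis

    for lam in (0.0, 1e-3, 1.0):
        w = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
        print(f"lambda = {lam:g}  ->  ||w|| = {np.linalg.norm(w):.2f}")
    # here lambda = 0 (plain least squares) gives by far the largest weight
    # norm, i.e. the least smooth fitted function; increasing lambda shrinks
    # the weights towards zero.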
In summary, even methods that seemingly make no assumptions about the
parameter values are biased by implicit priors, as was shown by comparing ridge
regression to linear least squares. In any case, it is important to be aware of
these priors, as they are part of the assumptions that a model makes about
the data-generating process. Thus, when introducing the Bayesian LCS model,
special emphasis is put on how the introduced parameter priors express our
assumptions.
7.2 A Fully Bayesian LCS for Regression
The Bayesian LCS model for regression is equivalent to the one introduced as a
generalisation of the Mixtures-of-Experts model in Chap. 4, with the differences
that here, classifiers are allowed to perform multivariate rather than univariate
regression, and that priors and associated hyperpriors are assigned to all model
parameters. As such, it is a generalisation of the previous model, which it
completely subsumes. A similar model for classification will be briefly discussed
in Sect. 7.5. For now, the classifiers are not assumed to be trained independently.
This independence will be re-introduced at a later stage, analogous to Sect. 4.4.
Table 7.1 gives a summary of the Bayesian LCS model, and Fig. 7.2 shows
its variable dependency structure as a directed graph. Apart from the additional
matching, the model is similar to the Bayesian MoE model of Waterhouse et al.
[227, 226], to the Bayesian mixture model of Ueda and Ghahramani [216], and
to the Bayesian MoE model of Bishop and Svensen [20]. Each of its components
will now be described in more detail.