equivalent to having each classifier match all inputs. This results in a set of classifiers that all match the whole input space, and localisation is performed solely by the soft linear partitioning induced by the gating network.
4.3.5 Relation to LCS
This generalised MoE model satisfies all characteristics of LCS outlined in Sect. 3.2: each classifier describes a localised model with its localisation determined by the model structure, and the local models are combined to form a global model. So, given that the model can be trained efficiently, and that there exists a good mechanism for searching the space of model structures, do we already have an LCS? While some LCS researchers might disagree, partly because there is no universal definition of what an LCS is, and because LCS appear to be mostly thought of in algorithmic terms rather than in terms of the model that they describe, the author believes that this is the case.
However, the generalised MoE model has a feature that no LCS has ever used: beyond the localisation of classifiers by their matching functions, the responsibilities of classifiers that share matching inputs are further distributed by the softmax function. While this feature might lead to a better fit of the model to the data, it blurs the observation/classifier association by extending it beyond the matching function. Nonetheless, the introduced transfer function φ can be used to level this effect: when it is defined as the identity function, φ(x) = x, then by (4.21) the probability of a certain classifier generating an observation for a matching input is log-linearly related to the input x. By setting φ(x) = 1 for all x, however, the relation reduces to g_k(x) ∝ m_k(x) exp(v_k), where the gating vector v_k reduces to the scalar v_k. Hence, the gating weight becomes independent of the input (besides the matching) and relies only on the constant v_k through exp(v_k). In areas of the input space that several classifiers match, classifiers with a larger v_k have a stronger influence when forming the prediction of the global model, as they have a higher gating weight. To summarise, setting φ(x) = 1 makes gating independent of the input (besides the matching), with the gating weight of each classifier determined by the single scalar v_k. Further details and alternative models for the gating network are discussed in Chap. 6.
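To make the effect of the two choices of φ concrete, the following sketch (illustrative code, not from the text) computes the matched softmax gating weights g_k(x) ∝ m_k(x) exp(v_k^T φ(x)) for both φ(x) = x and φ(x) = 1; the matching functions and gating parameters are arbitrary assumptions.

    import numpy as np

    def gating_weights(x, matching, V, phi):
        # Matched softmax gating: g_k(x) proportional to m_k(x) * exp(v_k . phi(x)).
        # Classifiers that do not match x (m_k(x) = 0) get zero gating weight.
        # Assumes at least one classifier matches x, so the sum is nonzero.
        m = np.array([m_k(x) for m_k in matching])
        g = m * np.exp(np.asarray(V) @ np.atleast_1d(phi(x)))
        return g / g.sum()

    # Two hypothetical classifiers: one matches x_1 + x_2 >= 3, one matches everywhere.
    matching = [lambda x: 1.0 if x[0] + x[1] >= 3 else 0.0,
                lambda x: 1.0]
    x = np.array([2.0, 2.0])  # both classifiers match this input

    # phi(x) = x: gating weights are log-linear in the input.
    V_lin = np.array([[0.5, -0.2], [0.1, 0.3]])  # one gating vector v_k per classifier
    print(gating_weights(x, matching, V_lin, phi=lambda x: x))

    # phi(x) = 1: gating depends only on the scalars v_k (and on matching).
    V_const = np.array([[0.7], [0.2]])  # each v_k reduces to a scalar
    print(gating_weights(x, matching, V_const, phi=lambda x: np.ones(1)))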
Note that φ(x) = 1 is not applicable in the standard MoE model, that is, when all classifiers match the full input space. In that case we would have neither localisation by matching nor localisation by the softmax function, and hence the global model would be no better at modelling the data than a single local model applied to the whole data.
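Continuing the sketch above, this degenerate case is easy to see: if every classifier matches every input and φ(x) = 1, the gating weights are identical constants for all inputs, so the mixture collapses to one fixed blend of the local models.

    # All-matching classifiers with phi(x) = 1: the gating weights no longer
    # depend on x at all, so no localisation remains.
    match_all = [lambda x: 1.0, lambda x: 1.0]
    for x in (np.array([0.0, 0.0]), np.array([5.0, 5.0])):
        print(gating_weights(x, match_all, V_const, phi=lambda x: np.ones(1)))
    # Both inputs produce the same gating weights.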
Example 4.2 (Localisation by Matching and the Softmax Function). Consider the same setting as in Example 4.1, and additionally φ(x) = x for all x and the matching functions

    m_1(x) = 1 if x_1 + x_2 ≥ 3, and m_1(x) = 0 otherwise.    (4.23)
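A direct transcription of (4.23) in the same style as the sketch above (the ≥ reading of the threshold is assumed from the text):

    def m_1(x):
        # Matching function (4.23): classifier 1 matches inputs with x_1 + x_2 >= 3.
        return 1.0 if x[0] + x[1] >= 3 else 0.0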