where $z_{nk}$ is the $k$th element of $\mathbf{z}_n$. As only one element of $\mathbf{z}_n$ can be 1, the above expression is equivalent to the $j$th expert model such that $z_{nj} = 1$.
As the logarithm function is monotonically increasing, maximising the logarithm of the likelihood is equivalent to maximising the likelihood. Combining
(4.1) and (4.2), the log-likelihood $\ln p(\mathbf{Y} \mid \mathbf{X}, \mathbf{Z}, \boldsymbol{\theta})$ results in

$$\ln p(\mathbf{Y} \mid \mathbf{X}, \mathbf{Z}, \boldsymbol{\theta}) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \ln p(\mathbf{y}_n \mid \mathbf{x}_n, \boldsymbol{\theta}_k). \tag{4.3}$$
Inspecting (4.3), we can see that each observation $n$ is assigned to the single expert for which $z_{nk} = 1$. Hence, (4.3) is maximised by maximising the likelihood of each expert model separately, over its assigned set of observations.
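As a small numerical sketch (not from the text), the decomposition of (4.3) can be checked directly: with hypothetical Gaussian experts and random 1-of-$K$ assignments, the double sum over all observations and experts equals the sum of each expert's log-likelihood over only its assigned observations.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 6, 2                                   # hypothetical numbers of observations and experts
x = rng.normal(size=N)
y = rng.normal(size=N)
theta = np.array([[2.0, 0.5], [-1.0, 0.5]])   # assumed per-expert (slope, std) parameters
Z = np.eye(K)[rng.integers(K, size=N)]        # random 1-of-K latent assignments z_nk, shape (N, K)

def log_gauss(y, mean, std):
    """Log-density of a univariate Gaussian, standing in for ln p(y_n | x_n, theta_k)."""
    return -0.5 * np.log(2 * np.pi * std**2) - (y - mean) ** 2 / (2 * std**2)

# Matrix of ln p(y_n | x_n, theta_k) for every observation n and expert k
L = np.stack([log_gauss(y, th[0] * x, th[1]) for th in theta], axis=1)

# Eq. (4.3): double sum, with z_nk selecting exactly one expert per observation
full = np.sum(Z * L)
# Equivalent form: each expert's log-likelihood over its assigned observations only
per_expert = sum(L[Z[:, k] == 1, k].sum() for k in range(K))
assert np.isclose(full, per_expert)
```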
4.1.2 Parametric Gating Network
As the latent variables $\mathbf{Z}$ are not directly observable, we do not know the values that they take and therefore cannot maximise the likelihood introduced in the previous section directly. Rather, a parametric model for $\mathbf{Z}$, known as the *gating network*, is used instead and trained in combination with the experts.
The gating network used in the standard MoE model is based on the assumption that the probability of an expert having generated the observation $(\mathbf{x}, \mathbf{y})$ is log-linearly related to the input $\mathbf{x}$. This is formulated by

$$g_k(\mathbf{x}) \equiv p(z_k = 1 \mid \mathbf{x}, \mathbf{v}_k) \propto \exp(\mathbf{v}_k^{\mathsf{T}} \mathbf{x}), \tag{4.4}$$

stating that the probability of expert $k$ having generated the observation $(\mathbf{x}, \mathbf{y})$ is proportional to the exponential of the inner product of the input $\mathbf{x}$ and the gating vector $\mathbf{v}_k$ of the same size as $\mathbf{x}$. Normalising $p(z_k = 1 \mid \mathbf{x}, \mathbf{v}_k)$, we get

$$g_k(\mathbf{x}) \equiv p(z_k = 1 \mid \mathbf{x}, \mathbf{v}_k) = \frac{\exp(\mathbf{v}_k^{\mathsf{T}} \mathbf{x})}{\sum_{j=1}^{K} \exp(\mathbf{v}_j^{\mathsf{T}} \mathbf{x})}, \tag{4.5}$$
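A minimal sketch of (4.5) in NumPy, assuming the gating vectors are stacked into a hypothetical $K \times D$ matrix `V`; the max-subtraction trick is a standard numerical-stability measure, not part of the text.

```python
import numpy as np

def gating(V, x):
    """Softmax gating of Eq. (4.5): g_k(x) = exp(v_k^T x) / sum_j exp(v_j^T x).

    V: (K, D) matrix of gating vectors v_k; x: (D,) input. Shapes are assumptions.
    """
    a = V @ x
    a -= a.max()          # subtracting the max leaves the ratio unchanged but avoids overflow
    e = np.exp(a)
    return e / e.sum()

V = np.array([[1.0, -0.5], [0.2, 0.3], [-1.0, 0.8]])  # hypothetical K=3 experts, D=2 inputs
g = gating(V, np.array([0.5, 1.0]))
# g is a proper probability vector over the K experts
assert np.isclose(g.sum(), 1.0) and np.all(g > 0)
```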
which is the well-known softmax function, and corresponds to the multinomial logit model in Statistics that is often used to model consumer choice [165]. It is parametrised by one gating vector $\mathbf{v}_k$ per expert, in combination forming the set $\mathbf{V} = \{\mathbf{v}_k\}$. Fig. 4.1 shows the directed graphical model that illustrates the structure and variable dependencies of the Mixtures-of-Experts model.

To get the log-likelihood $l(\boldsymbol{\theta}; \mathcal{D}) = \ln p(\mathbf{Y} \mid \mathbf{X}, \boldsymbol{\theta})$, we use the 1-of-$K$ structure of $\mathbf{z}$ to express the probability of having a latent random vector $\mathbf{z}$ for a given input $\mathbf{x}$ and a set of gating parameters $\mathbf{V}$ by

$$p(\mathbf{z} \mid \mathbf{x}, \mathbf{V}) = \prod_{k=1}^{K} p(z_k = 1 \mid \mathbf{x}, \mathbf{v}_k)^{z_k} = \prod_{k=1}^{K} g_k(\mathbf{x})^{z_k}. \tag{4.6}$$
Thus, by combining (4.2) and (4.6), the joint density over $\mathbf{y}$ and $\mathbf{z}$ is given by

$$p(\mathbf{y}, \mathbf{z} \mid \mathbf{x}, \boldsymbol{\theta}) = \prod_{k=1}^{K} g_k(\mathbf{x})^{z_k} \, p(\mathbf{y} \mid \mathbf{x}, \boldsymbol{\theta}_k)^{z_k}. \tag{4.7}$$
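Because $\mathbf{z}$ is a 1-of-$K$ vector, the product in (4.7) collapses to the single term of the selected expert. A tiny sketch with made-up gating and expert densities makes this explicit:

```python
import numpy as np

g = np.array([0.2, 0.5, 0.3])      # hypothetical gating probabilities g_k(x)
p_y = np.array([0.1, 0.4, 0.05])   # hypothetical expert densities p(y | x, theta_k)
z = np.array([0, 1, 0])            # 1-of-K latent vector selecting expert j = 1

# Eq. (4.7): product over k of (g_k(x) * p(y | x, theta_k))^{z_k}
joint = np.prod((g * p_y) ** z)

# All factors with z_k = 0 equal 1, so only the selected expert's term remains
assert np.isclose(joint, g[1] * p_y[1])
```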