where $z_{nk}$ is the $k$th element of $\mathbf{z}_n$. As only one element of $\mathbf{z}_n$ can be 1, the above expression is equivalent to the $j$th expert model such that $z_{nj} = 1$.
As the logarithm function is monotonically increasing, maximising the logarithm of the likelihood is equivalent to maximising the likelihood. Combining (4.1) and (4.2), the log-likelihood $\ln p(\mathbf{Y} \mid \mathbf{X}, \mathbf{Z}, \boldsymbol{\theta})$ results in
$$\ln p(\mathbf{Y} \mid \mathbf{X}, \mathbf{Z}, \boldsymbol{\theta}) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \ln p(y_n \mid \mathbf{x}_n, \boldsymbol{\theta}_k). \qquad (4.3)$$
Inspecting (4.3), we can see that each observation $n$ is assigned to the single expert for which $z_{nk} = 1$. Hence, it is maximised by maximising the likelihood of the expert models separately, for each expert based on its assigned set of observations.
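The decomposition implied by (4.3) can be checked numerically. The following is a minimal sketch with toy data; the linear-Gaussian experts, the variance of 0.01, and all variable names are illustrative assumptions, not part of the original derivation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: N observations, K linear-Gaussian experts.
N, K, D = 8, 3, 2
X = rng.normal(size=(N, D))
theta = rng.normal(size=(K, D))          # one weight vector per expert
assign = rng.integers(0, K, size=N)      # index of the generating expert
Z = np.eye(K)[assign]                    # 1-of-K latent indicators z_nk
y = np.einsum('nd,nd->n', X, theta[assign]) + 0.1 * rng.normal(size=N)

def log_gauss(y, mean, var=0.01):
    # Log-density of a univariate Gaussian (assumed expert model).
    return -0.5 * (np.log(2 * np.pi * var) + (y - mean) ** 2 / var)

# Complete-data log-likelihood, eq. (4.3):
# sum over n and k of z_nk * ln p(y_n | x_n, theta_k).
log_p = log_gauss(y[:, None], X @ theta.T)   # shape (N, K)
ll = np.sum(Z * log_p)

# Equivalent: sum over experts of the likelihood of their assigned points.
ll_per_expert = sum(log_p[assign == k, k].sum() for k in range(K))
assert np.isclose(ll, ll_per_expert)
```

Because each $\mathbf{z}_n$ selects exactly one expert, the double sum and the expert-wise grouping give the same value, which is why each expert can be trained on its assigned observations alone.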
4.1.2 Parametric Gating Network
As the latent variables $\mathbf{Z}$ are not directly observable, we do not know the values that they take and therefore cannot maximise the likelihood introduced in the previous section directly. Rather, a parametric model for $\mathbf{Z}$, known as the gating network, is used instead and trained in combination with the experts.
The gating network used in the standard MoE model is based on the assumption that the probability of an expert having generated the observation $(\mathbf{x}, y)$ is log-linearly related to the input $\mathbf{x}$. This is formulated by

$$g_k(\mathbf{x}) \equiv p(z_k = 1 \mid \mathbf{x}, \mathbf{v}_k) \propto \exp(\mathbf{v}_k^{\mathsf{T}} \mathbf{x}), \qquad (4.4)$$

stating that the probability of expert $k$ having generated observation $(\mathbf{x}, y)$ is proportional to the exponential of the inner product of the input $\mathbf{x}$ and the gating vector $\mathbf{v}_k$ of the same size as $\mathbf{x}$. Normalising $p(z_k = 1 \mid \mathbf{x}, \mathbf{v}_k)$, we get

$$g_k(\mathbf{x}) \equiv p(z_k = 1 \mid \mathbf{x}, \mathbf{v}_k) = \frac{\exp(\mathbf{v}_k^{\mathsf{T}} \mathbf{x})}{\sum_{j=1}^{K} \exp(\mathbf{v}_j^{\mathsf{T}} \mathbf{x})}, \qquad (4.5)$$
which is the well-known softmax function, and corresponds to the multinomial logit model in Statistics that is often used to model consumer choice [165]. It is parametrised by one gating vector $\mathbf{v}_k$ per expert, in combination forming the set $\mathbf{V} = \{\mathbf{v}_k\}$. Fig. 4.1 shows the directed graphical model that illustrates the structure and variable dependencies of the Mixtures-of-Experts model.
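The softmax gating of (4.5) can be sketched as follows. The max-subtraction trick for numerical stability is an implementation detail not discussed in the text; the gating vectors chosen here are arbitrary illustrative values.

```python
import numpy as np

def gating(x, V):
    """Softmax gating, eq. (4.5): g_k(x) = exp(v_k' x) / sum_j exp(v_j' x).

    V is a (K, D) matrix stacking the gating vectors v_k. Subtracting the
    maximum before exponentiating avoids overflow without changing the
    result, since the constant factor cancels in the normalisation.
    """
    a = V @ x
    a = a - a.max()
    e = np.exp(a)
    return e / e.sum()

# Usage: three experts, two-dimensional input (values are illustrative).
V = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
g = gating(np.array([0.5, -0.5]), V)
assert np.isclose(g.sum(), 1.0) and np.all(g > 0)
```

By construction the outputs are positive and sum to one, so $g_k(\mathbf{x})$ is a valid distribution over the $K$ experts for every input.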
To get the log-likelihood $l(\boldsymbol{\theta}; \mathcal{D}) \equiv \ln p(\mathbf{Y} \mid \mathbf{X}, \boldsymbol{\theta})$, we use the 1-of-$K$ structure of $\mathbf{z}$ to express the probability of having a latent random vector $\mathbf{z}$ for a given input $\mathbf{x}$ and a set of gating parameters $\mathbf{V}$ by

$$p(\mathbf{z} \mid \mathbf{x}, \mathbf{V}) = \prod_{k=1}^{K} p(z_k = 1 \mid \mathbf{x}, \mathbf{v}_k)^{z_k} = \prod_{k=1}^{K} g_k(\mathbf{x})^{z_k}. \qquad (4.6)$$
Thus, by combining (4.2) and (4.6), the joint density over $y$ and $\mathbf{z}$ is given by

$$p(y, \mathbf{z} \mid \mathbf{x}, \boldsymbol{\theta}) = \prod_{k=1}^{K} g_k(\mathbf{x})^{z_k} \, p(y \mid \mathbf{x}, \boldsymbol{\theta}_k)^{z_k}. \qquad (4.7)$$