Fig. 4.1. Directed graphical model of the Mixtures-of-Experts model. The circular nodes are random variables ($z_{nk}$), which are observed when shaded ($y_n$). Labels without nodes are either constants ($x_n$) or adjustable parameters ($\theta_k$, $v_k$). The boxes are "plates", comprising replicas of the entities inside them. Note that $z_{nk}$ is shared by both boxes, indicating that there is one $z$ for each expert for each observation.
By marginalising¹ over $z$, the output density results in
$$
p(y \,|\, x, \theta) = \sum_{z} \prod_{k=1}^{K} \big(g_k(x)\big)^{z_k} \big(p(y \,|\, x, \theta_k)\big)^{z_k} = \sum_{k=1}^{K} g_k(x)\, p(y \,|\, x, \theta_k), \qquad (4.8)
$$
and subsequently, the log-likelihood $l(\theta; \mathcal{D})$ is
$$
l(\theta; \mathcal{D}) = \ln \prod_{n=1}^{N} p(y_n \,|\, x_n, \theta) = \sum_{n=1}^{N} \ln \sum_{k=1}^{K} g_k(x_n)\, p(y_n \,|\, x_n, \theta_k). \qquad (4.9)
$$
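As a concrete illustration, the following is a minimal sketch that evaluates the mixture density (4.8) and the log-likelihood (4.9) numerically. It assumes softmax gating and Gaussian linear experts with parameters $\theta_k = (w_k, \tau_k)$, which are not fixed by the equations above; the function names, array shapes, and parameterisation are illustrative choices, not the text's notation.

```python
import numpy as np
from scipy.stats import norm

def gating(V, X):
    """Softmax gating values g_k(x_n) for every observation and expert.

    V : (K, D) gating parameters v_k, X : (N, D) inputs (rows x_n).
    Returns an (N, K) matrix whose rows sum to 1.
    """
    A = X @ V.T                                # activations v_k^T x_n
    A -= A.max(axis=1, keepdims=True)          # subtract max for numerical stability
    G = np.exp(A)
    return G / G.sum(axis=1, keepdims=True)

def log_likelihood(V, W, tau, X, y):
    """Log-likelihood l(theta; D) of (4.9), assuming Gaussian linear experts.

    Expert k models y ~ N(w_k^T x, 1/tau_k); W : (K, D), tau : (K,).
    """
    G = gating(V, X)                           # (N, K) gating values g_k(x_n)
    means = X @ W.T                            # (N, K) expert means w_k^T x_n
    # p(y_n | x_n, theta_k) for every n and k
    P = norm.pdf(y[:, None], loc=means, scale=1.0 / np.sqrt(tau))
    # (4.8): mixture density per observation, then (4.9): sum of log densities
    return np.sum(np.log(np.sum(G * P, axis=1)))
```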
Example 4.1 (Gating Network for 2 Experts).
Let us consider the input space of dimensionality $D_X = 3$, where an input is given by $x = (1, x_1, x_2)^T$. Assume two experts with gating parameters $v_1 = (0, 0, 1)^T$ and $v_2 = (0, 1, 0)^T$. Then, Fig. 4.2 shows the gating values $g_1(x)$ for Expert 1 over the range $-5 \le x_1 \le 5$, $-5 \le x_2 \le 5$. As can be seen, we have $g_1(x) > 0.5$ in the input subspace $x_1 - x_2 < 0$. Thus, with the given gating parameters, Expert 1 mainly models observations in this subspace. Overall, the gating network causes a soft linear partitioning of the input space along the line $x_1 - x_2 = 0$ that separates the two experts.
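Assuming the gating values are given by the softmax of the activations $v_k^T x$ (consistent with the soft linear partition described above), the following sketch reproduces the gating values of Example 4.1 at a few sample points; the function and variable names are illustrative.

```python
import numpy as np

# Gating parameters from Example 4.1, ordered as (bias, x1, x2)
v1 = np.array([0.0, 0.0, 1.0])
v2 = np.array([0.0, 1.0, 0.0])

def g1(x1, x2):
    """Softmax gating value of Expert 1 at x = (1, x1, x2)^T."""
    x = np.array([1.0, x1, x2])
    a1, a2 = v1 @ x, v2 @ x
    return np.exp(a1) / (np.exp(a1) + np.exp(a2))

# Evaluate at a few points inside the range shown in Fig. 4.2
for x1, x2 in [(-3.0, 3.0), (0.0, 0.0), (3.0, -3.0)]:
    print(f"g1({x1:+.0f}, {x2:+.0f}) = {g1(x1, x2):.3f}")
# g1 exceeds 0.5 exactly where x1 - x2 < 0, i.e. the soft linear partition
```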
4.1.3 Training by Expectation-Maximisation
Rather than using gradient descent to find the experts and gating network parameters $\theta$ that maximise the log-likelihood (4.9) [120], we can make use of the latent variable structure and apply the expectation-maximisation (EM) algorithm.
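To make the role of the latent variables concrete, here is a minimal sketch of the E-step under the same illustrative assumptions as before (softmax gating, Gaussian linear experts): the responsibilities $r_{nk}$, the expected values of $z_{nk}$ given the data, are proportional to $g_k(x_n)\,p(y_n \,|\, x_n, \theta_k)$ by (4.8). The M-step, which re-estimates the parameters given these responsibilities, is omitted here.

```python
import numpy as np
from scipy.stats import norm

def e_step(V, W, tau, X, y):
    """E-step sketch: responsibilities r_nk = E[z_nk | x_n, y_n].

    V, W : (K, D) gating and expert parameters, tau : (K,) expert noise
    precisions, X : (N, D) inputs, y : (N,) outputs.
    """
    A = X @ V.T
    A -= A.max(axis=1, keepdims=True)
    G = np.exp(A)
    G /= G.sum(axis=1, keepdims=True)          # gating values g_k(x_n)
    P = norm.pdf(y[:, None], loc=X @ W.T, scale=1.0 / np.sqrt(tau))
    R = G * P                                  # g_k(x_n) p(y_n | x_n, theta_k)
    return R / R.sum(axis=1, keepdims=True)    # normalise over the K experts
```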
¹ Given a joint density $p(x, y)$, one can get $p(y)$ by marginalising over $x$ by $p(y) = \int p(x, y)\,\mathrm{d}x$. The same principle applies to getting $p(y|z)$ from the conditional density $p(x, y|z)$.
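For instance, a quick numerical check of this identity, using an arbitrarily chosen bivariate Gaussian as the joint density $p(x, y)$:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import multivariate_normal, norm

# Joint density p(x, y): bivariate Gaussian with correlation 0.6 (illustrative choice)
joint = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, 0.6], [0.6, 1.0]])

# Marginalise over x numerically: p(y) = integral of p(x, y) dx
y0 = 0.8
p_y, _ = quad(lambda x: joint.pdf([x, y0]), -np.inf, np.inf)

print(p_y)                     # numerically marginalised density at y0
print(norm.pdf(y0, 0.0, 1.0))  # analytical marginal p(y) = N(0, 1); should agree
```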