[Figure 4.1: a directed graphical model with nodes z_{nk} and y_n, labels x_n, θ_k and v_k, and plates over the K experts and the N data points.]
Fig. 4.1. Directed graphical model of the Mixtures-of-Experts model. The circular nodes are random variables (z_{nk}), which are observed when shaded (y_n). Labels without nodes are either constants (x_n) or adjustable parameters (θ_k, v_k). The boxes are “plates”, comprising replicas of the entities inside them. Note that z_{nk} is shared by both boxes, indicating that there is one z for each expert for each observation.
By marginalising¹ over z, the output density results in

p(y \mid x, \theta) = \sum_{z} \prod_{k=1}^{K} g_k(x)^{z_k} \, p(y \mid x, \theta_k)^{z_k} = \sum_{k=1}^{K} g_k(x) \, p(y \mid x, \theta_k),    (4.8)
and subsequently, the log-likelihood l(θ; D) is

l(\theta; \mathcal{D}) = \ln \prod_{n=1}^{N} p(y_n \mid x_n, \theta) = \sum_{n=1}^{N} \ln \sum_{k=1}^{K} g_k(x_n) \, p(y_n \mid x_n, \theta_k).    (4.9)
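To make (4.8) and (4.9) concrete, the following sketch evaluates both quantities numerically. It assumes softmax gating, g_k(x) = exp(v_k^T x) / Σ_j exp(v_j^T x), and univariate Gaussian experts with means w_k^T x and fixed variance; these concrete forms, as well as all variable names, are illustrative assumptions rather than definitions taken from the text.

```python
import numpy as np

def gating(X, V):
    """Softmax gating g_k(x) = exp(v_k^T x) / sum_j exp(v_j^T x) (assumed form)."""
    A = X @ V.T                        # (N, K) activations v_k^T x_n
    A -= A.max(axis=1, keepdims=True)  # subtract row maximum for numerical stability
    G = np.exp(A)
    return G / G.sum(axis=1, keepdims=True)

def expert_density(y, X, W, sigma=1.0):
    """Gaussian expert densities p(y_n | x_n, theta_k) with mean w_k^T x_n (assumed form)."""
    mu = X @ W.T                       # (N, K) expert means
    return np.exp(-0.5 * ((y[:, None] - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def output_density(y, X, V, W):
    """Eq. (4.8): p(y_n | x_n, theta) = sum_k g_k(x_n) p(y_n | x_n, theta_k)."""
    return np.sum(gating(X, V) * expert_density(y, X, W), axis=1)

def log_likelihood(y, X, V, W):
    """Eq. (4.9): l(theta; D) = sum_n ln sum_k g_k(x_n) p(y_n | x_n, theta_k)."""
    return np.sum(np.log(output_density(y, X, V, W)))

# toy data: inputs x = (1, x1, x2)^T, two experts
rng = np.random.default_rng(0)
X = np.hstack([np.ones((100, 1)), rng.uniform(-5.0, 5.0, size=(100, 2))])
y = rng.normal(size=100)
V = np.array([[0.0, 0.0, 1.0], [0.0, 1.0, 0.0]])    # gating parameters as in Example 4.1
W = np.array([[1.0, 0.5, 0.0], [-1.0, 0.0, 0.5]])   # arbitrary expert weights
print(log_likelihood(y, X, V, W))
```

Note that the second equality in (4.8) is what the code exploits: instead of summing over all 1-of-K assignments z, it simply weights each expert density by its gating value.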
Example 4.1 (Gating Network for 2 Experts). Let us consider an input space of dimensionality D_X = 3, where an input is given by x = (1, x_1, x_2)^T. Assume two experts with gating parameters v_1 = (0, 0, 1)^T and v_2 = (0, 1, 0)^T. Then, Fig. 4.2 shows the gating values g_1(x) for Expert 1 over the range −5 ≤ x_1 ≤ 5, −5 ≤ x_2 ≤ 5. As can be seen, we have g_1(x) > 0.5 in the input subspace x_1 − x_2 < 0. Thus, with the given gating parameters, Expert 1 mainly models observations in this subspace. Overall, the gating network causes a soft linear partitioning of the input space along the line x_1 − x_2 = 0 that separates the two experts.
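A short numerical check of this example, assuming the softmax form of the gating network, g_k(x) = exp(v_k^T x) / Σ_j exp(v_j^T x); the specific functional form is an assumption here rather than quoted from this passage:

```python
import numpy as np

# gating parameters from Example 4.1
v1 = np.array([0.0, 0.0, 1.0])
v2 = np.array([0.0, 1.0, 0.0])

def g1(x1, x2):
    """Gating value of Expert 1 under softmax gating (assumed form)."""
    x = np.array([1.0, x1, x2])        # x = (1, x1, x2)^T
    a = np.array([v1 @ x, v2 @ x])     # activations v_k^T x, here (x2, x1)
    e = np.exp(a - a.max())            # numerically stable softmax
    return e[0] / e.sum()

# g1(x) > 0.5 exactly where x1 - x2 < 0; on the line x1 = x2 both experts get 0.5
for x1, x2 in [(-3.0, 2.0), (2.0, -3.0), (1.0, 1.0)]:
    print((x1, x2), round(g1(x1, x2), 3))   # ~0.993, ~0.007, 0.5
```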
4.1.3 Training by Expectation-Maximisation
Rather than using gradient descent to find the experts and gating network parameters θ that maximise the log-likelihood (4.9) [120], we can make use of the latent variable structure and apply the expectation-maximisation (EM) algorithm
¹ Given a joint density p(x, y), one can get p(y) by marginalising over x by
p(y) = \int p(x, y) \, \mathrm{d}x.
The same principle applies to getting p(y | z) from the conditional density p(x, y | z).
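As a tiny illustration of this footnote (using a made-up discrete joint distribution rather than anything from the text), summing a joint probability table over x recovers p(y):

```python
import numpy as np

# made-up joint distribution p(x, y) over x in {0, 1, 2} and y in {0, 1}
p_xy = np.array([[0.10, 0.20],
                 [0.15, 0.25],
                 [0.05, 0.25]])   # rows index x, columns index y; entries sum to 1

p_y = p_xy.sum(axis=0)            # marginalise over x: p(y) = sum_x p(x, y)
print(p_y)                        # [0.3 0.7]
```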