Digital Signal Processing Reference
In-Depth Information
$$F^*(\eta) = \sup_{\theta}\{\langle \theta, \eta\rangle - F(\theta)\} \qquad (16.5)$$

We get the maximum for $\eta = \nabla F(\theta)$. The parameters $\eta$ are called expectation parameters since $\eta = E[t(x)]$.

The gradients of $F$ and of its dual $F^*$ are inverses of each other:

$$\nabla F^* = (\nabla F)^{-1} \qquad (16.6)$$

and $F^*$ itself can be computed by:

$$F^*(\eta) = \int (\nabla F)^{-1}(\eta)\, \mathrm{d}\eta + \text{constant}. \qquad (16.7)$$
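As a concrete illustration of the duality above, consider a minimal numerical sketch using the Poisson family, whose log-normalizer is $F(\theta) = e^{\theta}$ (this choice, and the function names below, are illustrative assumptions, not part of the text). Here $\eta = \nabla F(\theta) = e^{\theta}$ is the mean, $(\nabla F)^{-1}(\eta) = \log \eta$, and the anti-derivative gives $F^*(\eta) = \eta \log \eta - \eta$ (taking the constant to be zero):

```python
import math

# Illustrative sketch (assumed setup): Poisson family with
# log-normalizer F(theta) = exp(theta).
def F(theta):
    return math.exp(theta)

def grad_F(theta):          # eta = grad F(theta): the expectation parameter
    return math.exp(theta)

def grad_F_inv(eta):        # (grad F)^{-1}(eta) = log(eta)
    return math.log(eta)

def F_star(eta):            # anti-derivative of log: eta*log(eta) - eta
    return eta * math.log(eta) - eta

theta = 0.7
eta = grad_F(theta)
# The two gradient maps invert each other (Eq. 16.6)
assert abs(grad_F_inv(eta) - theta) < 1e-12
# F*(eta) agrees with the Legendre transform value <theta, eta> - F(theta)
assert abs(F_star(eta) - (theta * eta - F(theta))) < 1e-12
```

The second assertion checks that the anti-derivative form of $F^*$ matches the supremum in Eq. (16.5), attained at $\theta = \log \eta$.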
Notice that this integral is often difficult to compute, and the convex conjugate $F^*$ of $F$ may not be known in closed form. We can bypass the anti-derivative operation by plugging into Eq. (16.5) the optimal value $\nabla F(\theta^*) = \eta$ (that is, $\theta^* = (\nabla F)^{-1}(\eta)$).
We get

$$F^*(\eta) = \langle (\nabla F)^{-1}(\eta), \eta\rangle - F((\nabla F)^{-1}(\eta)) \qquad (16.8)$$

This requires taking the reciprocal gradient $(\nabla F)^{-1} = \nabla F^*$, but allows us to discard the constant of integration in Eq. (16.7).
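This bypass can be sketched numerically. As an assumed example (not from the text), take the Bernoulli log-normalizer $F(\theta) = \log(1 + e^{\theta})$, whose gradient is the sigmoid and whose reciprocal gradient is the logit; the conjugate computed via Eq. (16.8) then matches the known closed form, the negative Shannon entropy of a Bernoulli distribution:

```python
import math

# Illustrative sketch (assumed setup): Bernoulli family with
# log-normalizer F(theta) = log(1 + exp(theta)).
def F(theta):
    return math.log1p(math.exp(theta))

def grad_F_inv(eta):            # reciprocal gradient: logit(eta)
    return math.log(eta / (1.0 - eta))

def F_star(eta):                # Eq. (16.8): no anti-derivative required
    theta_star = grad_F_inv(eta)
    return theta_star * eta - F(theta_star)

# Known closed form for comparison: eta*log(eta) + (1-eta)*log(1-eta)
eta = 0.3
closed = eta * math.log(eta) + (1 - eta) * math.log(1 - eta)
assert abs(F_star(eta) - closed) < 1e-12
```

No constant of integration appears: Eq. (16.8) pins down $F^*$ exactly from $\nabla F$ and its inverse.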
Thus a member of an exponential family can be described equivalently with the
natural parameters or with the dual expectation parameters.
16.2.3 Bregman Divergences
The Kullback-Leibler (KL) divergence between two members of the same exponential family can be computed in closed form using a bijection between Bregman divergences and exponential families. Bregman divergences are a family of divergences parameterized by the set of strictly convex and differentiable functions $F$:

$$B_F(p, q) = F(p) - F(q) - \langle p - q, \nabla F(q)\rangle \qquad (16.9)$$

$F$ is a strictly convex and differentiable function called the generator of the Bregman divergence.
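The definition translates directly into code. The sketch below (the function names and callable-based interface are assumptions for illustration) implements a generic Bregman divergence from a generator $F$ and its gradient, and checks it against the squared Euclidean distance:

```python
import numpy as np

# Generic Bregman divergence B_F(p, q) = F(p) - F(q) - <p - q, grad F(q)>,
# with the generator F and its gradient passed as plain callables.
def bregman(F, grad_F, p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return F(p) - F(q) - float(np.dot(p - q, grad_F(q)))

# Generator F(x) = ||x||^2 recovers the squared Euclidean distance:
# B_F(p, q) = ||p||^2 + ||q||^2 - 2<p, q> = ||p - q||^2
sq = lambda x: float(np.dot(x, x))
grad_sq = lambda x: 2.0 * x
p, q = np.array([1.0, 2.0]), np.array([0.0, 1.0])
assert np.isclose(bregman(sq, grad_sq, p, q), np.sum((p - q) ** 2))
```

Strict convexity of the generator guarantees $B_F(p, q) \geq 0$ with equality only when $p = q$, though $B_F$ is in general neither symmetric nor a metric.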
The family of Bregman divergences generalizes many of the usual divergences, for example:
• the squared Euclidean distance, for $F(x) = x^2$,
• the Kullback-Leibler (KL) divergence, with the Shannon negative entropy $F(x) = \sum_i x_i \log x_i$ (also called Shannon information).