Digital Signal Processing Reference
In-Depth Information
$$F^*(\eta) = \sup_{\theta}\{\langle \theta, \eta\rangle - F(\theta)\} \qquad (16.5)$$

We get the maximum for $\eta = \nabla F(\theta)$. The parameters $\eta$ are called expectation parameters since $\eta = E[t(x)]$.

The gradients of $F$ and of its dual $F^*$ are inverses of each other:

$$\nabla F^* = (\nabla F)^{-1} \qquad (16.6)$$

and $F^*$ itself can be computed by:

$$F^*(\eta) = \int (\nabla F)^{-1}(\eta)\, \mathrm{d}\eta + \text{constant}. \qquad (16.7)$$
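As a concrete illustration of the duality above, consider a minimal numerical sketch using the Poisson family, whose log-normalizer is $F(\theta) = e^{\theta}$ (this choice, and the function names below, are illustrative assumptions, not part of the text). Here $\eta = \nabla F(\theta) = e^{\theta}$ is the mean, $(\nabla F)^{-1}(\eta) = \log \eta$, and the anti-derivative gives $F^*(\eta) = \eta \log \eta - \eta$ (taking the constant to be zero):

```python
import math

# Illustrative sketch (assumed setup): Poisson family with
# log-normalizer F(theta) = exp(theta).
def F(theta):
    return math.exp(theta)

def grad_F(theta):          # eta = grad F(theta): the expectation parameter
    return math.exp(theta)

def grad_F_inv(eta):        # (grad F)^{-1}(eta) = log(eta)
    return math.log(eta)

def F_star(eta):            # anti-derivative of log: eta*log(eta) - eta
    return eta * math.log(eta) - eta

theta = 0.7
eta = grad_F(theta)
# The two gradient maps invert each other (Eq. 16.6)
assert abs(grad_F_inv(eta) - theta) < 1e-12
# F*(eta) agrees with the Legendre transform value <theta, eta> - F(theta)
assert abs(F_star(eta) - (theta * eta - F(theta))) < 1e-12
```

The second assertion checks that the anti-derivative form of $F^*$ matches the supremum in Eq. (16.5), attained at $\theta = \log \eta$.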
Notice that this integral is often difficult to compute, and the convex conjugate $F^*$ of $F$ may not be known in closed form. We can bypass the anti-derivative operation by plugging into Eq. (16.5) the optimal value $\nabla F(\theta^*) = \eta$ (that is, $\theta^* = (\nabla F)^{-1}(\eta)$).
We get

$$F^*(\eta) = \langle (\nabla F)^{-1}(\eta), \eta\rangle - F((\nabla F)^{-1}(\eta)) \qquad (16.8)$$

This requires taking the reciprocal gradient $(\nabla F)^{-1} = \nabla F^*$, but allows us to discard the constant of integration in Eq. (16.7).
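This bypass can be sketched numerically. As an assumed example (not from the text), take the Bernoulli log-normalizer $F(\theta) = \log(1 + e^{\theta})$, whose gradient is the sigmoid and whose reciprocal gradient is the logit; the conjugate computed via Eq. (16.8) then matches the known closed form, the negative Shannon entropy of a Bernoulli distribution:

```python
import math

# Illustrative sketch (assumed setup): Bernoulli family with
# log-normalizer F(theta) = log(1 + exp(theta)).
def F(theta):
    return math.log1p(math.exp(theta))

def grad_F_inv(eta):            # reciprocal gradient: logit(eta)
    return math.log(eta / (1.0 - eta))

def F_star(eta):                # Eq. (16.8): no anti-derivative required
    theta_star = grad_F_inv(eta)
    return theta_star * eta - F(theta_star)

# Known closed form for comparison: eta*log(eta) + (1-eta)*log(1-eta)
eta = 0.3
closed = eta * math.log(eta) + (1 - eta) * math.log(1 - eta)
assert abs(F_star(eta) - closed) < 1e-12
```

No constant of integration appears: Eq. (16.8) pins down $F^*$ exactly from $\nabla F$ and its inverse.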
Thus a member of an exponential family can be described equivalently with the
natural parameters or with the dual expectation parameters.
16.2.3 Bregman Divergences
The Kullback-Leibler (KL) divergence between two members of the same exponential family can be computed in closed form using a bijection between Bregman divergences and exponential families. Bregman divergences are a family of divergences parameterized by the set of strictly convex and differentiable functions $F$:

$$B_F(p, q) = F(p) - F(q) - \langle p - q, \nabla F(q)\rangle \qquad (16.9)$$

$F$ is a strictly convex and differentiable function called the generator of the Bregman divergence.
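The definition translates directly into code. The sketch below (the function names and callable-based interface are assumptions for illustration) implements a generic Bregman divergence from a generator $F$ and its gradient, and checks it against the squared Euclidean distance:

```python
import numpy as np

# Generic Bregman divergence B_F(p, q) = F(p) - F(q) - <p - q, grad F(q)>,
# with the generator F and its gradient passed as plain callables.
def bregman(F, grad_F, p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return F(p) - F(q) - float(np.dot(p - q, grad_F(q)))

# Generator F(x) = ||x||^2 recovers the squared Euclidean distance:
# B_F(p, q) = ||p||^2 + ||q||^2 - 2<p, q> = ||p - q||^2
sq = lambda x: float(np.dot(x, x))
grad_sq = lambda x: 2.0 * x
p, q = np.array([1.0, 2.0]), np.array([0.0, 1.0])
assert np.isclose(bregman(sq, grad_sq, p, q), np.sum((p - q) ** 2))
```

Strict convexity of the generator guarantees $B_F(p, q) \geq 0$ with equality only when $p = q$, though $B_F$ is in general neither symmetric nor a metric.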
The family of Bregman divergences generalizes many of the usual divergences, for example:
• the squared Euclidean distance, for $F(x) = x^2$,
• the Kullback-Leibler (KL) divergence, with the Shannon negative entropy $F(x) = \sum_i x_i \log x_i$ (also called Shannon information).