Raiffa and Schlaifer suggested using conjugate distributions as prior
distributions, where the posterior distribution is the same kind of
distribution as the corresponding prior. The general definition of a
conjugate distribution is as follows:
Definition 6.7
Let the conditional distribution of samples x1, x2, …, xn given
parameter θ be p(x1, x2, …, xn | θ). If the prior density function π(θ) and the
resulting posterior density function π(θ|x) belong to the same family, the prior
density function π(θ) is said to be conjugate to the conditional distribution p(x|θ).
Definition 6.8
Let P = {p(x|θ): θ ∈ Θ} be a family of density functions with
parameter θ, and let H be a family of prior distributions π(θ) for θ. If for any
given p ∈ P and π ∈ H the resulting posterior distribution π(θ|x) is always in the
family H, then H is said to be a conjugate family for P.
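As a concrete illustration of Definition 6.8 (a standard textbook example, not
drawn from the text above), consider the Beta family as a prior for a Bernoulli
likelihood:

    π(θ) ∝ θ^(α−1) (1−θ)^(β−1)              (prior: Beta(α, β))
    p(x1, …, xn | θ) = θ^k (1−θ)^(n−k)       (k successes in n Bernoulli trials)
    π(θ|x) ∝ θ^(α+k−1) (1−θ)^(β+n−k−1)       (posterior: Beta(α+k, β+n−k))

The posterior is again a Beta distribution, so the Beta family H is conjugate to
the Bernoulli family P.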
When the density function of the data distribution and the prior density are
both exponential functions, their product is the same kind of exponential
function; the only difference is a proportionality factor. So we
have:
Theorem 6.4
If the kernel of the density function f(x) of a random variable X is an
exponential function, then the density function belongs to a conjugate family.
All distributions whose kernel is an exponential function form the exponential
family, which includes the binomial distribution, multinomial distribution, normal
distribution, Gamma distribution, Poisson distribution, and Dirichlet distribution.
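In canonical form (a standard parameterization stated here for reference, not
taken from the text), a member of the exponential family can be written as

    p(x|θ) = h(x) exp(η(θ)·T(x) − A(θ)),

where T(x) is the sufficient statistic, η(θ) the natural parameter, and A(θ)
the log normalizer; the factor exp(η(θ)·T(x)) is the exponential kernel
referred to in Theorem 6.4.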
Conjugate distributions provide a reasonable synthesis of historical trials
and a reasonable starting point for future trials. Computation with a
non-conjugate distribution is rather difficult, whereas computation with a
conjugate distribution is easy: only multiplication with the prior is required.
In fact, the conjugate family lays a firm foundation for the practical
application of Bayesian learning.
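To make the last point concrete, here is a minimal Python sketch of a conjugate
update (the Beta-binomial pairing and all names are illustrative assumptions,
not taken from the text): the posterior follows from the prior by nothing more
than combining its parameters with the observed counts.

    # Minimal sketch of a conjugate Bayesian update (Beta prior, binomial data).
    def beta_binomial_update(alpha, beta, successes, failures):
        """Multiply a Beta(alpha, beta) prior by a binomial likelihood.

        Because the Beta family is conjugate to the binomial likelihood,
        the posterior is again a Beta distribution; the computation
        amounts to adding the observed counts to the prior's parameters.
        """
        return alpha + successes, beta + failures

    # Prior Beta(2, 2); historical trials: 7 successes, 3 failures.
    a_post, b_post = beta_binomial_update(2, 2, 7, 3)
    print(f"posterior: Beta({a_post}, {b_post})")         # Beta(9, 5)
    print("posterior mean:", a_post / (a_post + b_post))  # 9/14 ≈ 0.643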
2. Principle of maximum entropy
In information theory, entropy is used to quantify the uncertainty of an event.
Suppose a random variable x takes one of two possible values, a and b, and
compare the following two cases:
(1) p(x = a) = 0.98, p(x = b) = 0.02
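A short Python sketch (illustrative, not part of the original text) computes
the Shannon entropy of case (1); the highly skewed probabilities yield an
entropy near zero, i.e. very little uncertainty. A uniform 0.5/0.5 distribution
is added only as a hypothetical contrast with maximum uncertainty.

    import math

    def shannon_entropy(probs):
        """Shannon entropy H = -sum(p * log2(p)), measured in bits."""
        return -sum(p * math.log2(p) for p in probs if p > 0)

    # Case (1): a highly skewed distribution -> low uncertainty.
    print(shannon_entropy([0.98, 0.02]))  # ~0.141 bits

    # Hypothetical contrast: a uniform distribution -> maximum uncertainty.
    print(shannon_entropy([0.5, 0.5]))    # 1.0 bit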