where $\theta \equiv \{W, \mu, \tau\}$ is the parameter set. Since maximum likelihood estimation of PPCA is identical to PCA, PPCA is a natural extension of PCA to a probabilistic model.
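As an illustration of that equivalence, the closed-form maximum likelihood solution of PPCA (a standard result due to Tipping and Bishop, not derived in this text) builds $W$ from the leading eigenvectors and eigenvalues of the sample covariance, so the estimated principal subspace coincides with the PCA subspace. The following is a minimal numerical sketch of that check; the variable names and the use of NumPy are ours, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
m, K, n = 5, 2, 1000                      # observed dim, latent dim, samples

# synthetic data from a PPCA-like generative model: y = W x + noise
W_true = rng.normal(size=(m, K))
Y = rng.normal(size=(n, K)) @ W_true.T + rng.normal(scale=0.1, size=(n, m))

S = np.cov(Y, rowvar=False)               # sample covariance
eigval, eigvec = np.linalg.eigh(S)
eigval, eigvec = eigval[::-1], eigvec[:, ::-1]   # sort descending

# ML solution of PPCA (Tipping & Bishop), up to an arbitrary rotation:
#   sigma2 = average of the discarded eigenvalues,
#   W_ml   = U_K (Lambda_K - sigma2 I)^{1/2}
sigma2 = eigval[K:].mean()
W_ml = eigvec[:, :K] @ np.diag(np.sqrt(eigval[:K] - sigma2))

# the ML principal subspace equals the PCA subspace spanned by the top eigenvectors
P_ppca = W_ml @ np.linalg.pinv(W_ml)      # projector onto span(W_ml)
P_pca = eigvec[:, :K] @ eigvec[:, :K].T   # projector onto the PCA subspace
print(np.allclose(P_ppca, P_pca, atol=1e-8))   # True
```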
We present here a Bayesian estimation method for PPCA proposed by its original authors. Bayesian estimation obtains the posterior distribution of $\theta$ and $X$ according to Bayes' theorem:
$$
p(\theta, X \mid Y) \propto p(Y, X \mid \theta)\, p(\theta). \qquad (4.24)
$$
$p(\theta)$ is called a prior distribution, which denotes an a priori preference for the parameter $\theta$. The prior distribution is a part of the model and must be defined before estimation. We assume conjugate priors for $\tau$ and $\mu$, and a hierarchical prior for $W$; namely, the prior for $W$, $p(W \mid \tau, \alpha)$, is parameterized by a hyperparameter $\alpha \in \mathbb{R}^{K}$:
$$
\begin{aligned}
p(\theta \mid \alpha) &\equiv p(\mu, W, \tau \mid \alpha) = p(\mu \mid \tau)\, p(\tau) \prod_{j=1}^{K} p(w_j \mid \tau, \alpha_j),\\
p(\mu \mid \tau) &= \mathcal{N}\bigl(\mu \mid \mu_0, (\gamma_{\mu 0}\,\tau)^{-1} I_m\bigr),\\
p(w_j \mid \tau, \alpha_j) &= \mathcal{N}\bigl(w_j \mid 0, (\alpha_j \tau)^{-1} I_m\bigr),\\
p(\tau) &= \mathcal{G}(\tau \mid \tau_0, \gamma_{\tau 0}),
\end{aligned}
$$
where $\mathcal{G}(\tau \mid \bar{\tau}, \gamma_\tau)$ denotes a Gamma distribution with hyperparameters $\bar{\tau}$ and $\gamma_\tau$:
$$
\mathcal{G}(\tau \mid \bar{\tau}, \gamma_\tau) \equiv \frac{(\gamma_\tau \bar{\tau}^{-1})^{\gamma_\tau}}{\Gamma(\gamma_\tau)} \exp\!\left[ -\gamma_\tau \bar{\tau}^{-1} \tau + (\gamma_\tau - 1)\ln \tau \right],
$$
where $\Gamma(\cdot)$ is a Gamma function.
The variables used in the above priors, $\gamma_{\mu 0}$, $\mu_0$, $\gamma_{\tau 0}$ and $\tau_0$, are deterministic hyperparameters that define the prior. Their actual values should be given before the estimation. We set $\gamma_{\mu 0} = \gamma_{\tau 0} = 10^{-10}$, $\mu_0 = 0$ and $\tau_0 = 1$, which corresponds to an almost non-informative prior.
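To make the prior structure concrete, here is a small sketch that evaluates the log prior $\ln p(\theta \mid \alpha)$ under the densities above, using the almost non-informative hyperparameter values just given. The function and variable names are ours, and SciPy's Gamma parameterization is mapped onto $(\tau_0, \gamma_{\tau 0})$ as noted in the comments; treat this as an illustrative sketch rather than the authors' implementation.

```python
import numpy as np
from scipy import stats

m, K = 5, 2                                    # data dimension and number of latent axes

# almost non-informative hyperparameters, as in the text
gamma_mu0 = gamma_tau0 = 1e-10
mu0, tau0 = np.zeros(m), 1.0

def log_prior(mu, W, tau, alpha):
    """ln p(theta|alpha) = ln p(mu|tau) + ln p(tau) + sum_j ln p(w_j|tau, alpha_j)."""
    # p(mu | tau) = N(mu | mu0, (gamma_mu0 * tau)^(-1) I_m)
    lp = stats.multivariate_normal.logpdf(mu, mean=mu0,
                                          cov=np.eye(m) / (gamma_mu0 * tau))
    # p(w_j | tau, alpha_j) = N(w_j | 0, (alpha_j * tau)^(-1) I_m)
    for j in range(K):
        lp += stats.multivariate_normal.logpdf(W[:, j], mean=np.zeros(m),
                                               cov=np.eye(m) / (alpha[j] * tau))
    # p(tau) = G(tau | tau0, gamma_tau0): shape gamma_tau0, rate gamma_tau0 / tau0,
    # so the prior mean of tau is tau0
    lp += stats.gamma.logpdf(tau, a=gamma_tau0, scale=tau0 / gamma_tau0)
    return lp

# example evaluation at an arbitrary parameter setting
print(log_prior(mu=np.zeros(m), W=np.ones((m, K)), tau=1.0, alpha=np.ones(K)))
```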
Assuming the priors and given a whole data set $Y = \{y\}$, the type-II maximum likelihood hyperparameter $\alpha_{\mathrm{ML\text{-}II}}$ and the posterior distribution of the parameter, $q(\theta) = p(\theta \mid Y, \alpha_{\mathrm{ML\text{-}II}})$, are obtained by Bayesian estimation.
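For clarity, type-II maximum likelihood (also known as evidence maximization or empirical Bayes) here means choosing the hyperparameter to maximize the marginal likelihood of the observed data; written out in a standard form that is supplied by us rather than quoted from the text,
$$
\alpha_{\mathrm{ML\text{-}II}} = \arg\max_{\alpha}\, p(Y \mid \alpha) = \arg\max_{\alpha} \int p(Y, X \mid \theta)\, p(\theta \mid \alpha)\, dX\, d\theta ,
$$
and the parameter posterior $q(\theta)$ is then taken at that fixed hyperparameter value.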
The hierarchical prior $p(W \mid \alpha, \tau)$, which is called an automatic relevance determination (ARD) prior, has an important role in BPCA. The $j$-th principal axis $w_j$ has a Gaussian prior, and its variance $1/(\alpha_j \tau)$ is controlled by a hyperparameter $\alpha_j$, which is determined by type-II maximum likelihood estimation from the data. When the Euclidean norm of the principal axis, $\|w_j\|$, is small relative to the noise variance $1/\tau$, the hyperparameter $\alpha_j$ becomes large and the principal axis $w_j$ shrinks nearly to $0$. Thus, redundant principal axes are automatically suppressed.
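The shrinkage mechanism can be illustrated numerically. In variational implementations of BPCA the type-II ML update for the hyperparameter typically takes a form like $\alpha_j \approx m / \|w_j\|^2$; this particular update is our assumption, following common ARD practice, not a formula given in this text. An axis whose norm is small relative to the noise level then receives a large $\alpha_j$, and the Gaussian prior $\mathcal{N}(w_j \mid 0, (\alpha_j \tau)^{-1} I_m)$ pulls that axis toward zero. A minimal sketch:

```python
import numpy as np

m, tau = 10, 100.0                      # data dimension and noise precision (1/tau = noise variance)

# two candidate principal axes: one with a clearly non-zero norm,
# one whose norm is close to the noise scale 1/sqrt(tau) = 0.1
w_strong = np.full(m, 0.5)
w_weak = np.full(m, 0.01)

for name, w in [("strong", w_strong), ("weak", w_weak)]:
    alpha_j = m / np.sum(w ** 2)        # assumed ARD-style type-II ML update
    prior_var = 1.0 / (alpha_j * tau)   # variance of the Gaussian prior on w_j
    print(f"{name}: ||w|| = {np.linalg.norm(w):.3f}, "
          f"alpha_j = {alpha_j:.1f}, prior variance = {prior_var:.2e}")

# the "weak" axis gets a much larger alpha_j, i.e. a much tighter prior around 0,
# so repeated posterior updates shrink it further -- redundant axes are suppressed
```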