observe that M contains K matching functions such that the set of possible M
grows exponentially with K .
The question of how to best specify p(M), and if there even is a "best"
prior on M, is not completely clear and requires further investigation. For now,
p(M) ∝ 1/K!, or

    ln p(M) = −ln K! + const.,    (7.4)

is used for illustrative purposes. This prior can be interpreted as the prior
p(K) = (e − 1)⁻¹ 1/K! on the number of classifiers, where e ≡ exp(1), and a
uniform p(M|K) that is absorbed by the constant term. Such a prior satisfies
p(K) → 0 for K → ∞ and expresses that we expect the number of classifiers in
the model to be small³.
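To make the shape of this prior concrete, the following Python sketch (an added
illustration, not part of the original text) evaluates the normalized prior
p(K) = (e − 1)⁻¹ 1/K! and its expectation; the truncation point K_max is an
arbitrary choice that the rapid factorial decay makes harmless.

    import math

    # Prior over the number of classifiers: p(K) = (e - 1)^{-1} / K!, K = 1, 2, ...
    def p_K(k):
        return 1.0 / ((math.e - 1.0) * math.factorial(k))

    K_max = 30  # truncation point; 1/K! decays so fast that the tail is negligible
    ks = range(1, K_max + 1)

    normalization = sum(p_K(k) for k in ks)   # approximately 1
    expected_K = sum(k * p_K(k) for k in ks)  # analytically e/(e - 1), about 1.58

    print(normalization, expected_K)

The expectation e/(e − 1) ≈ 1.58 agrees with the observation in footnote 3 that
E(K) < 2.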
7.1.6 The Myth of No Prior Assumptions
A prior in the Bayesian sense is specified by a prior probability distribution
and expresses what is known about a random variable in the absence of some
evidence. For parametric models, the prior usually expresses what the model
parameters are expected to be, in the absence of any observations. As such, it
is part of the assumptions that are made about the data-generating process.
Combining the information of the prior and the data gives the posterior.
The need to specify prior distributions could be considered a weakness of
Bayesian model selection, or even of Bayesian statistics in general. Similarly,
it could be seen as a weakness of the presented approach to defining the best
set of classifiers. This view is justified by the idea that there exist other
methods that do not make any prior assumptions. But is this really the case?
Let us investigate the class of linear models as described in Chap. 5. By
linking the recursive least squares algorithm to ridge regression in Sect. 5.3.5
and to the Kalman filter in Sect. 5.3.6, it was shown that the ridge regression
problem
    min_w ‖Xw − y‖² + λ‖w‖²    (7.5)
is equivalent to conditioning a multivariate Gaussian prior w₀ ∼ N(0, (λτ)⁻¹I)
on the available data {X, y}, where τ is the noise precision of the linear model
with respect to the data. Such a prior means that we assume each element of the
weight vector to be independent — due to the zero off-diagonal elements of the
diagonal covariance matrix — and zero-mean Gaussian with variance (λτ)⁻¹.
That is, we assume the elements most likely to be zero, but they can also have
other values with a likelihood that decreases with their deviation from zero.
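This equivalence can also be checked numerically. The sketch below (an added
illustration; the data, λ, and τ values are arbitrary) compares the ridge
regression solution of (7.5) with the posterior mean obtained by conditioning
the Gaussian prior N(0, (λτ)⁻¹I) on {X, y}:

    import numpy as np

    rng = np.random.default_rng(0)
    N, D = 50, 3
    X = rng.normal(size=(N, D))                                          # inputs
    y = X @ np.array([0.5, -1.0, 2.0]) + rng.normal(scale=0.1, size=N)   # noisy outputs

    lam = 0.7   # ridge parameter lambda (arbitrary)
    tau = 25.0  # noise precision tau; it cancels out of the posterior mean

    # Ridge regression solution of (7.5): w = (X'X + lambda I)^{-1} X'y
    w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

    # Posterior mean under the prior w_0 ~ N(0, (lambda tau)^{-1} I) with noise
    # precision tau: mean = (tau X'X + lambda tau I)^{-1} tau X'y
    w_map = np.linalg.solve(tau * X.T @ X + lam * tau * np.eye(D), tau * X.T @ y)

    print(np.allclose(w_ridge, w_map))  # True: the two solutions coincide

Because τ scales both the prior precision λτI and the likelihood precision, it
cancels from the posterior mean, which is why the ridge solution does not depend
on its value.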
Setting λ = 0 reduces (7.5) to a standard linear least squares problem without
any prior assumptions — as it seems — besides the linear relation between the
³ As pointed out by Dr. Dan Richardson, University of Bath, the prior p(K) ∝ 1/K!
has E(K) < 2 and thus expresses the belief that the number of classifiers is
expected to be on average less than 2. He proposed the alternative prior
p(K) = exp(−V)V^K/K!, where V is a constant related to volume, and E(K)
increases with V.