observe that M contains K matching functions such that the set of possible M
grows exponentially with K .
The question of how to best specify p(M), and if there even is a "best"
prior on M, is not completely clear and requires further investigation. For now,
p(M) ∝ 1/K!, or

    ln p(M) = −ln K! + const.,    (7.4)

is used for illustrative purposes. This prior can be interpreted as the prior
p(K) = (e − 1)⁻¹ 1/K! on the number of classifiers, where e ≡ exp(1), and a
uniform p(M|K) that is absorbed by the constant term. Such a prior satisfies
p(K) → 0 for K → ∞ and expresses that we expect the number of classifiers in
the model to be small³.
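To make the shape of this prior concrete, the following Python sketch (an added
illustration, not part of the original text) evaluates the normalized prior
p(K) = (e − 1)⁻¹ 1/K! and its expectation; the truncation point K_max is an
arbitrary choice that the rapid factorial decay makes harmless.

    import math

    # Prior over the number of classifiers: p(K) = (e - 1)^{-1} / K!, K = 1, 2, ...
    def p_K(k):
        return 1.0 / ((math.e - 1.0) * math.factorial(k))

    K_max = 30  # truncation point; 1/K! decays so fast that the tail is negligible
    ks = range(1, K_max + 1)

    normalization = sum(p_K(k) for k in ks)   # approximately 1
    expected_K = sum(k * p_K(k) for k in ks)  # analytically e/(e - 1), about 1.58

    print(normalization, expected_K)

The expectation e/(e − 1) ≈ 1.58 agrees with the observation in footnote 3 that
E(K) < 2.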
7.1.6 The Myth of No Prior Assumptions
A prior in the Bayesian sense is specified by a prior probability distribution
and expresses what is known about a random variable in the absence of some
evidence. For parametric models, the prior usually expresses what the model
parameters are expected to be, in the absence of any observations. As such, it
is part of the assumptions that are made about the data-generating process.
Combining the information of the prior and the data gives the posterior.
The need to specify prior distributions could be considered a weakness of
Bayesian model selection, or even of Bayesian statistics in general. Similarly,
it could be seen as a weakness of the presented approach to defining the best
set of classifiers. This view is justified by the idea that there exist other
methods that do not make any prior assumptions. But is this really the case?
Let us investigate the class of linear models as described in Chap. 5. By
linking the recursive least squares algorithm to ridge regression in Sect. 5.3.5
and to the Kalman filter in Sect. 5.3.6, it was shown that the ridge regression
problem
    min_w ‖Xw − y‖² + λ‖w‖²    (7.5)
is equivalent to conditioning a multivariate Gaussian prior w₀ ∼ N(0, (λτ)⁻¹I)
on the available data {X, y}, where τ is the noise precision of the linear model
with respect to the data. Such a prior means that we assume each element of the
weight vector to be independent — due to the zero off-diagonal elements of the
diagonal covariance matrix — and zero-mean Gaussian with variance (λτ)⁻¹.
That is, we assume the elements most likely to be zero, but they can also have
other values with a likelihood that decreases with their deviation from zero.
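This equivalence can also be checked numerically. The sketch below (an added
illustration; the data, λ, and τ values are arbitrary) compares the ridge
regression solution of (7.5) with the posterior mean obtained by conditioning
the Gaussian prior N(0, (λτ)⁻¹I) on {X, y}:

    import numpy as np

    rng = np.random.default_rng(0)
    N, D = 50, 3
    X = rng.normal(size=(N, D))                                          # inputs
    y = X @ np.array([0.5, -1.0, 2.0]) + rng.normal(scale=0.1, size=N)   # noisy outputs

    lam = 0.7   # ridge parameter lambda (arbitrary)
    tau = 25.0  # noise precision tau; it cancels out of the posterior mean

    # Ridge regression solution of (7.5): w = (X'X + lambda I)^{-1} X'y
    w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

    # Posterior mean under the prior w_0 ~ N(0, (lambda tau)^{-1} I) with noise
    # precision tau: mean = (tau X'X + lambda tau I)^{-1} tau X'y
    w_map = np.linalg.solve(tau * X.T @ X + lam * tau * np.eye(D), tau * X.T @ y)

    print(np.allclose(w_ridge, w_map))  # True: the two solutions coincide

Because τ scales both the prior precision λτI and the likelihood precision, it
cancels from the posterior mean, which is why the ridge solution does not depend
on its value.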
Setting λ = 0 reduces (7.5) to a standard linear least squares problem without
any prior assumptions — as it seems — besides the linear relation between the
³ As pointed out by Dr. Dan Richardson, University of Bath, the prior p(K) ∝ 1/K!
has E(K) < 2 and thus expresses the belief that the number of classifiers is
expected to be on average less than 2. He proposed the alternative prior
p(K) = exp(−V)V^K/K!, where V is a constant related to volume, and E(K)
increases with V.