constant element (which is usually fixed to 1), which has the same effect. For
example, consider the input space to be the set of reals; that is, $\mathcal{X} = \mathbb{R}$, $D_{\mathcal{X}} = 1$,
and both $x$ and $w$ are scalars. In such a case, the assumption of a linear model
implies that the observed output follows $xw$, which is a straight line through the
origin with slope $w$. To add the bias term, we can instead assume an augmented
input space $\mathcal{X} = \{1\} \times \mathbb{R}$, with input vectors $\mathbf{x} = (1, x)^T$, resulting in the linear
model $\mathbf{w}^T \mathbf{x} = w_1 + w_2 x$, a straight line with slope $w_2$ and bias $w_1$. Equally, the
input vector can be augmented by other elements to extend the expressiveness
of the linear model, as shown in the following example:
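To make the effect of this augmentation concrete, the following sketch (an illustration, not part of the original text) fits a line with and without the bias element by ordinary least squares; the synthetic data and variable names are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-2.0, 2.0, size=50)
y = 1.5 * x + 0.8 + rng.normal(0.0, 0.1, size=50)  # line with slope 1.5 and bias 0.8

# Without augmentation: X = R, model y = x * w, a line through the origin.
X_plain = x.reshape(-1, 1)
w_plain, *_ = np.linalg.lstsq(X_plain, y, rcond=None)

# With augmentation: x = (1, x)^T, model y = w_1 + w_2 * x.
X_aug = np.column_stack([np.ones_like(x), x])
w_aug, *_ = np.linalg.lstsq(X_aug, y, rcond=None)

print(w_plain)  # single slope, forced through the origin; misfits the data
print(w_aug)    # [bias w_1, slope w_2], close to [0.8, 1.5]
```

The unaugmented fit cannot absorb the offset of 0.8 and distorts its slope estimate to compensate, whereas the augmented model recovers both slope and bias.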
Example 5.1 (Common Classifier Models used in XCS(F)). Initially, classifiers in
XCS [237, 238] only provided a single prediction, independent of the input. Such
behaviour is equivalent to having the scalar input $x_n = 1$ for all $n$, as the weight $w$
then models the output as an average over all matched outputs, as will be demonstrated
in Example 5.2. Hence, such classifiers will be called averaging classifiers.
Later, Wilson introduced XCSF (the F standing for "function"), which initially
used straight lines as the local models [241]. Hence, in the one-dimensional case,
the inputs are given by $\mathbf{x}_n = (1, i_n)^T$ to model the output by $w_1 + w_2 i_n$, where
$i_n$ is the variable part of the input. This concept was taken further by Lanzi
et al. [141] by applying 2nd and 3rd order polynomials, using the input vectors
$\mathbf{x}_n = (1, i_n, i_n^2)^T$ and $\mathbf{x}_n = (1, i_n, i_n^2, i_n^3)^T$ respectively. Naturally, the input vector
does not need to be restricted to taking $i_n$ to some power, but allows for the
use of arbitrary functions. These functions are known as basis functions, as they
form the basis of the input space. Nonetheless, increasing the complexity of
the input space makes it harder to interpret the local models. Hence, if the
aim is to understand the localised model, these models should be kept simple,
such as straight lines.
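The three classifier models of Example 5.1 differ only in how the input vector is constructed. The sketch below (hypothetical code, not from the original text) builds the corresponding design matrices and fits each model by least squares:

```python
import numpy as np

def design_matrix(i, degree):
    """Stack input vectors x_n = (1, i_n, i_n^2, ..., i_n^degree)^T as rows.
    degree=0 yields averaging classifiers, degree=1 straight lines, etc."""
    return np.column_stack([i ** k for k in range(degree + 1)])

rng = np.random.default_rng(1)
i = rng.uniform(0.0, 1.0, size=100)                    # variable part of the input
y = np.sin(2 * np.pi * i) + rng.normal(0.0, 0.05, 100)  # noisy nonlinear target

for degree, name in [(0, "averaging"), (1, "straight line"), (3, "3rd-order poly")]:
    X = design_matrix(i, degree)
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(name, "weights:", np.round(w, 3))
```

With degree 0 the design matrix is a column of ones, so the single fitted weight is simply the mean of the matched outputs, exactly the behaviour of an averaging classifier.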
5.1.2 Gaussian Noise
The noise term captures the stochasticity of the data-generating process and
the measurement noise. In the case of linear models, the inputs and outputs
are assumed to stand in a linear relation. Every deviation from this relation
is captured by $\epsilon$ and is interpreted as noise. Hence, assuming the absence of
measurement noise, the fluctuation of $\epsilon$ gives information about the adequacy of
assuming a linear model. In other words, if the variance of $\epsilon$ is small, then inputs
and outputs do indeed follow a linear relation. Hence, the variance of $\epsilon$ can be
used as a measure of how well the local model fits the data. For that reason, the
aim is not only to find a weight vector that maximises the likelihood, but also
to simultaneously estimate the variance of $\epsilon$.
For linear models it is common to assume that the random variable $\epsilon$ representing
the noise has zero mean, constant variance, and follows a normal
distribution [97], that is, $\epsilon \sim \mathcal{N}(0, \tau^{-1})$, where $\tau$ is the noise precision (inverse
noise variance). Hence, in combination with (5.1), and for some realisation $\mathbf{w}$ of
$\boldsymbol{\omega}$ and input $\mathbf{x}$, the output is modelled by

$$p(y \,|\, \mathbf{x}, \mathbf{w}, \tau) = \mathcal{N}(y \,|\, \mathbf{w}^T \mathbf{x}, \tau^{-1}) = \left(\frac{\tau}{2\pi}\right)^{1/2} \exp\left(-\frac{\tau}{2}\left(\mathbf{w}^T \mathbf{x} - y\right)^2\right). \tag{5.3}$$
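The density (5.3) translates directly into code. The sketch below (an illustration under the stated model assumptions; data and names are hypothetical) evaluates it for given $\mathbf{w}$ and $\tau$, and estimates $\tau$ as the inverse variance of the residuals of a least-squares fit, in line with using the noise variance as a measure of fit quality:

```python
import numpy as np

def gaussian_likelihood(y, x, w, tau):
    """p(y | x, w, tau) = N(y | w^T x, tau^{-1}), as in (5.3)."""
    return np.sqrt(tau / (2.0 * np.pi)) * np.exp(-0.5 * tau * (w @ x - y) ** 2)

rng = np.random.default_rng(2)
X = np.column_stack([np.ones(200), rng.uniform(-1, 1, 200)])  # x_n = (1, i_n)^T
w_true = np.array([0.5, 2.0])
y = X @ w_true + rng.normal(0.0, 0.2, size=200)  # noise std 0.2, so tau = 25

w_ml, *_ = np.linalg.lstsq(X, y, rcond=None)     # maximum-likelihood weights
residuals = y - X @ w_ml
tau_ml = 1.0 / residuals.var()                   # precision = inverse noise variance
print(np.round(w_ml, 3), round(tau_ml, 1))       # w near (0.5, 2.0), tau near 25

print(gaussian_likelihood(y[0], X[0], w_ml, tau_ml))  # density of one observation
```

A large estimated $\tau$ (small residual variance) indicates that the linear model describes the local data well; a small $\tau$ signals systematic deviation from linearity.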