Towards Reinforcement Learning with LCS - Design and Analysis of Learning Classifier Systems

Information Technology Reference

In-Depth Information

9.4.3

· ∞

In the following, π ( x ) denotes the sampling probability for state x according to

the steady state distribution of the Markov chain P μ ,and π k ( x )= m k ( x ) π ( x )

denotes this distribution augmented by the matching of classifier k .Also, M k =

diag( m k ( x 1 ) ,...,m k ( x N )) is the matching matrix, as in Chap. 5, D =

diag( π ( x 1 ) ,...,π ( x N )) is the sampling distribution matrix, and D k = M k D

is the sampling distribution matrix with respect to classifier k . Each classifier k

aims at finding the weight vector w k that minimises

Non-expansion with Respect to

D k ,whichisgi-

ven by w k =( X T D k X ) − 1 X T D k V . Thus, a classifier's approximation operator is

the projection matrix Π k = X ( X T D k X ) − 1 X T D k . such that V = X w = Π k V .

It cannot be guaranteed that general linear models form a non-expansion with

respect to

Xw k −

· ∞ . Gordon, for example, has shown that this is not the case for

straight line models [96]. Averagers, on the other hand, are a form of linear

model, but provide a non-expansion with respect to

· ∞ and thus can be used

for both value iteration and policy iteration.

With respect to LCS, each single classifier, as well as the whole set of classifiers

need to conform to the non-expansion property. This rules out the general use

of linear model classifiers. Instead, only averaging classifiers (see Ex. 5.2) will be

discussed, as their model provides a non-expansion with respect to

Lemma 9.1. The model of averaging classifiers forms a non-expansion with re-

spect to the maximum norm.

Proof. As for averaging classifiers X =(1 ,..., 1) T , their approximation operator

is the same for all states and is given by

( Π k V ) j =Tr( D k ) − 1

x∈X

π k ( x ) V ( x ) .

(9.35)

Let V , V be two vectors such that V

≤

V , which implies that the vector

a = V −

V is non-negative in all its elements. Thus, we have for any i ,

( Π k V ) i =Tr( D k ) − 1

x∈X

Tr( D k ) − 1

x∈X

π k ( x ) V ( x )

−

π k ( x ) a

≤

( Π k V ) i , (9.36)

due to the non-negativity of the elements of D k a . Also, for any scalar b and

vector e =(1 ,..., 1) T ,

( Π k ( V + b e )) i =( Π k V ) i + b

(9.37)

holds.

Let V , V now be two arbitrary vectors, not bound to V

V ,andlet

≤

V ∞ =max i |

V i |

c =

−

V i −

be their maximum absolute distance. Thus, for

any i ,

V i

V i −

≤

V i + c

(9.38)

holds. Applying Π k and using the above properties gives

( Π k V ) i ≤

( Π k V ) i −

≤

( Π k V ) i + c,

(9.39)

Design and Analysis of Learning Classifier Systems

Search WWH ::

Custom Search

Home