only use LCS with RL if the LCS implementation can handle incremental model
parameter updates. Additionally, while approximating the action-value function
is a simple univariate regression task, the target function to approximate is non-stationary because it is itself updated sequentially. Thus, in addition to incremental learning, the LCS implementation needs to be able to handle non-stationary target functions.
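As a concrete illustration of the kind of update this calls for, the sketch below performs a single least mean squares (LMS) step on a linear model. With a step size that stays bounded away from zero, the estimate keeps tracking a drifting target rather than converging to a fixed average; the function name and step size are illustrative assumptions, not values prescribed by the text.

import numpy as np

def lms_step(w, x, target, step_size=0.05):
    """One incremental LMS step of a linear model w towards a target value.

    A constant step size lets the estimate follow a non-stationary
    (sequentially updated) target; a step size decaying to zero would
    eventually stop adapting.
    """
    error = target - w @ x             # prediction error for this observation
    return w + step_size * error * x   # gradient step on the squared error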
This section demonstrates how to derive Q-Learning with the LCS model as
introduced in Chap. 4, to act as a template for how any LCS model type can
be used for RL. Some of the introduced principles and derivations are specific
to the LCS model with independently trained classifiers, but the underlying
ideas also transfer to other LCS model types. The derivations themselves are
performed from first principles to make explicit the usually implicit design deci-
sions. Concentrating purely on incremental model parameter learning, the model structure $\mathcal{M}$ is assumed to be constant. In particular, the derivation focuses on
the classifier parameter updates, as these are the most relevant with respect to
RL.
Even though the Bayesian update equations from Chap. 7 protect against overfitting, this section falls back to the maximum likelihood learning that formed the basis for the incremental methods described in Chaps. 5 and 6. The resulting
update equations conform exactly to XCS(F), which reveals its design principles
and should clarify some of the recent confusion about how to implement gradi-
ent descent in XCS(F). An additional bonus is a more accurate noise precision
update method for XCS(F) based on the methods developed in Chap. 5.
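For orientation, the sketch below shows a generic recursive least squares (RLS) step with a forgetting factor, the family of weight updates on which the RLS-based derivation builds; it is a textbook form under assumed variable names, not the exact matching-weighted XCS(F) equations derived later in this section.

import numpy as np

def rls_step(w, P, x, target, lam=0.99):
    """One recursive least squares step with forgetting factor lam < 1.

    w -- current weight vector of the linear model
    P -- running inverse of the exponentially discounted input correlation matrix
    Discounting old observations by lam allows the estimate to follow a
    non-stationary target function.
    """
    Px = P @ x
    gain = Px / (lam + x @ Px)               # gain vector for this observation
    error = target - w @ x                   # prediction error before the update
    w_new = w + gain * error                 # weight update
    P_new = (P - np.outer(gain, Px)) / lam   # update of the inverse correlation matrix
    return w_new, P_new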
Firstly, the LCS approximation operator that conforms to the LCS model type of this work is introduced. This is followed by a discussion of how the principle of independent classifier training relates to how DP and RL update the value and action-value function estimates, which is essential for using this LCS model type to perform RL. As Q-Learning is based on asynchronous value iteration, it is first shown how LCS can perform asynchronous value iteration, followed by the derivation of two Q-Learning variants, one based on LMS and the other on RLS. Finally, these derivations are related to the recent controversy about how XCS(F) correctly performs Q-Learning with gradient descent.
9.3.1 Approximating the Value Function
Given a value vector $V$, LCS approximates it by a set of $K$ localised models $\{\hat{V}_k\}$ that are combined to form a global model $\hat{V}$. The localised models are provided by the classifiers, and the mixing model is used to combine these to the global model.
Each classifier $k$ matches a subset of the state space that is determined by its matching function $m_k$, which returns for each state $x$ the probability $m_k(x)$ of matching it. Let us for now assume that we approximate the value function $V$ rather than the action-value function $Q$. Then, classifier $k$ provides the probabilistic model $p(V \mid x, \theta_k)$ that gives the probability of the expected return of state $x$ having the value $V$. Assuming linear classifiers (5.3), this model is given by
$$
p(V \mid x, \theta_k) = \mathcal{N}\!\left(V \mid w_k^{\top} x, \tau_k^{-1}\right), \qquad (9.19)
$$
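To make concrete how these local models yield a prediction, the sketch below combines the classifiers' linear predictions $w_k^{\top} x$ into a global value estimate. It uses plain matching-weighted averaging as a stand-in for the mixing model; the actual mixing models of this work are developed in Chap. 6, so the normalisation used here is an illustrative assumption only.

import numpy as np

def global_value_estimate(x, matching_funcs, weights):
    """Combine local linear predictions w_k^T x into a global estimate of V(x).

    x              -- feature vector of the state (including any bias element)
    matching_funcs -- matching functions m_k, each returning m_k(x) in [0, 1]
    weights        -- weight vectors w_k, one per classifier
    """
    m = np.array([m_k(x) for m_k in matching_funcs])  # matching probabilities m_k(x)
    local = np.array([w_k @ x for w_k in weights])    # local predictions w_k^T x
    if m.sum() == 0.0:
        return 0.0                                    # no classifier matches this state
    g = m / m.sum()                                   # normalised mixing weights g_k(x)
    return float(g @ local)                           # weighted sum over classifiers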
 