[Figure: two panels, (a) "Optimal V for Single Reward 1 at Terminal State" and (b) "Optimal V for Reward -1 for Each Action", each plotting the optimal value function against the number of steps from the terminal state (0 to 14).]
Fig. 9.3. Plots showing the optimal value function for the 15-step corridor finite state world for γ = 0.9. The value function in (a) results from describing the task by giving a single reward 1 upon reaching the terminal state, and a reward of 0 for all other transitions. In (b) the values are based on a task description that gives a reward of −1 for all transitions. Note that in both cases the optimal policy is the same, but in (a) all values are positive, and in (b) they are negative.
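To make the two task descriptions concrete, the following sketch computes both optimal value functions in closed form. It assumes a deterministic corridor in which the optimal policy always moves one step closer to the terminal state; the function names and the state indexing (1 to 15 steps away from the terminal state) are illustrative choices, not taken from the text.

```python
# Sketch: optimal values for the 15-step corridor under the two reward
# schemes of Fig. 9.3, assuming a deterministic corridor where the optimal
# policy always moves one step closer to the terminal state.
GAMMA = 0.9
N_STEPS = 15  # states considered: 1..15 steps away from the terminal state

def v_single_reward(n, gamma=GAMMA):
    """Reward 1 on the transition into the terminal state, 0 otherwise:
    V*(n) = gamma^(n-1), so values decay geometrically with distance."""
    return gamma ** (n - 1)

def v_negative_reward(n, gamma=GAMMA):
    """Reward -1 for every transition:
    V*(n) = -(1 - gamma^n) / (1 - gamma), a negative value that grows in
    magnitude with the distance from the terminal state."""
    return -(1 - gamma ** n) / (1 - gamma)

for n in range(1, N_STEPS + 1):
    print(f"{n:2d} steps away:  V_a = {v_single_reward(n):6.3f}   "
          f"V_b = {v_negative_reward(n):7.3f}")
```

Under these assumptions the values match the qualitative picture in Fig. 9.3: in (a) the values shrink geometrically towards 0 with increasing distance, while in (b) they decrease from −1 towards roughly −8 and remain negative throughout.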
expected to suffer from the same problem, even though that remains to be shown
empirically.
Using the Relative Error
Barry proposed two preliminary approaches to handle the problem of long path learning in XCS, both based on making a classifier's error calculation relative to its prediction of the value function [12]. The first approach is to estimate the distance of the matched states to the terminal state and to scale the error accordingly, but it suffers from the inaccuracy of predicting this distance.
A second, more promising alternative proposed in his study is to scale the
measured prediction error by the inverse absolute magnitude of the prediction.
The underlying assumption is that the difference in optimal values between two
successive states is proportional to the absolute magnitude of these values, as
can be seen in Fig. 9.3(a). Consequently, the relative error is larger for states
that are further away from the terminal state, and overly general classifiers are
identified as such. This modification allows XCS to find the optimal policy in the
15-step corridor finite state world, which it fails to do without the modification.
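The sketch below illustrates the idea of a prediction-scaled error rather than Barry's exact formulation; the classifier representation, the learning rate beta, and the function name relative_error_update are assumptions made for the example.

```python
# Sketch of a relative (prediction-scaled) error update for a classifier.
# The measured absolute error is divided by the absolute magnitude of the
# classifier's prediction before the usual Widrow-Hoff style error update.
EPSILON_FLOOR = 1e-6  # guard against division by (near-)zero predictions

def relative_error_update(classifier, target, beta=0.2):
    """Update the classifier's error estimate using the error relative to
    the absolute magnitude of its prediction."""
    absolute_error = abs(target - classifier["prediction"])
    relative_error = absolute_error / max(abs(classifier["prediction"]),
                                          EPSILON_FLOOR)
    classifier["error"] += beta * (relative_error - classifier["error"])
    return classifier["error"]

# Example: far from the terminal state the prediction is small, so the same
# absolute error yields a much larger relative error, which helps to expose
# overly general classifiers.
cl = {"prediction": 0.25, "error": 0.0}
print(relative_error_update(cl, target=0.28))
# relative error = 0.03 / 0.25 = 0.12; updated estimate = 0.2 * 0.12 = 0.024
```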
Where it Fails
The problem of finding the shortest path to the terminal state can also be defined
differently: rather than giving a single reward of 1 upon reaching the terminal
state, one can alternatively punish each transition with a reward of −1. As the
reward is to be maximised, the number of transitions is minimised, and therefore