the optimal policy is the same as before. Fig. 9.3(b) shows the optimal value
function for the modified problem definition.
Observe that, in contrast to Fig. 9.3(a), all values of V are negative or zero,
and their absolute magnitude grows with the distance from the terminal state.
The difference in magnitude between two successive states, on the other hand,
still decreases with the distance from the terminal state. This clearly violates the
assumption that this difference is proportional to the absolute magnitude of the
values, as the modified problem definition causes exactly the opposite pattern.
Hence, the relative error approach will certainly fail, as it was not designed to
handle such cases.
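To make the pattern concrete, suppose (as an illustrative assumption, since the modified reward scheme is not restated here) that every transition yields a reward of -1, the terminal state has value zero, and rewards are discounted by a factor γ. For a state at distance d from the terminal state, the optimal value and the difference between successive states are then

\[
V^*(d) = -\sum_{k=0}^{d-1} \gamma^k = -\frac{1-\gamma^d}{1-\gamma},
\qquad
\big|V^*(d) - V^*(d+1)\big| = \gamma^d ,
\]

so |V^*(d)| grows with d while the successive difference shrinks as γ^d, which is exactly the inversion of the proportionality that the relative error measure assumes.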
To create a task where the relative error measure fails, the problem had to be
redefined such that the value function takes exclusively negative values. While
it might be possible to do the opposite and redefine each problem such that it
conforms to the assumption that the relative error measure is based on, an alter-
native that does not require modification of the problem definition is preferable.
A Possible Alternative?
It was shown in Sect. 8.3.4 that the optimality criterion that was introduced in
Chap. 7 is able to handle problems where the noise differs in different areas of
the input space. Given that it is possible to use this criterion in an incremen-
tal implementation, will such an implementation be able to perform long path
learning?
As previously discussed (see Sects. 5.1.2 and 7.2.2), a linear classifier model attributes all observed deviation from its linear model to measurement noise (implicitly including the stochasticity of the data-generating process). In reinforcement learning, an additional component of stochasticity is introduced by the ongoing updates of the value function estimates, which make these estimates non-stationary. Thus, in order for the LCS model to provide a good representation of the value function estimate, it needs to be able to handle both the measurement noise and the update noise, a differentiation that is absent from Barry's work [11, 12].
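As a rough illustration of this update noise, the following sketch (a minimal, hypothetical setup, not the experiments discussed here) runs tabular TD(0) on a fully deterministic corridor, so that the environment contributes no measurement noise at all; any fluctuation in the regression targets then stems entirely from the bootstrapped value estimates being updated.

    # Deterministic corridor: states 0..N-1, state 0 is terminal.
    # Reward -1 per transition, discount GAMMA; the environment itself is
    # noise-free, so all fluctuation in the targets is update noise.
    N, GAMMA, ALPHA = 15, 0.99, 0.1   # illustrative values, not from the text
    V = [0.0] * N                     # tabular value function estimate

    target_trace = []                 # TD targets observed for state N-1
    for episode in range(200):
        s = N - 1                     # always start at the far end
        while s > 0:
            s_next = s - 1                        # deterministic step to terminal
            target = -1.0 + GAMMA * V[s_next]     # bootstrapped TD(0) target
            if s == N - 1:
                target_trace.append(target)
            V[s] += ALPHA * (target - V[s])       # TD(0) update
            s = s_next

    print("first targets:", [round(t, 2) for t in target_trace[:3]])
    print("last targets: ", [round(t, 2) for t in target_trace[-3:]])

Running this shows the targets for the far state drifting from -1 towards roughly -(1-γ^(N-1))/(1-γ) as the downstream estimates converge, so a model fitting these targets faces a non-stationary signal even in a noise-free environment.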
Let us assume that the optimality criterion causes the size of the area of the
input space that is matched by a classifier to be proportional to the level of
noise in the data, such that the model is refined in areas where the observations
are known to accurately represent the data-generating process. Considering only
measurement noise, when applied to value function approximation this would
lead to having more specific classifiers in states where the difference in magnitude
of the value function for successive states is low, as in such areas this noise is
deemed to be low. Therefore, the optimality criterion should provide an adequate
value function approximation of the optimal value function, even in cases where
long action sequences need to be represented.
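The expected consequence can be sketched numerically. If the size of a classifier's matched interval is taken to be proportional to the local noise level, and the successive value differences γ^d from the worked example above serve as a crude proxy for that level, then the classifiers far from the terminal state, where the differences are small, come out most specific. Both the proportionality rule and the noise proxy are assumptions made purely for illustration; they are not the optimality criterion itself.

    # Illustrative sketch: classifier matching width proportional to a local
    # noise proxy; the proxy and the proportionality rule are assumptions.
    GAMMA, N, SCALE = 0.9, 30, 10     # hypothetical constants

    def noise_proxy(d):
        # Successive value difference |V*(d) - V*(d+1)| = GAMMA**d serves as
        # a crude stand-in for the locally estimated noise level.
        return GAMMA ** d

    # Cover the corridor with intervals whose width tracks the proxy.
    d, classifiers = 0, []
    while d < N:
        width = max(1, round(SCALE * noise_proxy(d)))
        classifiers.append((d, min(d + width, N) - 1))
        d += width

    for lo, hi in classifiers:
        print(f"classifier matches distances {lo:2d}..{hi:2d}")

The output shows coarse classifiers near the terminal state, where the successive differences (and hence the deemed noise) are large, and increasingly specific ones further away, which is the behaviour the argument above predicts for long paths.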
Turning to the update noise, its magnitude is related to the magnitude of the
optimal value function, as demonstrated in Fig. 9.4: the noise appears to be
largest where the optimal values themselves are large. Due
to this noise, the model in such areas will most likely be coarse. With respect
to the corridor finite state world, for which the optimal value function is shown