are sampled according to the steady-state distribution of the Markov chain P_μ
[215]. As long as the states are visited by performing actions according to μ,
such a sampling regime is guaranteed. On the other hand, counterexamples where
sampling of the states does not follow this distribution were shown to potentially
lead to divergence [8, 25, 96, 214].
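As a minimal numerical sketch of what on-policy sampling converges to, the steady-state distribution of a Markov chain can be approximated by repeatedly applying the transition matrix to an initial distribution. The 3-state transition matrix below is purely illustrative and not taken from the text:

```python
import numpy as np

# Hypothetical transition matrix P_mu induced by a fixed policy mu
# (rows sum to 1); the numbers are illustrative only.
P_mu = np.array([
    [0.5, 0.4, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.3, 0.6],
])

def steady_state(P, iters=1000):
    """Approximate the steady-state distribution d satisfying d = d P
    by repeated application of the transition matrix (power iteration)."""
    d = np.full(P.shape[0], 1.0 / P.shape[0])  # start from the uniform distribution
    for _ in range(iters):
        d = d @ P
    return d

d = steady_state(P_mu)
# d approximately satisfies d @ P_mu == d, i.e. it is the distribution
# that on-policy state visitation converges to.
```

Sampling states according to d corresponds to the regime under which the convergence guarantees hold; sampling from some other distribution is what the cited counterexamples exploit.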
The same stability issues also apply to the related RL methods: Q-Learning
was shown to diverge in some cases when used with linear approximation archi-
tectures [27], analogous to approximate value iteration. Thus, special care needs
to be taken when Q-Learning is used in LCS.
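The update in question can be sketched as follows; this is the standard Q-Learning rule for a linear architecture Q(s, a) = wᵀφ(s, a), with made-up feature vectors and learning parameters chosen purely for illustration:

```python
import numpy as np

def q_learning_step(w, phi_s_a, r, phi_next_best, alpha=0.1, gamma=0.9):
    """One Q-Learning update for a linear architecture Q(s, a) = w . phi(s, a).

    phi_next_best is the feature vector of the greedy action in the
    successor state; alpha and gamma are illustrative values.
    """
    td_error = r + gamma * (w @ phi_next_best) - (w @ phi_s_a)
    return w + alpha * td_error * phi_s_a

# A single update from a zero weight vector with hypothetical features.
w = np.zeros(2)
w = q_learning_step(w, np.array([1.0, 0.0]), r=1.0,
                    phi_next_best=np.array([0.0, 1.0]))
```

Because the bootstrapped target r + γ maxₐ' Q(s', a') itself depends on w, repeated updates of this form can drive the weights out of bounds for some feature sets and sampling distributions, which is the instability referred to above.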
To summarise, if Π is a non-expansion with respect to ‖·‖, its use for
approximate value iteration and policy iteration is guaranteed to be stable. If
it is a non-expansion with respect to ‖·‖_D, then it is stable when used for
approximate policy iteration, but its stability with approximate value iteration
is not guaranteed. Even if the function approximation method is a non-expansion
with respect to neither of these norms, this does not necessarily mean that it
will diverge when used with approximate DP. However, one then needs to resort
to approaches other than contraction and non-expansion to analyse its stability.
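The non-expansion property can be illustrated numerically. An averaging approximator Π f = A f, where A has non-negative rows summing to at most 1, is a non-expansion in the maximum norm, since each output component is a convex combination of the inputs. The matrix and test vectors below are illustrative:

```python
import numpy as np

# Row-stochastic averaging matrix: each row is non-negative and sums to 1,
# so Pi(f) = A_avg @ f averages the components of f.
A_avg = np.array([
    [0.5, 0.5, 0.0],
    [0.0, 0.5, 0.5],
    [0.5, 0.0, 0.5],
])

rng = np.random.default_rng(0)
f, g = rng.normal(size=3), rng.normal(size=3)

# Non-expansion in the max norm: ||Pi f - Pi g||_inf <= ||f - g||_inf.
lhs = np.max(np.abs(A_avg @ f - A_avg @ g))
rhs = np.max(np.abs(f - g))
assert lhs <= rhs
```

A general least-squares projection, by contrast, can expand distances in the maximum norm, which is precisely why linear architectures combined with value iteration lack the stability guarantee discussed above.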
9.4.2 Stability on the Structure and the Parameter Learning Level
Approximating the action-value function with LCS requires, on one hand, finding
a good set of classifiers and, on the other hand, correctly learning the
parameters of that set. In other words, we want to find a good model structure
M and the correct model parameters θ for that structure, as discussed in
Chap. 3.
Incremental learning can be performed on the structure level as well as the
parameter level (see Sect. 8.4.2). Similarly, stability of using LCS with DP can
be considered at both of these levels.
Stability on the Structure Learning Level
Divergence of DP with function approximation manifests itself in the values of
the value function estimate rapidly growing out of bounds (for example, [25]). Let us
assume that for some fixed LCS model structure, the parameter learning process
diverges when used with DP, and that there exist model structures for which
this is not the case.
Divergence of the parameters usually happens locally, that is, not for all clas-
sifiers at once. Therefore, it can be detected by monitoring the model error of
single classifiers, which, for linear classifier models as given in Chap. 5, would
be the model noise variance. Subsequently, divergent classifiers can be detected
and replaced until the model structure allows the parameter learning to be
compatible with the DP method used.
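The detect-and-replace step described above can be sketched as follows. The classifier representation, the divergence threshold, and the re-initialisation scheme are all assumptions made for illustration; the text specifies only that divergence is detected via the noise variance and the classifier is replaced:

```python
import numpy as np

DIVERGENCE_THRESHOLD = 1e6  # hypothetical cut-off, not specified in the text

def replace_divergent(classifiers):
    """Replace classifiers whose estimated model noise variance has blown up.

    Each classifier is represented here as a dict with a 'noise_var'
    estimate and a 'weights' vector; this representation is an
    assumption for illustration only.
    """
    for cl in classifiers:
        if not np.isfinite(cl["noise_var"]) or cl["noise_var"] > DIVERGENCE_THRESHOLD:
            # Re-initialise the divergent classifier in place; its matching
            # condition (not modelled here) could also be resampled.
            cl["weights"] = np.zeros_like(cl["weights"])
            cl["noise_var"] = 1.0
    return classifiers

pop = [
    {"noise_var": 0.3, "weights": np.array([0.2, -0.1])},   # healthy
    {"noise_var": 1e9, "weights": np.array([5e8, -3e8])},   # diverged
]
pop = replace_divergent(pop)
```

Monitoring a per-classifier error signal in this way localises the divergence check, so a single runaway classifier does not require discarding the whole model structure.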
XCSF uses linear classifier models and Q-Learning, but such combinations
are known to be unstable [25]. However, XCSF has never been reported to show
divergent behaviour. Thus, it is conjectured that it provides stability on the
model structure level by replacing divergent classifiers with potentially better
ones.