9.4.5 Consequences for XCS and XCSF
Both XCS and XCSF use Q-Learning as their reinforcement learning component.
To show that this combination is stable, the first step to take is to show the
stability of value iteration with the model structure underlying XCS and XCSF.
XCS uses averaging classifiers, which were shown to be stable with value
iteration. Stability at the global model level is very likely, but depends on the
mixing model, and definite results are still pending.
XCSF in its initial implementation [240, 241], on the other hand, uses classifiers
that model straight lines, and such models are known to be unstable with
value iteration. Thus, XCSF does not even guarantee stability at the classifier
level, and consequently not at the global model level either. As previously mentioned,
it is conjectured that XCSF instead gains its stability at the structure learning level,
which is not considered a satisfactory solution. One should rather aim at
replacing the classifier models such that stability can be guaranteed at the
parameter level. Averaging RL seems to be a good starting point,
but how exactly this can be done still needs to be investigated.
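The instability of straight-line models under value iteration can be illustrated with a small numerical sketch. The following Python fragment is not XCSF itself, but the well-known two-state counterexample in the style of Tsitsiklis and Van Roy: all rewards are zero, so the true values are zero, yet repeatedly re-fitting a single straight-line model V(x) = w phi(x) to the Bellman backup targets makes the weight grow without bound for discount rates close to 1, whereas an averaging (per-state) model contracts towards the correct values. The feature values and the discount rate are chosen purely for illustration.

import numpy as np

gamma = 0.95                      # discount rate close to 1
phi = np.array([1.0, 2.0])        # features of state 1 and state 2

# Straight-line model V(x) = w * phi(x), re-fitted by least squares after
# every backup; both states transition to state 2 and all rewards are zero.
w = 1.0
for _ in range(20):
    targets = gamma * w * phi[1] * np.ones(2)   # Bellman backup targets
    w = (phi @ targets) / (phi @ phi)           # least-squares fit: w <- 1.2 * gamma * w
print("straight-line model weight:", w)         # grows whenever gamma > 5/6

# Averaging model: one value per state (tabular value iteration),
# which contracts towards the true values, here all zero.
V = np.ones(2)
for _ in range(20):
    V = gamma * V[1] * np.ones(2)
print("averaging model values:", V)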
9.5 Further Issues
Besides the stability concerns when using LCS to perform RL, there are still
some further issues to consider, two of which will be discussed in this section:
the learning of long paths, and how to best handle the explore/exploit dilemma.
9.5.1 Long Path Learning
The problem of long path learning is to find the optimal policy in sequential
decision tasks when the solution requires learning of action sequences of sub-
stantial length. As identified by Barry [11, 12], XCS struggles with such tasks
due to the generalisation method that it uses.
While a solution was proposed to handle this problem [12], it was only designed
to work for a particular problem class, as will be shown after discussing how XCS
fails at long path learning. The classifier set optimality criterion from Chap. 7
might provide better results, but in general, long path learning remains an open
problem.
Long path learning is not only an issue for LCS, but for approximate DP
and RL in general. It arises when the approximation glosses over small
differences in the value or action-value function, causing the policy derived
from that function to become sub-optimal. This effect is amplified by
weak discounting (that is, for γ close to 1), which causes the expected return to
differ less between adjacent states.
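As a simple illustration, assume (purely for the sake of the example) that the only reward is a single reward of 1 received on entering the terminal state, and that the optimal policy needs n more steps to reach it from state x. The optimal value of x is then V*(x) = γ^(n-1), so the values of two adjacent states on the optimal path differ by γ^(n-1)(1 - γ). For γ = 0.95 and n = 50 this difference is only about 0.004, which an approximate value function model can easily gloss over, even though preserving it is essential for deriving the shortest-path policy.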
XCS and Long Path Learning
Consider the problem that is shown in Fig. 9.2. The aim is to find the policy that
reaches the terminal state x6 from the initial state x1 in the shortest number of steps.