9.4.5 Consequences for XCS and XCSF
Both XCS and XCSF use Q-Learning as their reinforcement learning component.
To show that this combination is stable, the first step to take is to show the
stability of value iteration with the model structure underlying XCS and XCSF.
XCS uses averaging classifiers, which were shown to be stable with value
iteration. Stability at the global model level is very likely, but depends on the
mixing model, and definite results are still pending.
XCSF in its initial implementation [240, 241], on the other hand, uses classifiers
that model straight lines, and such models are known to be unstable with
value iteration. Thus, XCSF does not even guarantee stability at the classifier
level, and consequently not at the global model level either. As previously mentioned,
it is conjectured that XCSF instead gains its stability at the structure learning level,
which is not considered a satisfactory solution. One should rather aim at
replacing the classifier models such that stability can be guaranteed at the
parameter level. Averaging RL seems to be a good starting point,
but how exactly this can be done still needs to be investigated.
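The instability of straight-line models under value iteration can be illustrated with a small numerical sketch. The following Python fragment is not XCSF itself, but the well-known two-state counterexample in the style of Tsitsiklis and Van Roy: all rewards are zero, so the true values are zero, yet repeatedly re-fitting a single straight-line model V(x) = w phi(x) to the Bellman backup targets makes the weight grow without bound for discount rates close to 1, whereas an averaging (per-state) model contracts towards the correct values. The feature values and the discount rate are chosen purely for illustration.

import numpy as np

gamma = 0.95                      # discount rate close to 1
phi = np.array([1.0, 2.0])        # features of state 1 and state 2

# Straight-line model V(x) = w * phi(x), re-fitted by least squares after
# every backup; both states transition to state 2 and all rewards are zero.
w = 1.0
for _ in range(20):
    targets = gamma * w * phi[1] * np.ones(2)   # Bellman backup targets
    w = (phi @ targets) / (phi @ phi)           # least-squares fit: w <- 1.2 * gamma * w
print("straight-line model weight:", w)         # grows whenever gamma > 5/6

# Averaging model: one value per state (tabular value iteration),
# which contracts towards the true values, here all zero.
V = np.ones(2)
for _ in range(20):
    V = gamma * V[1] * np.ones(2)
print("averaging model values:", V)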
9.5 Further Issues
Besides the stability concerns when using LCS to perform RL, there are still
some further issues to consider, two of which will be discussed in this section:
the learning of long paths, and how to best handle the explore/exploit dilemma.
9.5.1 Long Path Learning
The problem of long path learning is to find the optimal policy in sequential
decision tasks when the solution requires learning of action sequences of sub-
stantial length. As identified by Barry [11, 12], XCS struggles with such tasks
due to the generalisation method that it uses.
While a solution was proposed to handle this problem [12], it was only designed
to work for a particular problem class, as will be shown after discussing how XCS
fails at long path learning. The classifier set optimality criterion from Chap. 7
might provide better results, but in general, long path learning remains an open
problem.
Long path learning is not only an issue for LCS, but for approximate DP
and RL in general. It arises when the approximation glosses over small
differences in the value or action-value function, causing the policy derived
from that function to become sub-optimal. This effect is amplified by
weak discounting (that is, for γ close to 1), which causes the expected return to
differ less between adjacent states.
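As a simple illustration, assume (purely for the sake of the example) that the only reward is a single reward of 1 received on entering the terminal state, and that the optimal policy needs n more steps to reach it from state x. The optimal value of x is then V*(x) = γ^(n-1), so the values of two adjacent states on the optimal path differ by γ^(n-1)(1 - γ). For γ = 0.95 and n = 50 this difference is only about 0.004, which an approximate value function model can easily gloss over, even though preserving it is essential for deriving the shortest-path policy.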
XCS and Long Path Learning
Consider the problem that is shown in Fig. 9.2. The aim is to find the policy that
reaches the terminal state x6 from the initial state x1 in the shortest number of steps.