are sampled according to the steady-state distribution of the Markov chain P_μ
[215]. As long as the states are visited by performing actions according to μ,
such a sampling regime is guaranteed. On the other hand, counterexamples where
sampling of the states does not follow this distribution were shown to potentially
lead to divergence [8, 25, 96, 214].
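As a minimal numerical sketch of what on-policy sampling converges to, the steady-state distribution of a Markov chain can be approximated by repeatedly applying the transition matrix to an initial distribution. The 3-state transition matrix below is purely illustrative and not taken from the text:

```python
import numpy as np

# Hypothetical transition matrix P_mu induced by a fixed policy mu
# (rows sum to 1); the numbers are illustrative only.
P_mu = np.array([
    [0.5, 0.4, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.3, 0.6],
])

def steady_state(P, iters=1000):
    """Approximate the steady-state distribution d satisfying d = d P
    by repeated application of the transition matrix (power iteration)."""
    d = np.full(P.shape[0], 1.0 / P.shape[0])  # start from the uniform distribution
    for _ in range(iters):
        d = d @ P
    return d

d = steady_state(P_mu)
# d approximately satisfies d @ P_mu == d, i.e. it is the distribution
# that on-policy state visitation converges to.
```

Sampling states according to d corresponds to the regime under which the convergence guarantees hold; sampling from some other distribution is what the cited counterexamples exploit.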
The same stability issues also apply to the related RL methods: Q-Learning
was shown to diverge in some cases when used with linear approximation archi-
tectures [27], analogous to approximate value iteration. Thus, special care needs
to be taken when Q-Learning is used in LCS.
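The update in question can be sketched as follows; this is the standard Q-Learning rule for a linear architecture Q(s, a) = wᵀφ(s, a), with made-up feature vectors and learning parameters chosen purely for illustration:

```python
import numpy as np

def q_learning_step(w, phi_s_a, r, phi_next_best, alpha=0.1, gamma=0.9):
    """One Q-Learning update for a linear architecture Q(s, a) = w . phi(s, a).

    phi_next_best is the feature vector of the greedy action in the
    successor state; alpha and gamma are illustrative values.
    """
    td_error = r + gamma * (w @ phi_next_best) - (w @ phi_s_a)
    return w + alpha * td_error * phi_s_a

# A single update from a zero weight vector with hypothetical features.
w = np.zeros(2)
w = q_learning_step(w, np.array([1.0, 0.0]), r=1.0,
                    phi_next_best=np.array([0.0, 1.0]))
```

Because the bootstrapped target r + γ maxₐ' Q(s', a') itself depends on w, repeated updates of this form can drive the weights out of bounds for some feature sets and sampling distributions, which is the instability referred to above.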
To summarise, if Π is a non-expansion with respect to ‖·‖, its use for
approximate value iteration and policy iteration is guaranteed to be stable. If
it is a non-expansion with respect to ‖·‖_D, then it is stable when used for
approximate policy iteration, but its stability with approximate value iteration
is not guaranteed. Even if the function approximation method is a non-expansion
with respect to neither of these norms, this does not necessarily mean that it
will diverge when used with approximate DP. However, one then needs to resort
to approaches other than contraction and non-expansion to analyse its stability.
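The non-expansion property can be illustrated numerically. An averaging approximator Π f = A f, where A has non-negative rows summing to at most 1, is a non-expansion in the maximum norm, since each output component is a convex combination of the inputs. The matrix and test vectors below are illustrative:

```python
import numpy as np

# Row-stochastic averaging matrix: each row is non-negative and sums to 1,
# so Pi(f) = A_avg @ f averages the components of f.
A_avg = np.array([
    [0.5, 0.5, 0.0],
    [0.0, 0.5, 0.5],
    [0.5, 0.0, 0.5],
])

rng = np.random.default_rng(0)
f, g = rng.normal(size=3), rng.normal(size=3)

# Non-expansion in the max norm: ||Pi f - Pi g||_inf <= ||f - g||_inf.
lhs = np.max(np.abs(A_avg @ f - A_avg @ g))
rhs = np.max(np.abs(f - g))
assert lhs <= rhs
```

A general least-squares projection, by contrast, can expand distances in the maximum norm, which is precisely why linear architectures combined with value iteration lack the stability guarantee discussed above.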
9.4.2 Stability on the Structure and the Parameter Learning Level
Approximating the action-value function with LCS requires, on one hand, finding
a good set of classifiers and, on the other hand, correctly learning the
parameters of that set. In other words, we want to find a good model structure
M and the correct model parameters θ for that structure, as discussed in
Chap. 3.
Incremental learning can be performed on the structure level as well as the
parameter level (see Sect. 8.4.2). Similarly, stability of using LCS with DP can
be considered at both of these levels.
Stability on the Structure Learning Level
Divergence of DP with function approximation manifests itself in the values of
the value function estimate rapidly growing out of bounds (for example, [25]). Let us
assume that for some fixed LCS model structure, the parameter learning process
diverges when used with DP, and that there exist model structures for which
this is not the case.
Divergence of the parameters usually happens locally, that is, not for all clas-
sifiers at once. Therefore, it can be detected by monitoring the model error of
single classifiers, which, for linear classifier models as given in Chap. 5, would
be the model noise variance. Subsequently, divergent classifiers can be detected
and replaced until the model structure allows the parameter learning to be
compatible with the DP method used.
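The detect-and-replace step described above can be sketched as follows. The classifier representation, the divergence threshold, and the re-initialisation scheme are all assumptions made for illustration; the text specifies only that divergence is detected via the noise variance and the classifier is replaced:

```python
import numpy as np

DIVERGENCE_THRESHOLD = 1e6  # hypothetical cut-off, not specified in the text

def replace_divergent(classifiers):
    """Replace classifiers whose estimated model noise variance has blown up.

    Each classifier is represented here as a dict with a 'noise_var'
    estimate and a 'weights' vector; this representation is an
    assumption for illustration only.
    """
    for cl in classifiers:
        if not np.isfinite(cl["noise_var"]) or cl["noise_var"] > DIVERGENCE_THRESHOLD:
            # Re-initialise the divergent classifier in place; its matching
            # condition (not modelled here) could also be resampled.
            cl["weights"] = np.zeros_like(cl["weights"])
            cl["noise_var"] = 1.0
    return classifiers

pop = [
    {"noise_var": 0.3, "weights": np.array([0.2, -0.1])},   # healthy
    {"noise_var": 1e9, "weights": np.array([5e8, -3e8])},   # diverged
]
pop = replace_divergent(pop)
```

Monitoring a per-classifier error signal in this way localises the divergence check, so a single runaway classifier does not require discarding the whole model structure.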
XCSF uses linear classifier models and Q-Learning, but such combinations
are known to be unstable [25]. However, XCSF has never been reported to show
divergent behaviour. Thus, it is conjectured that it provides stability on the
model structure level by replacing divergent classifiers with potentially better
ones.