9.4  Stability of RL with LCS
An additional challenge when using LCS for sequential decision tasks is that
some combinations of DP and function approximation can lead to instabilities
and divergence, and the same applies to RL. In fact, as RL is based on DP,
convergence of RL with value function approximation is commonly analysed by showing that the underlying DP method is stable when used with the same function approximation method, and that the difference between the RL and the DP methods asymptotically converges to zero (for example, [215, 17, 16, 129]).
In this section we investigate whether the approximation architecture of our LCS model type is stable when used with DP. While value iteration is certainly the most critical method, as Q-Learning is based on it, the use of LCS with policy iteration is also discussed. No conclusive answers are provided, but initial results are presented that can lead to such answers.
Firstly, let us have a closer look at the compatibility of various function approximation architectures with value iteration and policy iteration, followed by a short discussion of the issue of stability when learning the model parameters and the model structure of the LCS model. This is followed by a more detailed analysis of the compatibility of the LCS model structure with value and policy iteration, which acts as the basis of further investigations of RL with LCS. Note that in the following, a method that is known not to diverge is automatically guaranteed to converge. Thus, showing stability of RL with LCS implies that this combination converges.
9.4.1  Stability of Approximate Dynamic Programming
Approximate value iteration (9.13) is based on the composed operator ΠT, where T is a nonlinear operator. As shown by Boyan and Moore [25], this procedure might diverge when used with even the most common approximation architectures, such as linear or quadratic regression, local weighted regression, or neural networks. Gordon [96] has shown that stability is guaranteed if the approximation Π is - just like T - a non-expansion with respect to the maximum norm, that is, if for any two V, V′ we can guarantee

∥ΠV − ΠV′∥∞ ≤ ∥V − V′∥∞.

This is due to the fact that combining a contraction and a non-expansion with respect to the same norm results in a contraction, since ∥ΠTV − ΠTV′∥∞ ≤ ∥TV − TV′∥∞ ≤ γ∥V − V′∥∞, where γ < 1 is the contraction modulus of T. The class of averagers satisfies this requirement, and contains “[ ... ] local weighted averaging, k-nearest neighbour, Bezier patches, linear interpolation, bilinear interpolation on a square (or cubical, etc.) mesh, as well as simpler methods like grids and other state aggregations.” [96].
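To make the averager argument concrete, the following sketch iterates V ← ΠTV on a small, randomly generated MDP, with Π chosen as a state-aggregation averager. It is an illustration only, not taken from [96] or from the LCS model discussed here; the MDP, the aggregation into clusters, and all parameter values are assumptions. Because Π is a non-expansion with respect to the maximum norm and T is a γ-contraction, the combined update converges to the unique fixed point of ΠT.

import numpy as np

# Minimal sketch: approximate value iteration V <- Pi T V with a state-aggregation
# averager Pi. The MDP below is randomly generated and purely illustrative.
rng = np.random.default_rng(0)
n_states, n_actions, gamma = 12, 3, 0.9

# Transition probabilities P[a, s, s'] and rewards R[a, s] of the assumed MDP.
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((n_actions, n_states))

def bellman_T(V):
    # Optimal Bellman operator: (T V)(s) = max_a ( R(a, s) + gamma * sum_s' P(s'|s, a) V(s') ).
    return np.max(R + gamma * P @ V, axis=0)

# State aggregation: each state belongs to one of 4 clusters, and Pi replaces each
# state's value by the mean value of its cluster. Such an averager is a
# non-expansion with respect to the maximum norm.
clusters = rng.integers(0, 4, size=n_states)
def project_Pi(V):
    return np.array([V[clusters == clusters[s]].mean() for s in range(n_states)])

V = np.zeros(n_states)
for i in range(1000):
    V_new = project_Pi(bellman_T(V))  # approximate value iteration step
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new
print(f"reached a fixed point of Pi T after {i} iterations")

Swapping the averager for an unconstrained regression fit removes the non-expansion property, and, as reported by Boyan and Moore [25], the same iteration may then diverge.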
Due to the linearity of T_μ, approximate policy iteration has fewer stability problems than approximate value iteration. Just like T, T_μ is a contraction with respect to the maximum norm, and is thus guaranteed to be stable if used in combination with an approximation Π that is a non-expansion with respect to the same norm. Also, note that T^(λ)_μ forms a contraction mapping with respect to ∥·∥_D, and T_μ = T^(0)_μ. Thus, another stable option is for Π to be a non-expansion with respect to ∥·∥_D rather than ∥·∥∞. This property was used to show that approximate policy iteration is guaranteed to converge, as long as the states are sampled according to the steady-state distribution of the Markov chain induced by the policy μ.
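The weighted-norm argument can be illustrated in the same spirit: the sketch below performs approximate policy evaluation V ← ΠT_μ V for a fixed policy on a randomly generated Markov chain, with linear features Φ and the projection Π = Φ(ΦᵀDΦ)⁻¹ΦᵀD, where D is the diagonal matrix of the chain's steady-state distribution. Again, the chain, the features, and all parameters are assumptions made only for this illustration; since this Π is a non-expansion with respect to ∥·∥_D, the iteration converges.

import numpy as np

# Minimal sketch: approximate policy evaluation V <- Pi T_mu V, where Pi is the
# least-squares projection weighted by the steady-state distribution of the
# policy's Markov chain. The chain and the features are randomly generated.
rng = np.random.default_rng(1)
n_states, n_features, gamma = 20, 5, 0.9

# Markov chain P_mu and rewards r_mu induced by some fixed policy mu (assumed).
P_mu = rng.random((n_states, n_states))
P_mu /= P_mu.sum(axis=1, keepdims=True)
r_mu = rng.random(n_states)

# Steady-state distribution d: left eigenvector of P_mu for eigenvalue 1.
evals, evecs = np.linalg.eig(P_mu.T)
d = np.abs(np.real(evecs[:, np.argmax(np.real(evals))]))
d /= d.sum()
D = np.diag(d)

# Linear approximation V ~ Phi w and the D-weighted projection
# Pi = Phi (Phi^T D Phi)^{-1} Phi^T D, a non-expansion with respect to ||.||_D.
Phi = rng.random((n_states, n_features))
Pi = Phi @ np.linalg.solve(Phi.T @ D @ Phi, Phi.T @ D)

def T_mu(V):
    # Policy evaluation operator: T_mu V = r_mu + gamma * P_mu V.
    return r_mu + gamma * P_mu @ V

V = np.zeros(n_states)
for i in range(500):
    V_new = Pi @ T_mu(V)  # approximate policy evaluation step
    if np.sqrt(d @ (V_new - V) ** 2) < 1e-12:  # change measured in ||.||_D
        break
    V = V_new
print(f"reached the fixed point of Pi T_mu after {i} iterations")

If D is replaced by a weighting that does not match how the states are actually visited under μ (off-policy sampling), the non-expansion guarantee is lost and the same iteration can diverge.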
 