9.4  Stability of RL with LCS
An additional challenge when using LCS for sequential decision tasks is that
some combinations of DP and function approximation can lead to instabilities
and divergence, and the same applies to RL. In fact, as RL is based on DP,
convergence of RL with value function approximation is commonly analysed by showing that the underlying DP method is stable when used with the same function approximation method, and that the difference between the RL and the DP methods asymptotically converges to zero (for example, [215, 17, 16, 129]).
In this section we investigate whether the approximation architecture of our LCS model type is stable when used with DP. While value iteration is certainly the most critical method, as Q-Learning is based on it, the use of LCS with policy iteration is also discussed. No conclusive answers are provided, but initial results are presented that can lead to such answers.
Firstly, let us have a closer look at the compatibility of various function approximation architectures with value iteration and policy iteration, followed by a short discussion of the issue of stability when learning the model parameters and the model structure of the LCS model. This is followed by a more detailed analysis of the compatibility of the LCS model structure with value and policy iteration, which acts as the basis of further investigations of RL with LCS. Note that in the following, a method that is known not to diverge is automatically guaranteed to converge. Thus, showing stability of RL with LCS implies that this combination converges.
9.4.1  Stability of Approximate Dynamic Programming
Approximate value iteration (9.13) is based on the composed operator ΠT, where T is a nonlinear operator. As shown by Boyan and Moore [25], this procedure might diverge when used with even the most common approximation architectures, such as linear or quadratic regression, local weighted regression, or neural networks. Gordon [96] has shown that stability is guaranteed if the approximation Π is - just like T - a non-expansion with respect to the maximum norm, that is, if for any two V, V′ we can guarantee

∥ΠV − ΠV′∥∞ ≤ ∥V − V′∥∞.

This is due to the fact that combining a contraction and a non-expansion with respect to the same norm results in a contraction, since ∥ΠTV − ΠTV′∥∞ ≤ ∥TV − TV′∥∞ ≤ γ∥V − V′∥∞, where γ < 1 is the contraction modulus of T. The class of averagers satisfies this requirement, and contains “[ ... ] local weighted averaging, k-nearest neighbour, Bezier patches, linear interpolation, bilinear interpolation on a square (or cubical, etc.) mesh, as well as simpler methods like grids and other state aggregations.” [96].
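To make the averager argument concrete, the following sketch iterates V ← ΠTV on a small, randomly generated MDP, with Π chosen as a state-aggregation averager. It is an illustration only, not taken from [96] or from the LCS model discussed here; the MDP, the aggregation into clusters, and all parameter values are assumptions. Because Π is a non-expansion with respect to the maximum norm and T is a γ-contraction, the combined update converges to the unique fixed point of ΠT.

import numpy as np

# Minimal sketch: approximate value iteration V <- Pi T V with a state-aggregation
# averager Pi. The MDP below is randomly generated and purely illustrative.
rng = np.random.default_rng(0)
n_states, n_actions, gamma = 12, 3, 0.9

# Transition probabilities P[a, s, s'] and rewards R[a, s] of the assumed MDP.
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((n_actions, n_states))

def bellman_T(V):
    # Optimal Bellman operator: (T V)(s) = max_a ( R(a, s) + gamma * sum_s' P(s'|s, a) V(s') ).
    return np.max(R + gamma * P @ V, axis=0)

# State aggregation: each state belongs to one of 4 clusters, and Pi replaces each
# state's value by the mean value of its cluster. Such an averager is a
# non-expansion with respect to the maximum norm.
clusters = rng.integers(0, 4, size=n_states)
def project_Pi(V):
    return np.array([V[clusters == clusters[s]].mean() for s in range(n_states)])

V = np.zeros(n_states)
for i in range(1000):
    V_new = project_Pi(bellman_T(V))  # approximate value iteration step
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new
print(f"reached a fixed point of Pi T after {i} iterations")

Swapping the averager for an unconstrained regression fit removes the non-expansion property, and, as reported by Boyan and Moore [25], the same iteration may then diverge.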
Due to the linearity of T_μ, approximate policy iteration has fewer stability problems than approximate value iteration. Just like T, T_μ is a contraction with respect to the maximum norm, and is thus guaranteed to be stable if used in combination with an approximation Π that is a non-expansion with respect to the same norm. Also, note that T^(λ)_μ forms a contraction mapping with respect to ∥·∥_D, and T_μ = T^(0)_μ. Thus, another stable option is for Π to be a non-expansion with respect to ∥·∥_D rather than ∥·∥∞. This property was used to show that approximate policy iteration is guaranteed to converge, as long as the states are sampled according to the steady-state distribution of the Markov chain induced by the policy μ.
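The weighted-norm argument can be illustrated in the same spirit: the sketch below performs approximate policy evaluation V ← ΠT_μ V for a fixed policy on a randomly generated Markov chain, with linear features Φ and the projection Π = Φ(ΦᵀDΦ)⁻¹ΦᵀD, where D is the diagonal matrix of the chain's steady-state distribution. Again, the chain, the features, and all parameters are assumptions made only for this illustration; since this Π is a non-expansion with respect to ∥·∥_D, the iteration converges.

import numpy as np

# Minimal sketch: approximate policy evaluation V <- Pi T_mu V, where Pi is the
# least-squares projection weighted by the steady-state distribution of the
# policy's Markov chain. The chain and the features are randomly generated.
rng = np.random.default_rng(1)
n_states, n_features, gamma = 20, 5, 0.9

# Markov chain P_mu and rewards r_mu induced by some fixed policy mu (assumed).
P_mu = rng.random((n_states, n_states))
P_mu /= P_mu.sum(axis=1, keepdims=True)
r_mu = rng.random(n_states)

# Steady-state distribution d: left eigenvector of P_mu for eigenvalue 1.
evals, evecs = np.linalg.eig(P_mu.T)
d = np.abs(np.real(evecs[:, np.argmax(np.real(evals))]))
d /= d.sum()
D = np.diag(d)

# Linear approximation V ~ Phi w and the D-weighted projection
# Pi = Phi (Phi^T D Phi)^{-1} Phi^T D, a non-expansion with respect to ||.||_D.
Phi = rng.random((n_states, n_features))
Pi = Phi @ np.linalg.solve(Phi.T @ D @ Phi, Phi.T @ D)

def T_mu(V):
    # Policy evaluation operator: T_mu V = r_mu + gamma * P_mu V.
    return r_mu + gamma * P_mu @ V

V = np.zeros(n_states)
for i in range(500):
    V_new = Pi @ T_mu(V)  # approximate policy evaluation step
    if np.sqrt(d @ (V_new - V) ** 2) < 1e-12:  # change measured in ||.||_D
        break
    V = V_new
print(f"reached the fixed point of Pi T_mu after {i} iterations")

If D is replaced by a weighting that does not match how the states are actually visited under μ (off-policy sampling), the non-expansion guarantee is lost and the same iteration can diverge.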
 