current V^{μ_t}, such that T_{μ_{t+1}} V^{μ_t} = T V^{μ_t}. Starting with an initially random policy μ_0, the sequence of policies {μ_0, μ_1, ...} generated by iterating policy evaluation and policy improvement is guaranteed to converge to the optimal policy within a finite number of iterations [17].
Various variants of these methods exist, such as asynchronous value iteration, which at each application of T only updates a single state of V. Modified policy iteration performs the policy evaluation step by approximating V^μ by T^n_μ V for some small n. Asynchronous policy iteration mixes asynchronous value iteration with policy iteration by, at each step, either i) updating some states of V by asynchronous value iteration, or ii) improving the policy for some set of states by policy improvement. Convergence criteria for these variants are given by Bertsekas and Tsitsiklis [17].
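To make the above scheme concrete, the following minimal sketch runs exact policy iteration on a small finite MDP. It is an illustrative example rather than part of the original text: the transition matrices P (one per action), reward vectors R (one per action), and discount factor gamma are assumed inputs.

```python
import numpy as np

def policy_iteration(P, R, gamma, max_iter=100):
    """Exact policy iteration; P[a] is the N x N transition matrix and
    R[a] the length-N reward vector for action a (illustrative inputs)."""
    n_actions, n_states = len(P), P[0].shape[0]
    mu = np.random.randint(n_actions, size=n_states)      # random initial policy mu_0
    for _ in range(max_iter):
        # Policy evaluation: solve V = R_mu + gamma * P_mu @ V exactly.
        P_mu = np.array([P[mu[s]][s] for s in range(n_states)])
        R_mu = np.array([R[mu[s]][s] for s in range(n_states)])
        V = np.linalg.solve(np.eye(n_states) - gamma * P_mu, R_mu)
        # Policy improvement: mu'(s) = argmax_a (R[a] + gamma * P[a] V)(s).
        Q = np.array([R[a] + gamma * P[a] @ V for a in range(n_actions)])
        mu_new = Q.argmax(axis=0)
        if np.array_equal(mu_new, mu):                     # no change: policy is optimal
            break
        mu = mu_new
    return mu, V
```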
9.2.3 Approximate Dynamic Programming
If N is large, we prefer to approximate the value function rather than representing the value for each state explicitly. Let Ṽ denote the vector that holds the value function approximations for each state, as generated by a function approximation technique as an approximation to V. Approximate value iteration is performed by approximating the value iteration update V_{t+1} = T V_t by

Ṽ_{t+1} = Π T Ṽ_t,    (9.13)

where Π is the approximation operator that, for the used function approximation technique, returns the value function approximation Ṽ_{t+1} that is closest to V_{t+1} = T Ṽ_t, that is, Ṽ_{t+1} = argmin_{Ṽ'} ‖Ṽ' − V_{t+1}‖. Even though conceptually simple, approximate value iteration was shown to diverge even when used in combination with the simplest function approximation techniques [25]. Thus, special care needs to be taken when applying this method, as will be discussed in more detail in Sect. 9.4.
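To illustrate update (9.13), the following sketch performs approximate value iteration with a linear approximation Ṽ = Φw, realising Π as the least-squares projection onto the span of a feature matrix Φ; the inputs, the feature matrix, and the choice of projection are assumptions made for this example.

```python
import numpy as np

def approximate_value_iteration(P, R, Phi, gamma, n_iter=50):
    """Sketch of V~_{t+1} = Pi T V~_t with V~ = Phi @ w and Pi realised as a
    least-squares projection; P, R, Phi, gamma are illustrative inputs."""
    n_actions = len(P)
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        V_tilde = Phi @ w                                  # current approximation V~_t
        # Apply the Bellman operator T to V~_t on all states.
        TV = np.max([R[a] + gamma * P[a] @ V_tilde for a in range(n_actions)], axis=0)
        # Pi: project T V~_t back onto the span of Phi (may diverge, cf. [25]).
        w, *_ = np.linalg.lstsq(Phi, TV, rcond=None)
    return Phi @ w
```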
Approximate policy iteration, on the other hand, has fewer stability problems, as the operator T_μ used for the policy evaluation step is linear. While the policy improvement step is performed as for standard policy iteration, the policy evaluation step is based on an approximation of V^μ. As T_μ is linear, there are several possibilities for how to perform the approximation, which are outlined by Schoknecht [193]. Here, the only approximation that will be considered is the one most similar to approximate value iteration: the temporal-difference solution, which aims at finding the fixed point Ṽ^μ = Π T_μ Ṽ^μ by the update Ṽ^μ_{t+1} = Π T_μ Ṽ^μ_t [194, 195].
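Under the same assumptions as before (a linear approximation Ṽ^μ = Φw and least-squares projection for Π), a sketch of the projected policy evaluation update Ṽ^μ_{t+1} = Π T_μ Ṽ^μ_t could look as follows; P_μ and R_μ, the transition matrix and reward vector under the fixed policy μ, are taken as given.

```python
import numpy as np

def projected_policy_evaluation(P_mu, R_mu, Phi, gamma, n_iter=200):
    """Iterates V~_{t+1} = Pi T_mu V~_t for a fixed policy mu; P_mu, R_mu
    and the feature matrix Phi are illustrative inputs."""
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        V_tilde = Phi @ w                                # current estimate V~_t
        TV = R_mu + gamma * P_mu @ V_tilde               # T_mu V~_t (linear operator)
        w, *_ = np.linalg.lstsq(Phi, TV, rcond=None)     # projection Pi onto span(Phi)
    return Phi @ w                                       # approximates the TD fixed point
```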
9.2.4 Temporal-Difference Learning
Even though temporal-difference (TD) learning is an incremental method for policy evaluation that was initially developed by Sutton [207] as a modification of the Widrow-Hoff rule [234], we here only concentrate on the TD(λ) operator T_μ^(λ), as it forms the basis of SARSA(λ) and gives us some necessary information
 