current V^{μ_t}, such that T_{μ_{t+1}} V^{μ_t} = T V^{μ_t}. Starting with an initially random policy μ_0, the sequence of policies {μ_0, μ_1, ...} generated by iterating policy evaluation and policy improvement is guaranteed to converge to the optimal policy within a finite number of iterations [17].
Various variants of these methods exist, such as asynchronous value iteration, which at each application of T only updates a single state of V. Modified policy iteration performs the policy evaluation step by approximating V^μ by T^n_μ V for some small n. Asynchronous policy iteration mixes asynchronous value iteration with policy iteration by, at each step, either i) updating some states of V by asynchronous value iteration, or ii) improving the policy for some set of states by policy improvement. Convergence criteria for these variants are given by Bertsekas and Tsitsiklis [17].
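To make the above scheme concrete, the following minimal sketch runs exact policy iteration on a small finite MDP. It is an illustrative example rather than part of the original text: the transition matrices P (one per action), reward vectors R (one per action), and discount factor gamma are assumed inputs.

```python
import numpy as np

def policy_iteration(P, R, gamma, max_iter=100):
    """Exact policy iteration; P[a] is the N x N transition matrix and
    R[a] the length-N reward vector for action a (illustrative inputs)."""
    n_actions, n_states = len(P), P[0].shape[0]
    mu = np.random.randint(n_actions, size=n_states)      # random initial policy mu_0
    for _ in range(max_iter):
        # Policy evaluation: solve V = R_mu + gamma * P_mu @ V exactly.
        P_mu = np.array([P[mu[s]][s] for s in range(n_states)])
        R_mu = np.array([R[mu[s]][s] for s in range(n_states)])
        V = np.linalg.solve(np.eye(n_states) - gamma * P_mu, R_mu)
        # Policy improvement: mu'(s) = argmax_a (R[a] + gamma * P[a] V)(s).
        Q = np.array([R[a] + gamma * P[a] @ V for a in range(n_actions)])
        mu_new = Q.argmax(axis=0)
        if np.array_equal(mu_new, mu):                     # no change: policy is optimal
            break
        mu = mu_new
    return mu, V
```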
9.2.3 Approximate Dynamic Programming
If N is large, we prefer to approximate the value function rather than representing the value for each state explicitly. Let Ṽ denote the vector that holds the value function approximations for each state, as generated by a function approximation technique as an approximation to V. Approximate value iteration is performed by approximating the value iteration update V_{t+1} = T V_t by

Ṽ_{t+1} = Π T Ṽ_t,    (9.13)

where Π is the approximation operator that, for the used function approximation technique, returns the value function approximation Ṽ_{t+1} that is closest to V_{t+1} = T Ṽ_t, that is, Ṽ_{t+1} = argmin_{Ṽ'} ‖Ṽ' − V_{t+1}‖. Even though conceptually simple, approximate value iteration was shown to diverge even when used in combination with the simplest function approximation techniques [25]. Thus, special care needs to be taken when applying this method, as will be discussed in more detail in Sect. 9.4.
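To illustrate update (9.13), the following sketch performs approximate value iteration with a linear approximation Ṽ = Φw, realising Π as the least-squares projection onto the span of a feature matrix Φ; the inputs, the feature matrix, and the choice of projection are assumptions made for this example.

```python
import numpy as np

def approximate_value_iteration(P, R, Phi, gamma, n_iter=50):
    """Sketch of V~_{t+1} = Pi T V~_t with V~ = Phi @ w and Pi realised as a
    least-squares projection; P, R, Phi, gamma are illustrative inputs."""
    n_actions = len(P)
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        V_tilde = Phi @ w                                  # current approximation V~_t
        # Apply the Bellman operator T to V~_t on all states.
        TV = np.max([R[a] + gamma * P[a] @ V_tilde for a in range(n_actions)], axis=0)
        # Pi: project T V~_t back onto the span of Phi (may diverge, cf. [25]).
        w, *_ = np.linalg.lstsq(Phi, TV, rcond=None)
    return Phi @ w
```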
Approximate policy iteration, on the other hand, has fewer stability problems, as the operator T_μ used for the policy evaluation step is linear. While the policy improvement step is performed as for standard policy iteration, the policy evaluation step is based on an approximation of V^μ. As T_μ is linear, there are several possibilities for how to perform the approximation, which are outlined by Schoknecht [193]. Here, the only approximation that will be considered is the one most similar to approximate value iteration: the temporal-difference solution, which aims at finding the fixed point Ṽ^μ = Π T_μ Ṽ^μ by the update Ṽ^μ_{t+1} = Π T_μ Ṽ^μ_t [194, 195].
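Under the same assumptions as before (a linear approximation Ṽ^μ = Φw and least-squares projection for Π), a sketch of the projected policy evaluation update Ṽ^μ_{t+1} = Π T_μ Ṽ^μ_t could look as follows; P_μ and R_μ, the transition matrix and reward vector under the fixed policy μ, are taken as given.

```python
import numpy as np

def projected_policy_evaluation(P_mu, R_mu, Phi, gamma, n_iter=200):
    """Iterates V~_{t+1} = Pi T_mu V~_t for a fixed policy mu; P_mu, R_mu
    and the feature matrix Phi are illustrative inputs."""
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        V_tilde = Phi @ w                                # current estimate V~_t
        TV = R_mu + gamma * P_mu @ V_tilde               # T_mu V~_t (linear operator)
        w, *_ = np.linalg.lstsq(Phi, TV, rcond=None)     # projection Pi onto span(Phi)
    return Phi @ w                                       # approximates the TD fixed point
```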
9.2.4 Temporal-Difference Learning
Even though temporal-difference (TD) learning is an incremental method for policy evaluation that was initially developed by Sutton [207] as a modification of the Widrow-Hoff rule [234], we here only concentrate on the TD(λ) operator T_μ^(λ), as it forms the basis of SARSA(λ) and gives us some necessary information
 