Various algorithms have been suggested for applying the temporal-difference methodology to problems in such areas as games, optimal planning, and combinatorial optimization. Convergence of those algorithms has been proved, and the following general framework of eligibility traces was first described in [Bertsekas et al. 1996].
In that framework, $k$ is an integer that labels the steps of the algorithm. At step $k$, an initial state $x_0$ is chosen. That choice depends on the past history of the algorithm implementation, and guarantees that each state is visited frequently enough. Then, the current policy is applied for $N$ instants, and a trajectory $\omega^k = (x_0, x_1, \ldots, x_m^k, \ldots, x_N)$ of length $N$ is generated. The associated costs are observed and the temporal differences $d_m^k$ are computed.
Then a finite sequence of positive state functions $z_m^k$ is chosen. That sequence is called an eligibility trace; it has the following properties (index $m$ labels time along the trajectory; $\delta$ is the Kronecker delta function):

$z_0^k(x) = \delta_{x_0}(x)$, and, in addition, $z_m^k(x) = 1$ when $m$ is the first visiting time of state $x$ in trajectory $\omega^k$;

$z_{m+1}^k(x) \le z_m^k(x) + \delta_{x_{m+1}^k}(x)$.
In addition, let us consider a decreasing sequence $\{\gamma_k\}$ of state functions that take their values in $[0, 1]$. That sequence is the sequence of gains, or learning rates, and it obeys the classical assumptions of stochastic approximation theory:

$$\sum_k \gamma_k(x) = \infty, \qquad \sum_k \gamma_k(x)^2 < \infty.$$
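As a concrete illustration (a standard choice in stochastic approximation theory, not specific to this text), one may take the gain of each state to be the reciprocal of its visit count:

$$\gamma_k(x) = \frac{1}{n_k(x)},$$

where $n_k(x)$ is the number of visits to state $x$ up to step $k$: the first series then diverges like the harmonic series, while the series of squares converges.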
Then, the generalized TD algorithm updates the estimation of the cost according to

$$J_{k+1}(x) = J_k(x) + \gamma_k(x) \sum_{m=0}^{N-1} z_m^k(x)\, d_m^k.$$

It converges almost surely towards $J^\pi$. Therefore, the estimation is consistent.
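As an illustration, here is a minimal tabular sketch of one step $k$ of this update. It assumes a finite state space indexed by integers, a discount factor $\alpha$ appearing in the temporal differences, and hypothetical names (`generalized_td_update` and its arguments); it is a sketch under those assumptions, not the book's implementation.

```python
import numpy as np

def generalized_td_update(J, gamma, trajectory, costs, traces, alpha=0.95):
    """One step k of the generalized TD update (tabular sketch).

    J          -- current estimate J_k, one entry per state
    gamma      -- gains gamma_k(x), one entry per state
    trajectory -- states (x_0, ..., x_N) generated by the current policy
    costs      -- observed costs c(x_m, x_{m+1}), length N
    traces     -- eligibility traces, traces[m, x] = z_m^k(x)
    alpha      -- assumed discount factor
    """
    N = len(costs)
    # Temporal differences: d_m = c_m + alpha * J(x_{m+1}) - J(x_m)
    d = np.array([costs[m] + alpha * J[trajectory[m + 1]] - J[trajectory[m]]
                  for m in range(N)])
    # J_{k+1}(x) = J_k(x) + gamma_k(x) * sum_m z_m^k(x) * d_m
    return J + gamma * (traces[:N].T @ d)

# Tiny usage example: 3 states, one trajectory of 3 transitions.
J = np.zeros(3)
gamma = np.full(3, 0.5)
trajectory = [0, 1, 2, 1]
costs = [1.0, 0.0, 2.0]
# Traces frozen at 1 after the first visit -- one valid choice
# satisfying the two properties above.
traces = np.array([[1.0, 0.0, 0.0],
                   [1.0, 1.0, 0.0],
                   [1.0, 1.0, 1.0]])
J = generalized_td_update(J, gamma, trajectory, costs, traces)
```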
For instance, the TD($\lambda$) algorithm, which was described previously, is a particular eligibility trace algorithm in which the trace decreases at the constant multiplicative rate $\alpha\lambda$.
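A minimal sketch of that particular trace, in the same tabular setting as above (using the replacing-trace variant, and the hypothetical name `td_lambda_traces`), could be:

```python
import numpy as np

def td_lambda_traces(trajectory, num_states, alpha=0.95, lam=0.9):
    """TD(lambda) eligibility traces: between visits, the trace of every
    state decays at the constant multiplicative rate alpha * lambda."""
    N = len(trajectory) - 1
    z = np.zeros((N, num_states))
    z[0, trajectory[0]] = 1.0          # z_0^k(x) = delta_{x_0}(x)
    for m in range(1, N):
        z[m] = alpha * lam * z[m - 1]  # decay at rate alpha * lambda
        z[m, trajectory[m]] = 1.0      # reset to 1 on a visit (replacing trace)
    return z

# These traces can be passed directly to generalized_td_update above.
traces = td_lambda_traces([0, 1, 2, 1], num_states=3)
```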
5.4.2.3 Back to the Actor-Critic Methodology and Optimistic Policy Iteration
In many real-life problems, the assessment of a given policy may be a question per se. The previous algorithms are designed for that purpose. However, we introduced the evaluation of a policy as a step in the computation loop aimed at determining an optimal policy.