Then the estimation of J^π is updated for each state x_k of the trajectory by the formula

J^π_+(x_k) = J^π(x_k) + γ [ d_k + α d_{k+1} + ··· + α^{N−k−1} d_{N−1} ],

for all k ∈ {0, ..., N−1}. Note that the incremental implementation of that update amounts to the batch rule if the updates are performed backwards in time, as for the backpropagation of the error signal.
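The following is a minimal sketch of that update in Python, under the assumption that the visited states x_0, ..., x_{N−1} and the corresponding temporal differences d_0, ..., d_{N−1} (defined earlier in the chapter) have already been recorded along the trajectory; the function names and the dictionary representation of the estimate J^π are purely illustrative. The second function performs a single backward sweep with a running sum and produces the same result as the batch rule.

def n_step_batch_update(J, states, d, gamma, alpha):
    # Batch rule: J(x_k) <- J(x_k) + gamma * (d_k + alpha*d_{k+1} + ... + alpha^(N-k-1)*d_{N-1})
    N = len(states)
    for k in range(N):
        correction = sum(alpha ** (j - k) * d[j] for j in range(k, N))
        J[states[k]] += gamma * correction
    return J

def n_step_backward_update(J, states, d, gamma, alpha):
    # Incremental form: sweeping the trajectory backwards, the running sum
    # acc equals d_k + alpha*d_{k+1} + ..., so one pass reproduces the batch rule.
    acc = 0.0
    for k in reversed(range(len(states))):
        acc = d[k] + alpha * acc
        J[states[k]] += gamma * acc
    return J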
5.4.2.2 TD(λ) Algorithm and Eligibility Trace Method
In the previous section, we presented algorithms that, in order to update the cost function at a state x, take into account either the immediate transition from x or the subsequent transitions over a given horizon N. All those algorithms converge. However, their convergence speed depends on the way the information provided by the trajectory is taken into account. Indeed, it may seem inappropriate to update J^π(x) by assigning the same weight to the contribution of the immediate transition and to the contributions of remote future transitions, which are less likely to be observed. It has been proposed, in a seminal paper on reinforcement learning [Barto et al. 1983], to weight the information brought by future transitions with a discount rate λ ∈ [0, 1]. That suggests the following updating algorithm, which is called TD(λ):
J^π_+(x_k) = J^π(x_k) + γ [ d_k + αλ d_{k+1} + ··· + (αλ)^{N−k−1} d_{N−1} ],

for all k ∈ {1, ..., N−1}. Historically, the discount rate λ was first introduced to deal with finite-horizon problems or stochastic shortest path problems, in which the cost is not discounted by a rate α. In that context, the introduction of λ was a new idea; in infinite-horizon problems with discounted costs, it is simply equivalent to a change of the discount rate.
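Under the same illustrative conventions as in the previous sketch, the TD(λ) update can be written as a single backward sweep in which future temporal differences are weighted by powers of αλ instead of α; with λ = 1 it reduces to the previous rule, and with λ = 0 only the immediate temporal difference d_k is used.

def td_lambda_update(J, states, d, gamma, alpha, lam):
    # Backward sweep: acc = d_k + (alpha*lam)*d_{k+1} + (alpha*lam)^2*d_{k+2} + ...
    acc = 0.0
    for k in reversed(range(len(states))):
        acc = d[k] + alpha * lam * acc
        J[states[k]] += gamma * acc
    return J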
The convergence of the TD(λ) algorithms is guaranteed under the usual hypotheses of stochastic approximation [Sutton 1988]. In particular, all the states should be visited infinitely often, i.e., in practice, sufficiently frequently. That is especially important for the states that are relevant for the optimal policy; yet those states cannot be determined at the initialization of the algorithm. In the following sections, we will emphasize the role of the “exploration policy” in reinforcement learning algorithms. If a simulator is used to provide the information, one has to make sure that the hypothesis is approximately satisfied by randomly re-initializing the trajectory from time to time, thereby preventing it from being confined to a specific region of the state space. When a real experiment is performed, one has to make sure that the feasible states are sufficiently explored given the experimental constraints, and to settle a tradeoff between the need for thorough exploration and the cost of the experiment. Otherwise, the algorithm may converge towards local minima of the cost function that are not optimal.
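As a purely hypothetical illustration of that exploration requirement, the loop below re-initializes the simulated trajectory from a randomly chosen feasible state from time to time; the simulator interface (reset and step methods), the restart probability, and the set of feasible states are all assumptions made for the sake of the example.

import random

def simulate_with_random_restarts(simulator, policy, feasible_states,
                                  n_steps, restart_prob=0.05):
    # Occasional random restarts keep the trajectory from being confined
    # to one region of the state space, so that the feasible states are
    # visited sufficiently frequently.
    x = simulator.reset(random.choice(feasible_states))
    visited = [x]
    for _ in range(n_steps):
        if random.random() < restart_prob:
            x = simulator.reset(random.choice(feasible_states))
        x = simulator.step(policy(x))
        visited.append(x)
    return visited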