advantage of constant step sizes is that we need not save the step counter $k$, so we only need to update our estimate $X_k$.
In practice, either the sample average or a constant step size is typically used, and the importance of choosing the step-size parameter $\alpha_k$ correctly is often greatly underestimated. Without going into further detail here, we stress once more the theoretical and practical importance of this aspect for achieving quick convergence of the process.
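To make the difference concrete, here is a minimal Python sketch (not taken from the text; the function names and the noisy signal are illustrative) contrasting the sample-average step size $1/k$, which requires storing the counter $k$, with a constant step size $\alpha$, which does not:

```python
import random

def update_sample_average(estimate, sample, k):
    """Sample-average update: needs the step counter k (alpha_k = 1/k)."""
    return estimate + (1.0 / k) * (sample - estimate)

def update_constant_alpha(estimate, sample, alpha=0.1):
    """Constant step size: no counter needed; recent samples weigh more."""
    return estimate + alpha * (sample - estimate)

# Estimate the mean of a noisy signal with both variants.
x_avg, x_const = 0.0, 0.0
for k in range(1, 1001):
    sample = random.gauss(5.0, 1.0)
    x_avg = update_sample_average(x_avg, sample, k)
    x_const = update_constant_alpha(x_const, sample)

print(x_avg, x_const)  # both settle near 5.0; x_const would also track a drifting mean
```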
Update equations of the form (3.8) thus enable us to compute the transition probabilities $p_{ss'}$ and rewards $r_{ss'}$ incrementally. But how can we perform the policy iteration adaptively? It would be extremely computationally intensive to determine it from scratch in every update step. The inherently iterative approach in fact permits a derived adaptive variant, asynchronous dynamic programming (ADP). Here the sequence of policies and action-value functions is computed almost in step with the real-time interaction, coupled with additional internal updates in order to ensure convergence.
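As a rough illustration of the asynchronous idea (a sketch under assumptions, not the book's exact algorithm: it presumes a small tabular model given by hypothetical dictionaries P and R), only the currently visited state is backed up in place instead of sweeping the whole state space in every step:

```python
GAMMA = 0.9  # discount factor, chosen for illustration

def async_backup(state, Q, P, R):
    """One in-place backup of a single state's action values.

    Q[s][a] : current action-value estimates
    P[s][a] : list of (probability, next_state) pairs (assumed model)
    R[s][a] : expected immediate reward (assumed model)
    """
    for a in P[state]:
        Q[state][a] = R[state][a] + GAMMA * sum(
            prob * max(Q[s_next].values()) for prob, s_next in P[state][a]
        )

# During real-time interaction one would call async_backup(current_state, Q, P, R)
# after each transition; over time every state is updated often enough.
```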
This is also interesting in that it enables the explorative mode to be designed in a much more sophisticated way than just using the simple ε-greedy and softmax policies. In explorative mode, therefore, those actions are selected that can most quickly reduce the statistical uncertainty in our system, thus ensuring the most rapid convergence. Selecting the correct actions is generally one of the most interesting subject areas in RL.
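For reference, a compact sketch of the two simple policies mentioned above (illustrative helper names; q_values is assumed to be a list of action-value estimates for the current state):

```python
import math
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def softmax_policy(q_values, temperature=1.0):
    """Sample an action with probability proportional to exp(q / temperature)."""
    prefs = [math.exp(q / temperature) for q in q_values]
    total = sum(prefs)
    threshold = random.random() * total
    cumulative = 0.0
    for action, p in enumerate(prefs):
        cumulative += p
        if threshold <= cumulative:
            return action
    return len(q_values) - 1  # numerical fallback
```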
3.8 The Model-Free Approach
The determination of the transition probabilities $p_{ss'}$ and rewards $r_{ss'}$ is often computationally and memory intensive, especially for large state and action spaces $S$ and $A$. This raises the question of whether the Bellman equation (3.4) can also be solved without a model of the environment. In fact, this is possible, and the corresponding method is referred to as model-free. In the following, we shall briefly present the most important model-free algorithm, temporal-difference learning (TD), developed by Sutton.
The model-free approach is based on learning by iterative adaptation of the
action-value function q(s, a). We begin with the simple TD(0) method. At every
step t of the episode, the update is performed as follows:
$$q(s_t, a_t) := q(s_t, a_t) + \alpha_t \big( r_{t+1} + \gamma\, q(s_{t+1}, a_{t+1}) - q(s_t, a_t) \big) \qquad (3.10)$$
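A one-to-one reading of update (3.10) in Python might look as follows (a sketch, assuming a tabular action-value function stored in a dictionary Q keyed by (state, action) pairs, and that the next action has already been chosen by the current policy):

```python
def td0_update(Q, s_t, a_t, r_next, s_next, a_next, alpha=0.1, gamma=0.9):
    """Apply update (3.10) to the tabular action-value estimate Q[(s, a)]."""
    target = r_next + gamma * Q[(s_next, a_next)]      # r_{t+1} + gamma * q(s_{t+1}, a_{t+1})
    Q[(s_t, a_t)] += alpha * (target - Q[(s_t, a_t)])  # move the estimate toward the target
```

Called once per time step of the episode, this nudges $q(s_t, a_t)$ a little toward the TD target.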
Obviously, the equation is of the same form as the update (3.8), where $\tilde{q}(s_t, a_t) := r_{t+1} + \gamma\, q(s_{t+1}, a_{t+1})$ is the target variable to be estimated. Please note that here, $t$ acts as an index for the state-action pair $(s_t, a_t)$ rather than the update step $k$ for the value $q(s_t, a_t)$. Strictly speaking, we must write

$$q_{k+1}(s_t, a_t) = q_k(s_t, a_t) + \alpha \big( r_{t+1} + \gamma\, q_k(s_{t+1}, a_{t+1}) - q_k(s_t, a_t) \big).$$
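The distinction between the pair index $t$ and the update step $k$ can be made explicit by keeping one update counter per state-action pair, as in this hypothetical sketch (one possible realization, here combined with a sample-average step size $\alpha_k = 1/k$):

```python
from collections import defaultdict

Q = defaultdict(float)  # action-value estimates q_k(s, a), keyed by (s, a)
N = defaultdict(int)    # update counter k, maintained separately for each pair

def td0_update_sample_average(s_t, a_t, r_next, s_next, a_next, gamma=0.9):
    """k counts how often the pair (s_t, a_t) itself has been updated,
    independently of the time step t of the episode."""
    N[(s_t, a_t)] += 1
    alpha_k = 1.0 / N[(s_t, a_t)]
    target = r_next + gamma * Q[(s_next, a_next)]
    Q[(s_t, a_t)] += alpha_k * (target - Q[(s_t, a_t)])
```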