advantage of constant step sizes is that we need not save the step counter $k$, so we only need to update our estimate $X_k$.
In practice, either the sample average or a constant step size is typically used, and the importance of choosing the step-size parameter $\alpha_k$ correctly is often greatly underestimated. Without going into further detail here, we stress once more the theoretical and practical importance of this aspect for achieving quick convergence of the process.
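To make the difference concrete, here is a minimal Python sketch (not taken from the text; the function names and the noisy signal are illustrative) contrasting the sample-average step size $1/k$, which requires storing the counter $k$, with a constant step size $\alpha$, which does not:

```python
import random

def update_sample_average(estimate, sample, k):
    """Sample-average update: needs the step counter k (alpha_k = 1/k)."""
    return estimate + (1.0 / k) * (sample - estimate)

def update_constant_alpha(estimate, sample, alpha=0.1):
    """Constant step size: no counter needed; recent samples weigh more."""
    return estimate + alpha * (sample - estimate)

# Estimate the mean of a noisy signal with both variants.
x_avg, x_const = 0.0, 0.0
for k in range(1, 1001):
    sample = random.gauss(5.0, 1.0)
    x_avg = update_sample_average(x_avg, sample, k)
    x_const = update_constant_alpha(x_const, sample)

print(x_avg, x_const)  # both settle near 5.0; x_const would also track a drifting mean
```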
Update equations of the form (3.8) thus enable us to compute the transition probabilities $p_{ss'}$ and rewards $r_{ss'}$ incrementally. But how can we perform the policy iteration adaptively? It would be extremely computationally intensive to determine it from scratch in every update step. The inherently iterative approach in fact permits a derived adaptive variant, asynchronous dynamic programming (ADP). Here the sequence of policies and action-value functions is computed almost in step with the real-time interaction, coupled with additional internal updates in order to ensure convergence.
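As a rough illustration of the asynchronous idea (a sketch under assumptions, not the book's exact algorithm: it presumes a small tabular model given by hypothetical dictionaries P and R), only the currently visited state is backed up in place instead of sweeping the whole state space in every step:

```python
GAMMA = 0.9  # discount factor, chosen for illustration

def async_backup(state, Q, P, R):
    """One in-place backup of a single state's action values.

    Q[s][a] : current action-value estimates
    P[s][a] : list of (probability, next_state) pairs (assumed model)
    R[s][a] : expected immediate reward (assumed model)
    """
    for a in P[state]:
        Q[state][a] = R[state][a] + GAMMA * sum(
            prob * max(Q[s_next].values()) for prob, s_next in P[state][a]
        )

# During real-time interaction one would call async_backup(current_state, Q, P, R)
# after each transition; over time every state is updated often enough.
```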
This is also interesting in that it enables the explorative mode to be designed in a much more sophisticated way than just using the simple ε-greedy and softmax policies. In explorative mode, therefore, those actions are selected that can most quickly reduce the statistical uncertainty in our system, thus ensuring the most rapid convergence. Selecting the correct actions is generally one of the most interesting subject areas in RL.
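For reference, a compact sketch of the two simple policies mentioned above (illustrative helper names; q_values is assumed to be a list of action-value estimates for the current state):

```python
import math
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def softmax_policy(q_values, temperature=1.0):
    """Sample an action with probability proportional to exp(q / temperature)."""
    prefs = [math.exp(q / temperature) for q in q_values]
    total = sum(prefs)
    threshold = random.random() * total
    cumulative = 0.0
    for action, p in enumerate(prefs):
        cumulative += p
        if threshold <= cumulative:
            return action
    return len(q_values) - 1  # numerical fallback
```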
3.8 The Model-Free Approach
The determination of the transition probabilities $p_{ss'}$ and rewards $r_{ss'}$ is often computationally and memory intensive, especially for large state and action spaces $S$ and $A$. This raises the question of whether the Bellman equation (3.4) can also be solved without a model of the environment. In fact, this is possible, and the corresponding method is referred to as model-free. In the following, we shall briefly present the most important model-free algorithm, temporal-difference learning (TD), developed by Sutton.
The model-free approach is based on learning by iterative adaptation of the
action-value function q(s, a). We begin with the simple TD(0) method. At every
step t of the episode, the update is performed as follows:
$$q(s_t, a_t) := q(s_t, a_t) + \alpha_t \big( r_{t+1} + \gamma\, q(s_{t+1}, a_{t+1}) - q(s_t, a_t) \big) \qquad (3.10)$$
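A one-to-one reading of update (3.10) in Python might look as follows (a sketch, assuming a tabular action-value function stored in a dictionary Q keyed by (state, action) pairs, and that the next action has already been chosen by the current policy):

```python
def td0_update(Q, s_t, a_t, r_next, s_next, a_next, alpha=0.1, gamma=0.9):
    """Apply update (3.10) to the tabular action-value estimate Q[(s, a)]."""
    target = r_next + gamma * Q[(s_next, a_next)]      # r_{t+1} + gamma * q(s_{t+1}, a_{t+1})
    Q[(s_t, a_t)] += alpha * (target - Q[(s_t, a_t)])  # move the estimate toward the target
```

Called once per time step of the episode, this nudges $q(s_t, a_t)$ a little toward the TD target.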
Obviously, the equation is of the same form as the update (3.8), where $\tilde{q}(s_t, a_t) := r_{t+1} + \gamma\, q(s_{t+1}, a_{t+1})$ is the target variable to be estimated. Please note that here, $t$ acts as an index for the state-action pair $(s_t, a_t)$ rather than the update step $k$ for the value $q(s_t, a_t)$. Strictly speaking, we must write

$$q_{k+1}(s_t, a_t) = q_k(s_t, a_t) + \alpha \big( r_{t+1} + \gamma\, q_k(s_{t+1}, a_{t+1}) - q_k(s_t, a_t) \big).$$
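The distinction between the pair index $t$ and the update step $k$ can be made explicit by keeping one update counter per state-action pair, as in this hypothetical sketch (one possible realization, here combined with a sample-average step size $\alpha_k = 1/k$):

```python
from collections import defaultdict

Q = defaultdict(float)  # action-value estimates q_k(s, a), keyed by (s, a)
N = defaultdict(int)    # update counter k, maintained separately for each pair

def td0_update_sample_average(s_t, a_t, r_next, s_next, a_next, gamma=0.9):
    """k counts how often the pair (s_t, a_t) itself has been updated,
    independently of the time step t of the episode."""
    N[(s_t, a_t)] += 1
    alpha_k = 1.0 / N[(s_t, a_t)]
    target = r_next + gamma * Q[(s_next, a_next)]
    Q[(s_t, a_t)] += alpha_k * (target - Q[(s_t, a_t)])
```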