In TD(λ), the value function is updated by Equation (10.17) through the eligibility trace e(s). A complete algorithm for on-line TD(λ) is given in Algorithm 10.2.
Algorithm 10.2 TD(λ) Algorithm.
Initialize V(s) arbitrarily and e(s) = 0 for all s ∈ S
Repeat (for each episode)
    Initialize s
    Repeat (for each step of episode)
        a ← action given by π for s (e.g., ε-greedy)
        Take action a, observe r, s′
        δ ← r + γV(s′) − V(s)
        e(s) ← e(s) + 1
        for all s:
            V(s) ← V(s) + αδe(s)
            e(s) ← γλe(s)
        s ← s′
    Until s is terminal
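To make the listing concrete, here is a minimal Python sketch of tabular on-line TD(λ) with accumulating eligibility traces. The environment interface (reset() and step() returning the next state, reward, and a terminal flag), the policy function, and the parameter values are illustrative assumptions; only the update rules are taken from Algorithm 10.2.

    import numpy as np

    def td_lambda(env, policy, n_states, num_episodes,
                  alpha=0.1, gamma=0.99, lam=0.9):
        # Tabular on-line TD(lambda) with accumulating traces.
        # Assumed interface: env.reset() -> s, env.step(a) -> (s', r, done);
        # policy(s) -> a (e.g., epsilon-greedy). These details are illustrative.
        V = np.zeros(n_states)                # Initialize V(s) arbitrarily (here: zeros)
        for _ in range(num_episodes):
            e = np.zeros(n_states)            # e(s) = 0 for all s at the start of each episode
            s = env.reset()
            done = False
            while not done:                   # until s is terminal
                a = policy(s)
                s_next, r, done = env.step(a)             # take action a, observe r, s'
                # TD error; the value of a terminal successor is treated as 0
                delta = r + gamma * V[s_next] * (not done) - V[s]
                e[s] += 1.0                   # accumulating trace for the visited state
                V += alpha * delta * e        # update all states in proportion to e(s)
                e *= gamma * lam              # decay all traces
                s = s_next
        return V

Because the traces are stored as a vector, the "for all s" loop of Algorithm 10.2 becomes a single vectorized update: one TD error adjusts every recently visited state in proportion to its eligibility.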
We can combine the estimation and evaluation steps of the value function to construct a value function over state-action pairs, the Q function. In Q-learning, the learned action-value function Q directly approximates Q*, the optimal action-value function, independent of the policy being followed. The policy still has an effect in that it determines which state-action pairs are visited and updated. However, all that is required for correct convergence is that all pairs continue to be updated. This is a minimal requirement in the sense that any method guaranteed to find optimal behavior in the general case must require it. Under this assumption and a variant of the usual stochastic approximation conditions on the sequence of step-size parameters, Q_t has been shown to converge with probability 1 to Q*. The Q-learning algorithm is shown in procedural form in Algorithm 10.3.
Algorithm 10.3 Q-Learning Algorithm
Initialize Q(s, a) arbitrarily
Repeat (for each episode)
    Initialize s
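As with TD(λ), a minimal Python sketch of tabular Q-learning may help. The environment interface, the ε-greedy exploration scheme, and the parameter values are assumptions for illustration; the update Q(s, a) ← Q(s, a) + α[r + γ max_a′ Q(s′, a′) − Q(s, a)] is the standard Q-learning rule.

    import numpy as np

    def q_learning(env, n_states, n_actions, num_episodes,
                   alpha=0.1, gamma=0.99, epsilon=0.1):
        # Tabular Q-learning with an epsilon-greedy behavior policy.
        # Assumed interface: env.reset() -> s, env.step(a) -> (s', r, done).
        # These interface details and parameter values are illustrative only.
        Q = np.zeros((n_states, n_actions))   # Initialize Q(s, a) arbitrarily (here: zeros)
        for _ in range(num_episodes):
            s = env.reset()
            done = False
            while not done:                   # until s is terminal
                # Behavior policy: epsilon-greedy with respect to the current Q
                if np.random.rand() < epsilon:
                    a = np.random.randint(n_actions)
                else:
                    a = int(np.argmax(Q[s]))
                s_next, r, done = env.step(a)             # take action a, observe r, s'
                # Off-policy target uses the greedy value of s', regardless of
                # which action the behavior policy will actually take next
                target = r + gamma * np.max(Q[s_next]) * (not done)
                Q[s, a] += alpha * (target - Q[s, a])
                s = s_next
        return Q

Note that the behavior policy only determines which (s, a) pairs are visited; the learning target always uses the greedy value max_a′ Q(s′, a′), which is what makes Q-learning off-policy and lets Q approximate Q* regardless of the exploration strategy used.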