policy π from a state s is defined as the expectation of the sum of the subsequent rewards, r_{t+1}, r_{t+2}, ..., each discounted geometrically by its delay, as follows:
$$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = r_{t+1} + \gamma R_{t+1} \qquad (10.2)$$

$$V^{\pi}(s) = E_{\pi}\{ R_t \mid s_t = s \} = E_{\pi}\{ r_{t+1} + \gamma V^{\pi}(s_{t+1}) \mid s_t = s \}
            = \sum_{a} \pi(s,a) \sum_{s'} P^{a}_{ss'} \left[ R^{a}_{ss'} + \gamma V^{\pi}(s') \right] \qquad (10.3)$$
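As a minimal illustration of the recursion in Equation (10.2), the discounted return can be accumulated backwards over a sampled reward sequence. This is only a sketch; the function name and the example rewards are illustrative, not taken from the text.

```python
def discounted_returns(rewards, gamma=0.9):
    """rewards[i] holds r_{t+i+1}; returns[i] approximates R_{t+i}.
    Uses the recursion R_t = r_{t+1} + gamma * R_{t+1} from Equation (10.2)."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for i in reversed(range(len(rewards))):   # walk backwards so R_{t+1} is known first
        running = rewards[i] + gamma * running
        returns[i] = running
    return returns

# Example: three rewards observed after time t.
print(discounted_returns([1.0, 0.0, 2.0]))    # approximately [2.62, 1.8, 2.0] with gamma = 0.9
```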
Values determine a partial ordering over policies, whereby π1 ≥ π2 if and only if V^{π1}(s) ≥ V^{π2}(s) for all s. Ideally, we seek an optimal policy π*, one that is greater than or equal to all others. All such policies share the same optimal value function. According to the Bellman optimality equations, the value function V*(s) of an optimal policy π* at state s can be defined as follows.
$$V^{*}(s) = \max_{a \in A(s)} E\{ r_{t+1} + \gamma V^{*}(s_{t+1}) \mid s_t = s, a_t = a \}
           = \max_{a \in A(s)} \sum_{s'} P^{a}_{ss'} \left[ R^{a}_{ss'} + \gamma V^{*}(s') \right] \qquad (10.4)$$
Dynamic programming methods involve iteratively updating an approximation to the optimal value function. If the state-transition probabilities P^a_{ss'} and the expected rewards R^a_{ss'} are known, a typical example is value iteration, which starts with an arbitrary policy π0 and then repeatedly applies
$$\pi_{k+1}(s) = \arg\max_{a} \sum_{s'} P^{a}_{ss'} \left[ R^{a}_{ss'} + \gamma V^{\pi_k}(s') \right] \qquad (10.5)$$

$$V^{\pi_{k+1}}(s) = \sum_{a} \pi_{k+1}(s,a) \sum_{s'} P^{a}_{ss'} \left[ R^{a}_{ss'} + \gamma V^{\pi_k}(s') \right] \qquad (10.6)$$
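Below is a compact sketch of the two updates in Equations (10.5) and (10.6), assuming the dynamics are given as NumPy arrays P[a, s, s'] and R[a, s, s'] (hypothetical names and shapes, not from the text). Because the greedy policy of Equation (10.5) is deterministic, the sum over actions in Equation (10.6) collapses to the chosen action.

```python
import numpy as np

def policy_improvement_step(P, R, V, gamma=0.9):
    """P[a, s, s2]: transition probabilities; R[a, s, s2]: expected rewards;
    V: current value estimates V^{pi_k}. Returns the greedy policy of Eq. (10.5)
    and the value sweep of Eq. (10.6)."""
    # One-step backup Q[a, s] = sum_{s'} P^a_{ss'} * (R^a_{ss'} + gamma * V(s'))
    Q = np.einsum('asn,asn->as', P, R + gamma * V[None, None, :])
    policy = Q.argmax(axis=0)                  # Eq. (10.5): greedy improvement
    V_new = Q[policy, np.arange(P.shape[1])]   # Eq. (10.6): value of the chosen actions
    return policy, V_new
```

Iterating this step until the policy stops changing yields an optimal policy when the model is exact.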
In RL, without knowledge of the system's dynamics, we cannot compute the expected values in Equations (10.5) and (10.6). It is necessary to estimate the value function by iteratively updating an approximation to the optimal value function, and Monte Carlo sampling is one of the basic methods for doing so. Keeping the policy π fixed, Equation (10.7) is applied iteratively to obtain approximate solutions.
$$V(s_t) \leftarrow V(s_t) + \alpha \left[ R_t - V(s_t) \right] \qquad (10.7)$$
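A minimal sketch of the constant-α Monte Carlo update in Equation (10.7), assuming a completed episode is available as a list of (state, reward) pairs (the function and variable names are illustrative). The return R_t is accumulated backwards exactly as in Equation (10.2).

```python
def mc_update(V, episode, alpha=0.1, gamma=0.9):
    """episode: list of (s_t, r_{t+1}) pairs from one complete rollout under pi.
    Applies V(s_t) <- V(s_t) + alpha * (R_t - V(s_t)) for every visited state."""
    G = 0.0
    for s, r in reversed(episode):        # walking backwards, G equals R_t at step t
        G = r + gamma * G
        V[s] = V[s] + alpha * (G - V[s])  # Eq. (10.7)
    return V
```

This is the every-visit variant; a first-visit variant would update only the first occurrence of each state in the episode.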
Combining the Monte Carlo method with dynamic programming, Equation (10.8) gives the iterative update of temporal-difference (TD) learning.
$$V(s_t) \leftarrow V(s_t) + \alpha \left[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right] \qquad (10.8)$$
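For comparison, a minimal sketch of the TD(0) update in Equation (10.8), again with illustrative names; unlike the Monte Carlo rule, it bootstraps from the current estimate V(s_{t+1}) and can therefore be applied after every single transition rather than only at the end of an episode.

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """One application of Eq. (10.8) for the transition (s_t, r_{t+1}, s_{t+1})."""
    td_error = r + gamma * V[s_next] - V[s]   # temporal-difference error
    V[s] = V[s] + alpha * td_error
    return V
```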