10.3 Dynamic Programming
The term dynamic programming (DP) refers to a collection of algorithms that can
be used to compute optimal policies given a perfect model of the environment as
a Markov decision process (MDP). Classical DP algorithms are of limited utility
in RL because of their assumption of a perfect model and their great
computational expense, but they are still important theoretically. DP provides an
essential foundation for the understanding of the methods presented in the rest of
this chapter. In fact, all of these methods can be viewed as attempts to achieve the
same effect as DP, with less computation and without assuming a perfect model
of the environment.
First we consider how to compute the state-value function V^π for an arbitrary policy π. This is called policy evaluation in the DP literature. We also refer to it as the prediction problem. For all s in S,
V^{\pi}(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a) \left[ R(s, s') + \gamma V^{\pi}(s') \right]    (10.9)
where π(a|s) is the probability of taking action a in state s under policy π, and quantities subscripted by π indicate that they are conditional on π being followed. The existence and uniqueness of V^π are guaranteed as long as either γ < 1 or eventual termination is guaranteed from all states under the policy π.
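As a concrete illustration, the following sketch applies equation (10.9) as a repeated sweep over all states until V^π stops changing. It is a minimal sketch assuming a small, randomly generated toy MDP; the tables P and R, the uniform policy pi, and the tolerance tol are hypothetical choices for illustration, not taken from the text.

import numpy as np

# Hypothetical toy MDP matching the terms of equation (10.9):
# P[s, a, s'] = P(s'|s, a), R[s, s'] = R(s, s'), pi[s, a] = pi(a|s).
n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)          # normalize so each (s, a) row is a distribution
R = rng.random((n_states, n_states))       # reward for moving from s to s'
pi = np.full((n_states, n_actions), 1.0 / n_actions)  # uniform random policy

def policy_evaluation(P, R, pi, gamma, tol=1e-8):
    """Iteratively apply equation (10.9) until V^pi converges."""
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            # V(s) = sum_a pi(a|s) sum_s' P(s'|s,a) [R(s,s') + gamma V(s')]
            v_new = sum(
                pi[s, a] * sum(
                    P[s, a, s2] * (R[s, s2] + gamma * V[s2])
                    for s2 in range(n_states)
                )
                for a in range(n_actions)
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V

print(policy_evaluation(P, R, pi, gamma))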
Fig. 10.3 illustrates the first step of the computation: the values of the three successor states of s_t are all known. Under policy π, action a is taken with probability π(a|s). For each action, the environment may respond with one of several next states s', together with a reward r. The Bellman equation averages over all of these possibilities, weighting each by its probability of occurring. Note that the value of the starting state must equal the discounted value of the next state, γV(s'), plus the reward obtained along the path. In dynamic programming, if n and m are the numbers of states and actions, a DP method is guaranteed to find an optimal policy in polynomial time, even though the total number of deterministic policies is m^n. In this sense, dynamic programming is exponentially faster than any direct search in policy space, since its running time is polynomial in the numbers of actions and states. However, if the number of states grows exponentially with the number of state variables, dynamic programming still suffers from the curse of dimensionality.
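For instance, a hypothetical MDP with n = 10 states and m = 4 actions already admits 4^10 ≈ 1.05 × 10^6 distinct deterministic policies, whereas a single DP sweep of equation (10.9) or (10.10) performs only on the order of n × m × n = 400 backup operations.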
In dynamic programming, the iterative equation (10.10) can be derived from equation (10.4):
V_{t+1}(s) = \max_{a} \sum_{s'} P(s' \mid s, a) \left[ R(s, s') + \gamma V_{t}(s') \right]    (10.10)
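To make the max-backup of equation (10.10) concrete, here is a minimal value-iteration sketch. It reuses the hypothetical P, R, and gamma defined in the policy-evaluation sketch above; the convergence threshold and the final greedy-policy extraction are assumptions for illustration.

def value_iteration(P, R, gamma, tol=1e-8):
    """Repeatedly apply the max-backup of equation (10.10) until V stops changing."""
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            # V_{t+1}(s) = max_a sum_s' P(s'|s,a) [R(s,s') + gamma V_t(s')]
            v_new = max(
                sum(P[s, a, s2] * (R[s, s2] + gamma * V[s2]) for s2 in range(n_states))
                for a in range(n_actions)
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            break
    # Extract a greedy policy from the converged values (one action index per state).
    greedy = np.array([
        max(range(n_actions),
            key=lambda a: sum(P[s, a, s2] * (R[s, s2] + gamma * V[s2])
                              for s2 in range(n_states)))
        for s in range(n_states)
    ])
    return V, greedy

V_star, pi_star = value_iteration(P, R, gamma)
print(V_star, pi_star)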