10.3 Dynamic Programming
The term dynamic programming (DP) refers to a collection of algorithms that can
be used to compute optimal policies given a perfect model of the environment as
a Markov decision process (MDP). Classical DP algorithms are of limited utility
in RL because of their assumption of a perfect model and their great
computational expense, but they are still important theoretically. DP provides an
essential foundation for the understanding of the methods presented in the rest of
this chapter. In fact, all of these methods can be viewed as attempts to achieve the
same effect as DP, with less computation and without assuming a perfect model
of the environment.
First we consider how to compute the state-value function V^π for an arbitrary policy π. This is called policy evaluation in the DP literature. We also refer to it as the prediction problem. For all s in S,
V^{\pi}(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a) \left[ R(s, s') + \gamma V^{\pi}(s') \right]    (10.9)
where π(a|s) is the probability of taking action a in state s under policy π, and quantities subscripted by π indicate that they are conditional on π being followed. The existence and uniqueness of V^π are guaranteed as long as either γ < 1 or eventual termination is guaranteed from all states under the policy π.
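As a concrete illustration, the following sketch applies equation (10.9) as a repeated sweep over all states until V^π stops changing. It is a minimal sketch assuming a small, randomly generated toy MDP; the tables P and R, the uniform policy pi, and the tolerance tol are hypothetical choices for illustration, not taken from the text.

import numpy as np

# Hypothetical toy MDP matching the terms of equation (10.9):
# P[s, a, s'] = P(s'|s, a), R[s, s'] = R(s, s'), pi[s, a] = pi(a|s).
n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)          # normalize so each (s, a) row is a distribution
R = rng.random((n_states, n_states))       # reward for moving from s to s'
pi = np.full((n_states, n_actions), 1.0 / n_actions)  # uniform random policy

def policy_evaluation(P, R, pi, gamma, tol=1e-8):
    """Iteratively apply equation (10.9) until V^pi converges."""
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            # V(s) = sum_a pi(a|s) sum_s' P(s'|s,a) [R(s,s') + gamma V(s')]
            v_new = sum(
                pi[s, a] * sum(
                    P[s, a, s2] * (R[s, s2] + gamma * V[s2])
                    for s2 in range(n_states)
                )
                for a in range(n_actions)
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V

print(policy_evaluation(P, R, pi, gamma))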
Fig. 10.3 illustrates the first step of the computation: the values of the three successor states of s_t are all known. Under policy π, action a is taken with probability π(a|s). For each action, the environment may respond with one of several next states s', together with a reward r. The Bellman equation averages over all of these possibilities, weighting each by its probability of occurring. Note that the value of the starting state must equal the discounted value of the next state, γV(s'), plus the reward obtained along the path. In dynamic programming, if n and m are the numbers of states and actions, a DP method is guaranteed to find an optimal policy in polynomial time, even though the total number of deterministic policies is m^n. In this sense, dynamic programming is exponentially faster than any direct search in policy space, since its running time is polynomial in the numbers of actions and states. However, if the number of states grows exponentially with the number of state variables, dynamic programming still suffers from the curse of dimensionality.
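For instance, a hypothetical MDP with n = 10 states and m = 4 actions already admits 4^10 ≈ 1.05 × 10^6 distinct deterministic policies, whereas a single DP sweep of equation (10.9) or (10.10) performs only on the order of n × m × n = 400 backup operations.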
In dynamic programming, the iterative equation (10.10) can be derived from equation (10.4):
V_{t+1}(s) = \max_{a} \sum_{s'} P(s' \mid s, a) \left[ R(s, s') + \gamma V_{t}(s') \right]    (10.10)
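To make the max-backup of equation (10.10) concrete, here is a minimal value-iteration sketch. It reuses the hypothetical P, R, and gamma defined in the policy-evaluation sketch above; the convergence threshold and the final greedy-policy extraction are assumptions for illustration.

def value_iteration(P, R, gamma, tol=1e-8):
    """Repeatedly apply the max-backup of equation (10.10) until V stops changing."""
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            # V_{t+1}(s) = max_a sum_s' P(s'|s,a) [R(s,s') + gamma V_t(s')]
            v_new = max(
                sum(P[s, a, s2] * (R[s, s2] + gamma * V[s2]) for s2 in range(n_states))
                for a in range(n_actions)
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            break
    # Extract a greedy policy from the converged values (one action index per state).
    greedy = np.array([
        max(range(n_actions),
            key=lambda a: sum(P[s, a, s2] * (R[s, s2] + gamma * V[s2])
                              for s2 in range(n_states)))
        for s in range(n_states)
    ])
    return V, greedy

V_star, pi_star = value_iteration(P, R, gamma)
print(V_star, pi_star)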