policy π from a state s is defined as the expectation of the sum of the subsequent rewards, r_{t+1}, r_{t+2}, ..., each discounted geometrically by its delay, as follows:
$$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = r_{t+1} + \gamma R_{t+1} \qquad (10.2)$$

$$V^{\pi}(s) = E_{\pi}\{ R_t \mid s_t = s \} = E_{\pi}\{ r_{t+1} + \gamma V^{\pi}(s_{t+1}) \mid s_t = s \}
            = \sum_{a} \pi(s,a) \sum_{s'} P^{a}_{ss'} \left[ R^{a}_{ss'} + \gamma V^{\pi}(s') \right] \qquad (10.3)$$
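As a minimal illustration of the recursion in Equation (10.2), the discounted return can be accumulated backwards over a sampled reward sequence. This is only a sketch; the function name and the example rewards are illustrative, not taken from the text.

```python
def discounted_returns(rewards, gamma=0.9):
    """rewards[i] holds r_{t+i+1}; returns[i] approximates R_{t+i}.
    Uses the recursion R_t = r_{t+1} + gamma * R_{t+1} from Equation (10.2)."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for i in reversed(range(len(rewards))):   # walk backwards so R_{t+1} is known first
        running = rewards[i] + gamma * running
        returns[i] = running
    return returns

# Example: three rewards observed after time t.
print(discounted_returns([1.0, 0.0, 2.0]))    # approximately [2.62, 1.8, 2.0] with gamma = 0.9
```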
Values determine a partial ordering over policies, whereby π1 ≥ π2 if and only if V^{π1}(s) ≥ V^{π2}(s) for all s. Ideally, we seek an optimal policy π*, one that is greater than or equal to all others. All such policies share the same optimal value function. According to the Bellman optimality equations, the value function V*(s) of an optimal policy π* at state s can be defined as follows.
$$V^{*}(s) = \max_{a \in A(s)} E\{ r_{t+1} + \gamma V^{*}(s_{t+1}) \mid s_t = s, a_t = a \}
           = \max_{a \in A(s)} \sum_{s'} P^{a}_{ss'} \left[ R^{a}_{ss'} + \gamma V^{*}(s') \right] \qquad (10.4)$$
Dynamic programming methods involve iteratively updating an approximation to the optimal value function. If the state-transition probabilities P^a_{ss'} and the expected rewards R^a_{ss'} are known, a typical example is value iteration, which starts with an arbitrary policy π0 and then repeatedly applies
$$\pi_{k+1}(s) = \arg\max_{a} \sum_{s'} P^{a}_{ss'} \left[ R^{a}_{ss'} + \gamma V^{\pi_k}(s') \right] \qquad (10.5)$$

$$V^{\pi_{k+1}}(s) = \sum_{a} \pi_{k+1}(s,a) \sum_{s'} P^{a}_{ss'} \left[ R^{a}_{ss'} + \gamma V^{\pi_k}(s') \right] \qquad (10.6)$$
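Below is a compact sketch of the two updates in Equations (10.5) and (10.6), assuming the dynamics are given as NumPy arrays P[a, s, s'] and R[a, s, s'] (hypothetical names and shapes, not from the text). Because the greedy policy of Equation (10.5) is deterministic, the sum over actions in Equation (10.6) collapses to the chosen action.

```python
import numpy as np

def policy_improvement_step(P, R, V, gamma=0.9):
    """P[a, s, s2]: transition probabilities; R[a, s, s2]: expected rewards;
    V: current value estimates V^{pi_k}. Returns the greedy policy of Eq. (10.5)
    and the value sweep of Eq. (10.6)."""
    # One-step backup Q[a, s] = sum_{s'} P^a_{ss'} * (R^a_{ss'} + gamma * V(s'))
    Q = np.einsum('asn,asn->as', P, R + gamma * V[None, None, :])
    policy = Q.argmax(axis=0)                  # Eq. (10.5): greedy improvement
    V_new = Q[policy, np.arange(P.shape[1])]   # Eq. (10.6): value of the chosen actions
    return policy, V_new
```

Iterating this step until the policy stops changing yields an optimal policy when the model is exact.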
In RL, without knowledge of the system's dynamics, we cannot compute the expected values in Equations (10.5) and (10.6). It is necessary to estimate the value function by iteratively updating an approximation to the optimal value function, and Monte Carlo sampling is one of the basic methods for doing so. Keeping the policy π fixed, Equation (10.7) is applied iteratively to obtain approximate solutions.
$$V(s_t) \leftarrow V(s_t) + \alpha \left[ R_t - V(s_t) \right] \qquad (10.7)$$
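A minimal sketch of the constant-α Monte Carlo update in Equation (10.7), assuming a completed episode is available as a list of (state, reward) pairs (the function and variable names are illustrative). The return R_t is accumulated backwards exactly as in Equation (10.2).

```python
def mc_update(V, episode, alpha=0.1, gamma=0.9):
    """episode: list of (s_t, r_{t+1}) pairs from one complete rollout under pi.
    Applies V(s_t) <- V(s_t) + alpha * (R_t - V(s_t)) for every visited state."""
    G = 0.0
    for s, r in reversed(episode):        # walking backwards, G equals R_t at step t
        G = r + gamma * G
        V[s] = V[s] + alpha * (G - V[s])  # Eq. (10.7)
    return V
```

This is the every-visit variant; a first-visit variant would update only the first occurrence of each state in the episode.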
Combining the Monte Carlo method with dynamic programming, Equation (10.8) gives the iterative update of temporal-difference (TD) learning.
$$V(s_t) \leftarrow V(s_t) + \alpha \left[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right] \qquad (10.8)$$
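For comparison, a minimal sketch of the TD(0) update in Equation (10.8), again with illustrative names; unlike the Monte Carlo rule, it bootstraps from the current estimate V(s_{t+1}) and can therefore be applied after every single transition rather than only at the end of an episode.

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """One application of Eq. (10.8) for the transition (s_t, r_{t+1}, s_{t+1})."""
    td_error = r + gamma * V[s_next] - V[s]   # temporal-difference error
    V[s] = V[s] + alpha * td_error
    return V
```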