The operator T is given a value vector V and returns a new value vector that is based on Bellman's Equation (9.6). The ith element (T V)_i of the resulting vector T V is given by

\[
(\mathrm{T}\mathbf{V})_i = \max_{a \in \mathcal{A}} \sum_{x_j \in \mathcal{X}} p(x_j \mid x_i, a)\left( r_{x_i x_j}(a) + \gamma V_j \right). \tag{9.9}
\]
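To make (9.9) concrete, the following Python sketch applies T once to a value vector: for every state it computes each action's expected immediate reward plus discounted successor value, and keeps the maximum over actions. The transition array `p`, reward array `r`, and discount `gamma` are hypothetical illustrative values, not taken from the text.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (illustrative values only).
# p[a, i, j] = p(x_j | x_i, a), r[a, i, j] = r_{x_i x_j}(a).
p = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.6, 0.4]]])
r = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.5, 0.5], [1.5, 0.0]]])
gamma = 0.9  # discount factor


def bellman_T(V):
    """Apply the operator T of Eq. (9.9) to the value vector V."""
    # Q[a, i] = sum_j p(x_j | x_i, a) * (r_{x_i x_j}(a) + gamma * V_j)
    Q = np.einsum('aij,aij->ai', p, r + gamma * V[None, None, :])
    return Q.max(axis=0)  # maximise over actions for every state


V = np.zeros(2)
print(bellman_T(V))
```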
Similarly, for a fixed policy μ the operator T_μ is based on (9.7), and is given by

\[
(\mathrm{T}_\mu \mathbf{V})_i = \sum_{x_j \in \mathcal{X}} p_\mu(x_j \mid x_i)\left( r_{x_i x_j} + \gamma V_j \right), \tag{9.10}
\]

which, in matrix notation, is T_μ V = r_μ + γ P_μ V.
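Since (9.10) is linear in V, applying T_μ amounts to the affine map given by the matrix form above. A minimal sketch, with assumed illustrative values for P_μ and r_μ (they correspond to always choosing the first action in the earlier sketch):

```python
import numpy as np

gamma = 0.9
# Assumed policy-induced transition matrix P_mu and expected reward vector r_mu,
# where (P_mu)_{ij} = p_mu(x_j | x_i) and (r_mu)_i = sum_j p_mu(x_j | x_i) r_{x_i x_j}.
P_mu = np.array([[0.9, 0.1],
                 [0.2, 0.8]])
r_mu = np.array([0.9, 1.6])


def T_mu(V):
    """Apply the operator T_mu of Eq. (9.10): T_mu V = r_mu + gamma * P_mu V."""
    return r_mu + gamma * P_mu @ V


print(T_mu(np.zeros(2)))
```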
Probably the most important property of both T and T_μ is that they form a contraction mapping with respect to the maximum norm [17]; that is, given two arbitrary vectors V, V', we have

\[
\| \mathrm{T}\mathbf{V} - \mathrm{T}\mathbf{V}' \|_\infty \le \gamma \| \mathbf{V} - \mathbf{V}' \|_\infty, \tag{9.11}
\]

and

\[
\| \mathrm{T}_\mu \mathbf{V} - \mathrm{T}_\mu \mathbf{V}' \|_\infty \le \gamma \| \mathbf{V} - \mathbf{V}' \|_\infty, \tag{9.12}
\]

where ‖V‖_∞ = max_i |V_i| is the maximum norm of V. Thus, every update with T or T_μ reduces the maximum distance between V and V' by at least the factor γ. Applying them repeatedly therefore leads to a fixed point T V = V or T_μ V = V which, according to the Banach Fixed Point Theorem [230], is unique.
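The contraction property (9.12) can be checked numerically: one application of T_μ shrinks the maximum-norm distance between any two value vectors by at least the factor γ. A small sketch using the affine form of T_μ with the same assumed numbers as above:

```python
import numpy as np

gamma = 0.9
P_mu = np.array([[0.9, 0.1], [0.2, 0.8]])   # assumed row-stochastic P_mu
r_mu = np.array([0.9, 1.6])                 # assumed expected reward vector

T_mu = lambda V: r_mu + gamma * P_mu @ V

rng = np.random.default_rng(0)
V1, V2 = rng.normal(size=2), rng.normal(size=2)

lhs = np.max(np.abs(T_mu(V1) - T_mu(V2)))   # ||T_mu V1 - T_mu V2||_inf
rhs = gamma * np.max(np.abs(V1 - V2))       # gamma * ||V1 - V2||_inf
assert lhs <= rhs + 1e-12                   # Eq. (9.12) holds
print(lhs, rhs)
```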
Further properties of the DP operators are that the optimal value vector V* and the value vector V^μ for policy μ are the unique vectors that satisfy T V* = V* and T_μ V^μ = V^μ, respectively, which follows from Bellman's Equations (9.6) and (9.7). As these vectors are the fixed points of T and T_μ, applying the operators repeatedly causes convergence to these vectors, that is, V* = lim_{n→∞} T^n V and V^μ = lim_{n→∞} T_μ^n V for an arbitrary V, where T^n and T_μ^n denote n applications of T and T_μ, respectively. A policy μ is optimal if and only if T_μ V* = T V*. Note that, even though V* is unique, there can be several optimal policies [17].
9.2.2 Value Iteration and Policy Iteration
The method of value iteration is a straightforward application of the contraction property of T and is based on applying T repeatedly to an initially arbitrary value vector V until it converges to the optimal value vector V*. Convergence can only be guaranteed after an infinite number of steps, but the value vector V is usually already close to V* after a few iterations.
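A minimal value iteration sketch for the hypothetical MDP used earlier: T is applied repeatedly until successive value vectors differ by less than a tolerance in the maximum norm, which by the contraction argument above brings V arbitrarily close to V*.

```python
import numpy as np

# Hypothetical MDP (same illustrative arrays as in the earlier sketch).
p = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.6, 0.4]]])
r = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.5, 0.5], [1.5, 0.0]]])
gamma = 0.9


def bellman_T(V):
    """One application of the operator T of Eq. (9.9)."""
    Q = np.einsum('aij,aij->ai', p, r + gamma * V[None, None, :])
    return Q.max(axis=0)


def value_iteration(tol=1e-8):
    """Apply T repeatedly until the value vector stops changing (within tol)."""
    V = np.zeros(p.shape[1])                  # arbitrary initial value vector
    while True:
        V_new = bellman_T(V)
        if np.max(np.abs(V_new - V)) < tol:   # maximum-norm stopping criterion
            return V_new
        V = V_new


print(value_iteration())
```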
As an alternative to value iteration, policy iteration will converge after a finite number of policy evaluation and policy improvement steps. Given a fixed policy μ_t, policy evaluation finds the value vector for this policy by solving T_{μ_t} V^{μ_t} = V^{μ_t}. The policy improvement step generates a new policy μ_{t+1} based on the current value vector V^{μ_t}.
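A compact policy iteration sketch for the same hypothetical MDP: policy evaluation solves the linear system (I − γ P_μ) V = r_μ, which is equivalent to T_μ V = V, and the improvement step picks actions greedily with respect to the resulting value vector; the loop stops once the policy no longer changes, at which point T_μ V* = T V*.

```python
import numpy as np

# Hypothetical MDP (same illustrative arrays as before).
p = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.6, 0.4]]])
r = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.5, 0.5], [1.5, 0.0]]])
gamma = 0.9
n_states = p.shape[1]


def evaluate(policy):
    """Policy evaluation: solve T_mu V = V, i.e. (I - gamma P_mu) V = r_mu."""
    P_mu = p[policy, np.arange(n_states)]                        # rows p_mu(. | x_i)
    r_mu = np.einsum('ij,ij->i', P_mu, r[policy, np.arange(n_states)])
    return np.linalg.solve(np.eye(n_states) - gamma * P_mu, r_mu)


def improve(V):
    """Greedy policy improvement with respect to V."""
    Q = np.einsum('aij,aij->ai', p, r + gamma * V[None, None, :])
    return Q.argmax(axis=0)


policy = np.zeros(n_states, dtype=int)        # arbitrary initial policy
while True:
    V = evaluate(policy)
    new_policy = improve(V)
    if np.array_equal(new_policy, policy):    # policy stable: improvement changes nothing
        break
    policy = new_policy

print(policy, V)
```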