The operator T is given a value vector V and returns a new value vector that is based on Bellman's Equation (9.6). The i-th element (T V)_i of the resulting vector T V is given by

(T V)_i = \max_{a \in \mathcal{A}} \sum_{x_j \in \mathcal{X}} p(x_j \mid x_i, a) \left( r_{x_i x_j}(a) + \gamma V_j \right). \qquad (9.9)
Similarly, for a fixed policy μ the operator T_μ is based on (9.7), and is given by

(T_\mu V)_i = \sum_{x_j \in \mathcal{X}} p^\mu(x_j \mid x_i) \left( r_{x_i x_j} + \gamma V_j \right), \qquad (9.10)

which, in matrix notation, is T_μ V = r_μ + γ P_μ V.
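To make the two operators concrete, the following is a minimal numpy sketch of (9.9) and (9.10). The array layout is an assumption made here rather than something defined in the text: P[a, i, j] stands for p(x_j | x_i, a), R[a, i, j] for r_{x_i x_j}(a), and P_mu, r_mu are the transition matrix and expected reward vector of a fixed policy μ.

import numpy as np

def bellman_T(V, P, R, gamma):
    # (T V)_i = max_a sum_j p(x_j|x_i,a) (r_{x_i x_j}(a) + gamma V_j), cf. (9.9)
    expected_r = np.einsum('aij,aij->ai', P, R)   # sum_j p(.) r(.) for each (a, i)
    Q = expected_r + gamma * P @ V                # P @ V has shape (A, N)
    return Q.max(axis=0)                          # maximise over actions

def bellman_T_mu(V, P_mu, r_mu, gamma):
    # Matrix form of (9.10): T_mu V = r_mu + gamma P_mu V, with
    # (P_mu)_{ij} = p^mu(x_j|x_i) and (r_mu)_i the expected reward under mu.
    return r_mu + gamma * P_mu @ V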
Probably the most important property of both T and T_μ is that they form a contraction mapping with respect to the maximum norm [17]; that is, given two arbitrary vectors V, V', we have

\| T V - T V' \|_\infty \le \gamma \| V - V' \|_\infty, \qquad (9.11)

and

\| T_\mu V - T_\mu V' \|_\infty \le \gamma \| V - V' \|_\infty, \qquad (9.12)

where \| V \|_\infty = \max_i |V_i| is the maximum norm of V. Thus, every update with T or T_μ reduces the maximum distance between V and V' by at least the factor γ. Applying them repeatedly will therefore lead us to some fixed point T V = V or T_μ V = V that is, according to the Banach Fixed Point Theorem [230], unique.
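The contraction property (9.12) is easy to verify numerically. The snippet below applies T_μ in its matrix form to two arbitrary value vectors; P_mu and r_mu are made-up placeholder values, not anything defined in the text.

import numpy as np

rng = np.random.default_rng(0)
N, gamma = 4, 0.9
P_mu = rng.random((N, N))
P_mu /= P_mu.sum(axis=1, keepdims=True)          # each row of P_mu is a distribution
r_mu = rng.random(N)

V1, V2 = rng.random(N), rng.random(N)            # two arbitrary value vectors
TV1 = r_mu + gamma * P_mu @ V1                   # T_mu V
TV2 = r_mu + gamma * P_mu @ V2                   # T_mu V'
lhs = np.abs(TV1 - TV2).max()                    # ||T_mu V - T_mu V'||_inf
rhs = gamma * np.abs(V1 - V2).max()              # gamma ||V - V'||_inf
assert lhs <= rhs + 1e-12                        # inequality (9.12) holds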
Further properties of the DP operators are that the optimal value vector V* and the value vector V^μ for policy μ are the unique vectors that satisfy T V* = V* and T_μ V^μ = V^μ, respectively, which follows from Bellman's Equations (9.6) and (9.7). As these vectors are the fixed points of T and T_μ, applying the operators repeatedly causes convergence to these vectors, that is, V* = \lim_{n\to\infty} T^n V and V^μ = \lim_{n\to\infty} T_\mu^n V for an arbitrary V, where T^n and T_μ^n denote n applications of T and T_μ, respectively. A policy μ is optimal if and only if T_μ V* = T V*. Note that, even though V* is unique, there can be several optimal policies [17].
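The fixed-point property for a fixed policy can also be illustrated directly: solving the matrix form of (9.10) for V gives V^μ = (I - γ P_μ)^{-1} r_μ, and repeated application of T_μ converges to the same vector. The values below are again arbitrary assumptions used only for the sketch.

import numpy as np

rng = np.random.default_rng(1)
N, gamma = 4, 0.9
P_mu = rng.random((N, N))
P_mu /= P_mu.sum(axis=1, keepdims=True)
r_mu = rng.random(N)

V_mu = np.linalg.solve(np.eye(N) - gamma * P_mu, r_mu)   # unique V with T_mu V = V
V = np.zeros(N)                                          # arbitrary starting vector
for _ in range(500):
    V = r_mu + gamma * P_mu @ V                          # V <- T_mu V
print(np.abs(V - V_mu).max())                            # shrinks towards 0 with n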
9.2.2 Value Iteration and Policy Iteration
The method of value iteration is a straightforward application of the contraction property of T and is based on applying T repeatedly to an initially arbitrary value vector V until it converges to the optimal value vector V*. Convergence can only be guaranteed after an infinite number of steps, but the value vector V is usually already close to V* after a few iterations.
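The loop below sketches value iteration under the same assumed P[a, i, j] and R[a, i, j] layout used earlier; the max-norm stopping tolerance is an added practical detail, not part of the text.

import numpy as np

def value_iteration(P, R, gamma, tol=1e-8, max_iter=100_000):
    # Repeatedly apply T from (9.9) until successive value vectors differ by
    # less than tol in the maximum norm; the result approximates V*.
    _, N, _ = P.shape
    V = np.zeros(N)
    expected_r = np.einsum('aij,aij->ai', P, R)   # sum_j p(x_j|x_i,a) r_{x_i x_j}(a)
    for _ in range(max_iter):
        V_new = (expected_r + gamma * P @ V).max(axis=0)   # one application of T
        if np.abs(V_new - V).max() < tol:                  # max-norm stopping rule
            return V_new
        V = V_new
    return V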
As an alternative to value iteration, policy iteration will converge after a finite number of policy evaluation and policy improvement steps. Given a fixed policy μ_t, policy evaluation finds the value vector for this policy by solving T_{μ_t} V^{μ_t} = V^{μ_t}. The policy improvement step generates a new policy μ_{t+1} based on the
 