9.1.3 Problem Types
The three basic classes of infinite horizon problems are stochastic shortest path problems, discounted problems, and average reward per step problems, all of which are well described by Bertsekas and Tsitsiklis [17]. Here, only discounted problems and stochastic shortest path problems are considered, and for the latter only proper policies that are guaranteed to reach the desired terminal state are assumed. As the analysis of stochastic shortest path problems is very similar to that of discounted problems, only discounted problems are considered explicitly. These are characterised by γ < 1 and a bounded reward function, which together make the values V^μ(x) well defined.
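To see why these two conditions matter, consider the following minimal Python sketch (not from the text; the discount factor, reward bound, and reward sequence are hypothetical): for any bounded reward sequence with |r_t| ≤ r_max and γ < 1, the discounted return is bounded in magnitude by the geometric series r_max / (1 − γ), so the value is finite.

```python
# Minimal sketch (hypothetical values): why gamma < 1 and bounded rewards
# make the discounted value well defined. With |r_t| <= r_max, the discounted
# return is bounded by the geometric series r_max / (1 - gamma).
import random

gamma = 0.9          # discount factor, gamma < 1
r_max = 1.0          # assumed reward bound, |r_t| <= r_max

# A hypothetical bounded reward sequence along one trajectory.
rewards = [random.uniform(-r_max, r_max) for _ in range(1000)]

discounted_return = sum(gamma**t * r for t, r in enumerate(rewards))
bound = r_max / (1.0 - gamma)

print(f"discounted return: {discounted_return:.3f}")
print(f"geometric bound  : {bound:.3f}")
assert abs(discounted_return) <= bound
```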
9.1.4 Matrix Notation
Rather than representing the value function for each state explicitly, it is convenient to exploit the finiteness of X and collect the values for each state into a vector, which also simplifies the notation. Let V = (V(x_1), ..., V(x_N))^T be the vector of size N that contains the values of the value function V for each state x_n. Let V* and V^μ denote the vectors that contain the optimal value function V* and the value function V^μ for policy μ, respectively. Similarly, let P^μ = (p(x_j | x_i)) denote the transition matrix of the Markov chain for a fixed policy μ, and let r^μ = (r_{x_1}, ..., r_{x_N})^T be the vector consisting of the expected rewards when following this policy. With these definitions, Bellman's Equation for a fixed policy (9.7) can be rewritten as

V^μ = r^μ + γ P^μ V^μ .    (9.8)
This notation is used extensively in further developments.
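As a numerical illustration of (9.8), the following sketch solves the fixed-point equation directly as V^μ = (I − γ P^μ)^{-1} r^μ for a hypothetical 3-state Markov chain; the transition matrix, rewards, and discount factor are invented for the example and are not taken from the text.

```python
# Minimal sketch (transition matrix and rewards are hypothetical): solving
# Bellman's Equation in matrix form (9.8), V_mu = r_mu + gamma * P_mu * V_mu,
# directly as V_mu = (I - gamma * P_mu)^{-1} r_mu for a small fixed policy.
import numpy as np

gamma = 0.9

# P_mu[i, j] = p(x_j | x_i) under the fixed policy mu (rows sum to 1).
P_mu = np.array([[0.8, 0.2, 0.0],
                 [0.1, 0.6, 0.3],
                 [0.0, 0.0, 1.0]])

# r_mu[i] = expected reward in state x_i when following mu.
r_mu = np.array([1.0, 0.5, 0.0])

# (9.8) rearranged: (I - gamma * P_mu) V_mu = r_mu.
V_mu = np.linalg.solve(np.eye(3) - gamma * P_mu, r_mu)
print(V_mu)
```

Solving the linear system directly is only practical for small N; the iterative DP and RL methods introduced next avoid forming or inverting the full matrix.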
9.2 Dynamic Programming and Reinforcement Learning
Recall that in order to find the optimal policy μ*, we aim to learn the optimal value function V* by (9.6), or the optimal action-value function Q* for cases where the expectation in (9.6) and (9.3) is hard or impossible to evaluate. In this section, some common RL methods are introduced that learn these functions while traversing the state space, without building a model of the transition and reward functions. These methods are simulation-based approximations to DP methods, and their stability is determined by the stability of the corresponding DP method. The DP methods are therefore introduced first, after which the RL methods are derived from them.
9.2.1 Dynamic Programming Operators
Bellman's Equation (9.6) is a set of equations that cannot be solved analytically.
Fortunately, several methods have been developed that make finding its solution
easier, all of which are based on the DP operators T and T_μ.
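As a preview of how these operators are used, the sketch below assumes the standard form of the policy operator, (T_μ V)(x_i) = r_{x_i} + γ Σ_j p(x_j | x_i) V(x_j), and applies it repeatedly to an arbitrary starting vector; for γ < 1 this is a contraction whose fixed point is the V^μ of (9.8). The 3-state chain is the same hypothetical example as above.

```python
# Minimal sketch (hypothetical MDP reused from the previous example):
# repeatedly applying the operator T_mu, assumed here in its standard form
#   (T_mu V)(x_i) = r_mu[i] + gamma * sum_j p(x_j | x_i) * V(x_j),
# converges to the fixed point V_mu of (9.8).
import numpy as np

gamma = 0.9
P_mu = np.array([[0.8, 0.2, 0.0],
                 [0.1, 0.6, 0.3],
                 [0.0, 0.0, 1.0]])
r_mu = np.array([1.0, 0.5, 0.0])

V = np.zeros(3)                      # arbitrary starting point
for _ in range(200):                 # T_mu is a contraction for gamma < 1
    V = r_mu + gamma * P_mu @ V      # one application of T_mu

print(V)  # approaches the solution of V_mu = r_mu + gamma * P_mu * V_mu
```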