3 Markov Decision Processes
An MDP is defined by a transition function T and a cost function C. The probability of transitioning from state s to state s' after executing action a is given by T(s, a, s'). The immediate cost of executing a from s is given by C(s, a). In this paper, the state space S and action space A are finite [21,3].
A policy is a function π that maps states to actions. The expected sum of immediate costs when following π for K steps starting from state s is denoted J_K(s), often called the cost-to-go function. The solution to an MDP with a horizon of K is a policy π_K that minimizes the cost to go from every state.
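To make the notation concrete, the sketch below represents a small finite MDP with the transition probabilities T(s, a, s') and immediate costs C(s, a) stored as arrays. The state and action counts, the random entries, and the array names T and C are illustrative choices, not taken from the paper.

```python
import numpy as np

n_states, n_actions = 4, 2          # illustrative sizes, not from the paper
rng = np.random.default_rng(0)

# T[s, a, s'] = probability of moving from s to s' when executing a
T = rng.random((n_states, n_actions, n_states))
T /= T.sum(axis=2, keepdims=True)   # normalize so each T[s, a, :] sums to 1

# C[s, a] = immediate cost of executing a from s
C = rng.random((n_states, n_actions))
```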
One way to compute π_K is to first compute J_K, the cost-to-go function for the optimal policy, using a dynamic programming algorithm known as value iteration. The function J_0(s) is set to zero for all states s. If the state space includes terminal states with immediate cost C(s), then J_0(s) = C(s) for those terminal states. The function J_k(s) is computed from J_{k-1} as follows:
J_k(s) = \min_a \left[ C(s, a) + \sum_{s'} T(s, a, s') J_{k-1}(s') \right] .    (1)
The iteration continues until horizon K .
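A minimal value iteration sketch following Eq. (1) is shown below, assuming the arrays T and C defined earlier. It applies the backup K times starting from J_0 = 0; terminal-state costs, if any, could be placed in J before the loop.

```python
def value_iteration(T, C, K):
    """Compute J_K(s) by applying the backup in Eq. (1) K times."""
    n_states, n_actions, _ = T.shape
    J = np.zeros(n_states)           # J_0(s) = 0 for all s
    for _ in range(K):
        # Q[s, a] = C(s, a) + sum_{s'} T(s, a, s') J_{k-1}(s')
        Q = C + T @ J
        J = Q.min(axis=1)            # minimize over actions
    return J

J_K = value_iteration(T, C, K=10)
```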
The expected cost to go when executing a from s and then continuing with an optimal policy for K − 1 steps is given by

J_K(s, a) = C(s, a) + \sum_{s'} T(s, a, s') J_{K-1}(s') .    (2)
An optimal policy may be obtained directly from J_K(s, a):

\pi_K(s) = \arg\min_a J_K(s, a) .    (3)
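The same arrays give the action values of Eq. (2) and the greedy policy of Eq. (3). Again, this is only a sketch built on the hypothetical T, C, and value_iteration defined above.

```python
def optimal_policy(T, C, K):
    """Return pi_K(s) = argmin_a J_K(s, a), per Eqs. (2) and (3)."""
    J_prev = value_iteration(T, C, K - 1)   # J_{K-1}(s')
    Q = C + T @ J_prev                      # J_K(s, a), Eq. (2)
    return Q.argmin(axis=1)                 # pi_K(s), Eq. (3)

pi_K = optimal_policy(T, C, K=10)
```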
If the state space contains continuous variables, which is common for collision avoidance problems, the state space can be discretized using a multi-dimensional grid or simplex scheme [9]. The transition function T(x, a, x') in continuous space can be translated into a discrete transition function T(s, a, s') using a variety of sampling and interpolation methods [15]. Once the state space, transition function, and cost function have been discretized, J(s, a) may be computed for each discrete state s and action a. For a continuous state x and action a, J(x, a) may be approximated using, for example, multilinear interpolation. The best action to execute from continuous state x is simply argmin_a J(x, a).
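As an illustration of this interpolation step, the sketch below stores J(s, a) on a two-dimensional grid and approximates J(x, a) at a continuous state x with multilinear interpolation via SciPy's RegularGridInterpolator. The grid spacing, the table J_table, and the query point are hypothetical placeholders rather than values from the paper.

```python
import numpy as np
from scipy.interpolate import RegularGridInterpolator

rng = np.random.default_rng(0)

# Hypothetical 2-D state grid and a table of J(s, a) values on that grid
x1_grid = np.linspace(0.0, 1.0, 11)
x2_grid = np.linspace(-1.0, 1.0, 21)
n_actions = 2
J_table = rng.random((x1_grid.size, x2_grid.size, n_actions))

# One multilinear interpolant per action
interpolants = [
    RegularGridInterpolator((x1_grid, x2_grid), J_table[:, :, a])
    for a in range(n_actions)
]

def best_action(x):
    """Approximate argmin_a J(x, a) for a continuous state x."""
    J_x = np.array([f(x).item() for f in interpolants])  # J(x, a) for each a
    return int(J_x.argmin())

print(best_action([0.3, 0.25]))
```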
Discretizing the full state space can result in a number of discrete states that grows exponentially with the number of dimensions, which makes computing J infeasible for many problems. This "curse of dimensionality" [1] has led to a variety of different approximation methods [20].
4 Partial Control
This paper explores a new solution technique for partially controlled MDPs that is applicable to certain collision avoidance problems. It may be applied to interception-seeking