3 Markov Decision Processes
An MDP is defined by a transition function $T$ and cost function $C$. The probability of transitioning from state $s$ to state $s'$ after executing action $a$ is given by $T(s, a, s')$. The immediate cost when executing $a$ from $s$ is given by $C(s, a)$. In this paper, the state space $S$ and action space $A$ are finite [21, 3].
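As a concrete illustration, a finite MDP of this form can be stored as tables indexed by state and action. The two-state, two-action problem below is purely hypothetical, chosen only to show the shape of $T(s, a, s')$ and $C(s, a)$:

```python
# Toy finite MDP (hypothetical numbers): states 0..1, actions 0..1.
# T[s][a][s2] = probability of moving from s to s2 under action a.
T = [
    [[0.9, 0.1], [0.2, 0.8]],  # transitions from state 0
    [[0.5, 0.5], [0.0, 1.0]],  # transitions from state 1
]
# C[s][a] = immediate cost of executing action a from state s.
C = [
    [1.0, 0.5],
    [0.0, 2.0],
]

# Sanity check: each row T[s][a] is a probability distribution over s'.
for s in range(2):
    for a in range(2):
        assert abs(sum(T[s][a]) - 1.0) < 1e-12
```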
A policy is a function $\pi$ that maps states to actions. The expected sum of immediate costs when following $\pi$ for $K$ steps starting from state $s$ is denoted $J_K(s)$, often called the cost-to-go function. The solution to an MDP with a horizon of $K$ is a policy $\pi^*_K$ that minimizes the cost to go from every state.
One way to compute $\pi^*_K$ is to first compute $J_K$, the cost-to-go function for the optimal policy, using a dynamic programming algorithm known as value iteration. The function $J_0(s)$ is set to zero for all states $s$. If the state space includes terminal states with immediate cost $C(s)$, then $J_0(s) = C(s)$ for those terminal states. The function $J_k(s)$ is computed from $J_{k-1}$ as follows:
\[
J_k(s) = \min_a \Big[ C(s, a) + \sum_{s'} T(s, a, s') \, J_{k-1}(s') \Big] . \tag{1}
\]
The iteration continues until horizon $K$.
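The backup in Eq. (1) can be sketched in a few lines of Python. All names and the toy tables below are hypothetical, used only to make the recursion concrete:

```python
def value_iteration(T, C, K):
    """Apply the backup of Eq. (1) K times, starting from J_0(s) = 0."""
    n_states = len(C)
    n_actions = len(C[0])
    J = [0.0] * n_states  # J_0(s) = 0 for all states s
    for _ in range(K):
        # J_k(s) = min_a [ C(s,a) + sum_{s'} T(s,a,s') * J_{k-1}(s') ]
        J = [
            min(
                C[s][a] + sum(T[s][a][s2] * J[s2] for s2 in range(n_states))
                for a in range(n_actions)
            )
            for s in range(n_states)
        ]
    return J

# Toy two-state, two-action MDP (hypothetical numbers).
T = [[[0.9, 0.1], [0.2, 0.8]], [[0.5, 0.5], [0.0, 1.0]]]
C = [[1.0, 0.5], [0.0, 2.0]]
J = value_iteration(T, C, K=3)  # cost-to-go for a 3-step horizon
```

Note that with $J_0 = 0$, a single backup reduces to $J_1(s) = \min_a C(s, a)$, which is a useful check on an implementation.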
The expected cost to go when executing $a$ from $s$ and then continuing with an optimal policy for $K - 1$ steps is given by
\[
J_K(s, a) = C(s, a) + \sum_{s'} T(s, a, s') \, J_{K-1}(s') . \tag{2}
\]
An optimal policy may be obtained directly from $J_K(s, a)$:
\[
\pi^*_K(s) = \mathop{\arg\min}_a J_K(s, a) . \tag{3}
\]
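Equations (2) and (3) can be sketched together: one backup on top of $J_{K-1}$ gives the action values, and a per-state argmin gives the policy. The tables and the value of J_prev below are hypothetical toy numbers:

```python
def q_backup(T, C, J_prev):
    """J_K(s, a) = C(s,a) + sum_{s'} T(s,a,s') * J_{K-1}(s'), as in Eq. (2)."""
    n_states, n_actions = len(C), len(C[0])
    return [
        [
            C[s][a] + sum(T[s][a][s2] * J_prev[s2] for s2 in range(n_states))
            for a in range(n_actions)
        ]
        for s in range(n_states)
    ]

def greedy_policy(Q):
    """pi*_K(s) = argmin_a J_K(s, a), as in Eq. (3)."""
    return [min(range(len(row)), key=row.__getitem__) for row in Q]

# Hypothetical toy tables and a previously computed (K-1)-step cost-to-go.
T = [[[0.9, 0.1], [0.2, 0.8]], [[0.5, 0.5], [0.0, 1.0]]]
C = [[1.0, 0.5], [0.0, 2.0]]
J_prev = [0.5, 0.0]
Q = q_backup(T, C, J_prev)   # Q[s][a] = J_K(s, a)
pi = greedy_policy(Q)        # one action index per state
```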
If the state space contains continuous variables, which is common for collision avoid-
ance problems, the state space can be discretized using a multi-dimensional grid or
simplex scheme [9]. The transition function
T
(
x
,a,
x
)
in continuous space can be
translated into a discrete transition function
T
(
s, a, s
)
using a variety of sampling and
interpolation methods [15]. Once the state space, transition function, and cost function
have been discretized,
J
∗
(
s, a
)
may be computed for each discrete state
s
and action
a
. For a continuous state
x
and action
a
,
J
∗
(
x
,a
)
may be approximated using, for ex-
ample, multilinear interpolation. The best action to execute from continuous state
x
is
simply
argmin
a
J
∗
(
x
,a
)
.
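The interpolation step can be sketched for the one-dimensional special case of multilinear interpolation: given a table of $J^*$ values at grid points, a continuous state falls in one grid cell and its value is a weighted average of the two cell endpoints. The grid and table below are hypothetical:

```python
import bisect

def interp_J(grid, J_table, x, a):
    """Linearly interpolate J*(x, a) from values J_table[i][a] at sorted grid points grid[i].
    This is the 1-D special case of multilinear interpolation."""
    i = bisect.bisect_right(grid, x) - 1
    i = max(0, min(i, len(grid) - 2))            # clamp x into a valid cell
    w = (x - grid[i]) / (grid[i + 1] - grid[i])  # fractional position within the cell
    return (1 - w) * J_table[i][a] + w * J_table[i + 1][a]

def best_action(grid, J_table, x, n_actions):
    """argmin_a J*(x, a) for a continuous state x."""
    return min(range(n_actions), key=lambda a: interp_J(grid, J_table, x, a))

# Hypothetical 1-D grid and discrete cost-to-go table J_table[grid index][action].
grid = [0.0, 1.0, 2.0]
J_table = [[3.0, 1.0], [2.0, 2.0], [0.0, 4.0]]
# Midway between two grid points, each action's cost is the average of its endpoints.
```

In higher dimensions the same idea applies cell by cell, with $2^d$ corner weights per $d$-dimensional grid cell, which is one reason grid-based discretization scales poorly with dimension.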
Discretizing the full state space can result in a large number of discrete states, exponential in the number of dimensions, which makes computing $J^*$ infeasible for many problems. This "curse of dimensionality" [1] has led to a variety of different approximation methods [20].
4 Partial Control
This paper explores a new solution technique for partially controlled MDPs that is applicable to certain collision avoidance problems. It may be applied to interception-seeking