Information Technology Reference
In-Depth Information
or goal-oriented problems as well by incorporating negative costs. So long as the prob-
lem satisfies a set of assumptions, this solution method will provide a finite-horizon
solution. The approach involves independently solving a controlled subproblem and an
uncontrolled subproblem and combining the results online to identify the approximately
optimal action.
4.1
Assumptions
It is assumed that the state is represented by a set of variables, some controlled and some
uncontrolled. The state space of the controlled variables is denoted
S
c
, and the state
space of the uncontrolled variables is denoted
S
u
. The state of the controlled variables
at time
t
is denoted
s
c
(
t
)
, and the state of the uncontrolled variables at time
t
is denoted
s
u
(
t
)
. The solution technique may be applied when the following three assumptions
hold:
1. The state
s
u
(
t
+1)
depends only upon
s
u
(
t
)
. The probability of transitioning from
s
u
to
s
u
is given by
T
(
s
u
,s
u
)
.
2. The episode terminates when
s
u
∈
S
u
with immediate cost
C
(
s
c
)
.
3. In nonterminal states, the immediate cost
c
(
t
+1)
depends only upon
s
c
(
t
)
and
a
(
t
)
. If the controlled state is
s
c
and action
a
is executed, the immediate cost is
denoted
C
(
s
c
,a
)
.
G
⊂
4.2
Controlled Subproblem
Solving the controlled subproblem involves computing the optimal policy for the con-
trolled variables under the assumption that the time until
s
u
enters
G
, denoted
τ
,is
known. In an airborne collision avoidance context,
τ
may be the number of steps until
another aircraft comes within 500 ft horizontally. Of course,
τ
cannot be determined
exactly from
s
u
(
t
)
because it depends upon an event that occurs in the future, but this
will be addressed by the uncontrolled subproblem (Section 4.3).
The cost to go from
s
c
given
τ
is denoted
J
τ
(
s
c
)
.Theseries
J
0
,...,J
K
is computed
recursively, starting with
J
0
(
s
c
)=
C
(
s
c
)
and iterating as follows:
⎡
⎣
C
(
s
c
,a
)+
s
c
⎤
⎦
.
T
(
s
c
,a,s
c
)
J
k−
1
(
s
c
)
J
k
(
s
c
)=min
a
(4)
The expected cost to go from
s
c
when executing
a
for one step and then following the
optimal policy is given by
J
k
(
s
c
,a
)=
C
(
s
c
,a
)+
s
c
T
(
s
c
,a,s
c
)
J
k−
1
(
s
c
)
.
(5)
The
K
-step expected cost to go when
τ>K
is denoted
J
K
. It is computed by ini-
tializing
J
0
(
s
c
)=0
for all states and iterating equation 4 to horizon
K
.Theseries
J
0
,...,J
K
,J
K
is saved in a table in memory, requiring
O
(
K
|
A
||
S
c
|
)
entries.