gives a probability distribution to be in state s′ ∈ S after action a ∈ A was executed in state s ∈ S. The reward function R : S × A → ℝ returns an immediate reward for executing action a ∈ A in state s ∈ S. The agent executes actions from A and is able to observe the current state of the environment and the reward.
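To make these components concrete, the following Python sketch writes down δ as a map from state-action pairs to distributions over successor states, and R as a map to immediate rewards. It is illustrative only; the state and action names and the dictionary representation are assumptions, not taken from the paper.

from typing import Dict, Tuple

State, Action = str, str

# delta(s, a): probability distribution over successor states s'
delta: Dict[Tuple[State, Action], Dict[State, float]] = {
    ("s0", "go"):   {"s1": 0.8, "s0": 0.2},
    ("s1", "go"):   {"s1": 1.0},
    ("s0", "stay"): {"s0": 1.0},
    ("s1", "stay"): {"s1": 1.0},
}

# R(s, a): immediate reward for executing a in s
R: Dict[Tuple[State, Action], float] = {
    ("s0", "go"): 1.0, ("s0", "stay"): 0.0,
    ("s1", "go"): 5.0, ("s1", "stay"): 0.0,
}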
As mentioned above, an MDP M = ⟨S, A, δ, R⟩ can be formulated as a decision problem by adding a bound B. Thus, for an MDP M = ⟨S, A, δ, R, B⟩ the question is whether or not there is a policy that results in a return of at least B starting from the initial state s₀ ∈ S.
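Continuing the toy MDP sketched above, the decision problem can be checked for one candidate policy by computing its expected return from s₀ and comparing it with the bound B. The finite horizon and the deterministic policy below are assumptions made only for this sketch; the paper does not fix these details.

def expected_return(s: State, pi: Dict[State, Action], horizon: int) -> float:
    """Expected (undiscounted) return of policy pi over a fixed horizon."""
    if horizon == 0:
        return 0.0
    a = pi[s]
    ret = R[(s, a)]
    for s_next, p in delta[(s, a)].items():
        ret += p * expected_return(s_next, pi, horizon - 1)
    return ret

pi = {"s0": "go", "s1": "go"}   # one candidate policy
B = 10.0                        # the bound of the decision problem
print(expected_return("s0", pi, horizon=3) >= B)   # True for this toy MDP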
The engineered reward-based approach for single-agent systems visualized in Fig. 1(b) is formally defined as follows:
Definition 3 (Engineered Reward-Based Markov Decision Process). An engineered reward-based Markov decision process (ERBMDP) is defined by a tuple E = ⟨S, A, δ, R, enc, dec⟩, where S, A, δ, R are defined as in an MDP. enc : ℝ × S → ℝ is a polynomial time function that calculates an engineered reward which encodes original rewards and states into a single real value. Agents can only observe the engineered reward, and are able to decode these rewards using a polynomial time decoding function dec : ℝ → ℝ × S. It holds that dec(enc(R(s, a), s)) = (R(s, a), s).
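The round-trip requirement dec(enc(R(s, a), s)) = (R(s, a), s) can be met, for example, by packing a state index and a fixed-precision reward into one real value. The following sketch is only an illustration of one such pair; it is not the coding scheme of Sect. 3.2, and the precision and bound constants are assumptions.

STATES = ["s0", "s1"]      # finite state set, indexed by position
SCALE = 100                # assumption: rewards have at most 2 decimal places
OFFSET = 10_000            # assumption: SCALE * |reward| < OFFSET
BLOCK = 2 * OFFSET         # width of one state's block on the real line

def enc(r: float, s: str) -> float:
    """Pack reward r and state s into a single engineered reward."""
    return float(STATES.index(s) * BLOCK + round(r * SCALE) + OFFSET)

def dec(x: float):
    """Recover the pair (reward, state) from an engineered reward."""
    idx, rest = divmod(int(x), BLOCK)
    return (rest - OFFSET) / SCALE, STATES[idx]

# Round-trip property required by Definition 3
assert dec(enc(1.0, "s1")) == (1.0, "s1")
assert dec(enc(-3.25, "s0")) == (-3.25, "s0")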
Using these definitions and the general coding approach presented in the previous
section, we can now show the following equivalence result:
Theorem 1.
Engineered reward-based Markov decision processes (ERBMDP)
and Markov decision processes (MDP) are equivalent.
Proof. We have to show that the decision problem variants are mutually reducible in polynomial time, i.e. ERBMDP ≤p MDP and MDP ≤p ERBMDP. To construct the decision problems, we add bounds B and B′ to the respective tuples.
1. ERBMDP ≤p MDP: An arbitrary ERBMDP E = ⟨S, A, δ, R, enc, dec, B⟩ can be transformed into an MDP M = ⟨S′, A′, δ′, R′, B′⟩ in polynomial time by setting S′ = S, A′ = A, δ′ = δ, R′ = R, and B′ = B. The agents also have to be allowed to observe the state signal. Since the encoding and decoding functions, enc and dec, are only required to construct the engineered reward and because dec(enc(R(s, a), s)) = (R(s, a), s) holds by definition, these functions have no influence on the expected return. Thus, whenever a policy in E results in a return of at least B, this also holds for the MDP M.
2. MDP ≤p ERBMDP: Let M = ⟨S, A, δ, R, B⟩ be an arbitrary MDP. Then, an ERBMDP E = ⟨S′, A′, δ′, R′, enc, dec, B′⟩ can be obtained by forbidding the agent to observe the state signal and by setting S′ = S, A′ = A, δ′ = δ, R′ = R, and B′ = B. It remains to show that there are polynomial time algorithms for encoding and decoding such that dec(enc(R(s, a), s)) = (R(s, a), s) holds; this was done in Sect. 3.2. Finally, as before, a policy for M that results in a return of at least B will result in the same return in E. Both reductions are sketched below.
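The two reductions used in the proof amount to very small transformations of the defining tuples. The following sketch makes them explicit; representing the tuples as plain dictionaries and passing in an arbitrary enc/dec pair satisfying Definition 3 (e.g. the one sketched above) are assumptions made for illustration.

def erbmdp_to_mdp(erbmdp: dict) -> dict:
    """ERBMDP <=p MDP: drop enc/dec and let the agent observe the state signal."""
    return {k: erbmdp[k] for k in ("S", "A", "delta", "R", "B")}

def mdp_to_erbmdp(mdp: dict, enc, dec) -> dict:
    """MDP <=p ERBMDP: add an enc/dec pair; the agent observes only enc(R(s, a), s)."""
    erbmdp = dict(mdp)   # S' = S, A' = A, delta' = delta, R' = R, B' = B
    erbmdp["enc"], erbmdp["dec"] = enc, dec
    return erbmdp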
It is easy to see that the same result also holds for fully observable multiagent
settings, i.e. stochastic games and engineered reward-based stochastic games, as
they are constructed analogously to the single agent case using joint actions.