gives a probability distribution over the successor state s' ∈ S after action a ∈ A was executed in state s ∈ S. The reward function R : S × A → ℝ returns an immediate reward for executing action a ∈ A in state s ∈ S. The agent executes actions from A and is able to observe the current state of the environment and the reward.
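As a minimal sketch of these components, the following assumes a finite state and action set; the class and field names (MDP, delta, reward, start) are illustrative and not taken from the text:

from dataclasses import dataclass
from typing import Dict, List, Tuple

State = str
Action = str

@dataclass
class MDP:
    states: List[State]                               # S
    actions: List[Action]                             # A
    delta: Dict[Tuple[State, Action, State], float]   # transition probabilities
    reward: Dict[Tuple[State, Action], float]         # immediate rewards R(s, a)
    start: State                                      # initial state s0

# A tiny two-state example: action "go" leaves s0 for s1 with probability 0.9.
example = MDP(
    states=["s0", "s1"],
    actions=["go"],
    delta={("s0", "go", "s1"): 0.9, ("s0", "go", "s0"): 0.1,
           ("s1", "go", "s1"): 1.0},
    reward={("s0", "go"): 1.0, ("s1", "go"): 0.0},
    start="s0",
)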
As mentioned above, an MDP M = ⟨S, A, δ, R⟩ can be formulated as a decision problem by adding a bound B. Thus, for an MDP M = ⟨S, A, δ, R, B⟩ the question is whether or not there is a policy that results in a return of at least B starting from the initial state s₀ ∈ S.
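A hedged sketch of this decision question, reusing the hypothetical MDP class above; a discounted return computed by value iteration is assumed purely for illustration, since the excerpt does not fix the return criterion:

# Is there a policy whose expected return from s0 is at least B?
def meets_bound(mdp: MDP, B: float, gamma: float = 0.95, iters: int = 1000) -> bool:
    V = {s: 0.0 for s in mdp.states}
    for _ in range(iters):                     # value iteration
        V = {s: max(mdp.reward.get((s, a), 0.0)
                    + gamma * sum(mdp.delta.get((s, a, t), 0.0) * V[t]
                                  for t in mdp.states)
                    for a in mdp.actions)
             for s in mdp.states}
    return V[mdp.start] >= B                   # compare optimal return with the bound

print(meets_bound(example, B=1.0))             # True for the example MDP above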
The engineered reward-based approach for single-agent systems visualized in Fig. 1(b) is formally defined as follows:
Definition 3. (Engineered Reward-Based Markov Decision Process) An engineered reward-based Markov decision process (ERBMDP) is defined by a tuple E = ⟨S, A, δ, R, enc, dec⟩, where S, A, δ, R are defined as in an MDP. enc : ℝ × S → ℝ is a polynomial time function that calculates an engineered reward which encodes original rewards and states into a single real value. Agents can only observe the engineered reward, and are able to decode these rewards using a polynomial time decoding function dec : ℝ → ℝ × S. It holds that dec(enc(R(s, a), s)) = (R(s, a), s).
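The excerpt only states this round-trip property; the concrete coding scheme is the one of Sect. 3.2, which is not reproduced here. The following is one possible polynomial-time pair of functions under the simplifying assumption of integer-valued rewards and indexed states, as an illustrative stand-in only:

# Assumes integer rewards and states indexed 0 .. |S|-1; not the paper's scheme.
def make_codec(states):
    n = len(states)
    index = {s: i for i, s in enumerate(states)}

    def enc(r, s):
        # Pack reward and state into a single value: r * |S| + idx(s).
        return r * n + index[s]

    def dec(x):
        # Invert the packing; floor division/modulo also handle negative rewards.
        return x // n, states[x % n]

    return enc, dec

enc, dec = make_codec(["s0", "s1"])
assert dec(enc(3, "s1")) == (3, "s1")
assert dec(enc(-2, "s0")) == (-2, "s0")        # dec(enc(R(s, a), s)) = (R(s, a), s)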
Using these definitions and the general coding approach presented in the previous
section, we can now show the following equivalence result:
Theorem 1. Engineered reward-based Markov decision processes (ERBMDP)
and Markov decision processes (MDP) are equivalent.
Proof. We have to show that the decision problem variants are mutually reducible in polynomial time, i.e. ERBMDP ≤_p MDP and MDP ≤_p ERBMDP. To construct a decision problem, we add bounds B / B' to the respective tuples.
1. ERBMDP ≤_p MDP: An arbitrary ERBMDP E = ⟨S, A, δ, R, enc, dec, B⟩ can be transformed into an MDP M = ⟨S', A', δ', R', B'⟩ in polynomial time by setting S' = S, A' = A, δ' = δ, R' = R, and B' = B. The agents also have to be allowed to observe the state signal. Since the encoding and decoding functions, enc and dec, are only required to construct the engineered reward and because dec(enc(R(s, a), s)) = (R(s, a), s) holds by definition, these functions have no influence on the expected return. Thus, whenever a policy in E results in a return of at least B, this also holds for the MDP M.
2. MDP ≤_p ERBMDP: Let M = ⟨S, A, δ, R, B⟩ be an arbitrary MDP. Then, an ERBMDP E = ⟨S', A', δ', R', enc, dec, B'⟩ can be obtained by forbidding the agent to observe the state signal and by setting S' = S, A' = A, δ' = δ, R' = R, and B' = B. It remains to show that there are polynomial time algorithms for encoding and decoding such that dec(enc(R(s, a), s)) = (R(s, a), s) holds; this was done in Sect. 3.2. Finally, as before, a policy for M that results in a return of at least B will result in the same return in E.
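For concreteness, a hedged sketch of the two constructions, reusing the hypothetical MDP class and make_codec from the sketches above (the ERBMDP container and all names are ours, not the paper's):

from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class ERBMDP:
    states: List[str]
    actions: List[str]
    delta: Dict
    reward: Dict
    start: str
    enc: Callable
    dec: Callable

def erbmdp_to_mdp(e: ERBMDP) -> MDP:
    # Reduction 1 (ERBMDP <=_p MDP): drop enc/dec and let the agent observe
    # states directly; S', A', delta', R' (and the bound) are plain copies.
    return MDP(e.states, e.actions, e.delta, e.reward, e.start)

def mdp_to_erbmdp(m: MDP) -> ERBMDP:
    # Reduction 2 (MDP <=_p ERBMDP): copy all components and attach a
    # polynomial-time enc/dec pair; the agent then sees only engineered rewards.
    enc, dec = make_codec(m.states)
    return ERBMDP(m.states, m.actions, m.delta, m.reward, m.start, enc, dec)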
It is easy to see that the same result also holds for fully observable multiagent
settings, i.e. stochastic games and engineered reward-based stochastic games, as
they are constructed analogously to the single agent case using joint actions.