Fig. 2. Visualization of a POMDP (based on Kaelbling et al. [9])
benefit from information that is not observable in an ordinary POMDP. Accordingly, engineered reward-based partially observable Markov decision processes (ERBPOMDP) establish a richer concept than ordinary POMDPs. Despite the intended partial observability, an agent in an ERBPOMDP is able to obtain additional information and thus can easily exploit the framework. Obviously, encoding additional information into an engineered reward is explicitly not desired in a model based on the partial observability assumption.
In order to (formally) prevent exploitation by engineered rewards, we propose to slightly adjust existing models. As stated in the preceding sections, the key requirement that enables agents to work with additional information is that it must be possible to efficiently decode engineered rewards. More formally, it is required that agents are able to decompose an engineered reward into its components in polynomial time. To prevent exploitation of partially observable models by agents that obtain global information from a global engineered reward, we shall demand one additional property of the reward function in existing models:

Property 1. Besides the true reward, reward signals in partially observable domains are not allowed to carry additional polynomial-time decodable information.

Note that this property intentionally does not explicitly define what "additional information" is, as this depends on the considered model. We will come back to this issue later in Sect. 7.
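To make the notion of polynomial-time decodability concrete, the following sketch (our own illustration, not part of the cited models; the function names and the base-1000 packing scheme are hypothetical) shows how a single scalar reward could carry an otherwise unobservable state identifier next to the true reward, and how an agent could recover both parts with trivial arithmetic. Property 1 forbids exactly this kind of reward signal.

```python
# Hypothetical sketch: one way an "engineered" reward could smuggle extra,
# polynomial-time decodable information alongside the true reward -- the
# behaviour that Property 1 rules out.

def encode_engineered_reward(true_reward: int, hidden_state_id: int, base: int = 1000) -> int:
    """Pack the true reward and a hidden state id into one scalar.

    Assumes 0 <= true_reward < base; both arguments and 'base' are illustrative.
    """
    return hidden_state_id * base + true_reward


def decode_engineered_reward(signal: int, base: int = 1000) -> tuple[int, int]:
    """Recover (true_reward, hidden_state_id) from the packed scalar.

    Decoding is constant-time arithmetic, hence trivially polynomial: an agent
    receiving such a signal could read off the hidden state id and thereby
    circumvent the partial observability assumption.
    """
    return signal % base, signal // base


# Example: the agent observes only the scalar 42007, yet can decode both parts.
reward_signal = encode_engineered_reward(true_reward=7, hidden_state_id=42)
assert decode_engineered_reward(reward_signal) == (7, 42)
```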
6 Example
Next, we consider an example application modeled as a decentralized partially observable Markov decision process, which, based on [2], is defined as follows:
Definition 5 (Dec-POMDP). A decentralized partially-observable Markov decision process (Dec-POMDP) is defined by a tuple ⟨S, A, δ, R, Ω, O⟩, where

- S is a finite set of states with distinguished initial state s_0.
- A = ×_{i∈AG} A_i is a joint action set, where A_i denotes the actions of agent i.
- δ is a state transition function, where δ(s, a, s′) denotes the probability of transitioning from state s to state s′ when joint action a ∈ A is executed.
- R is a reward function. R(s, a, s′) is the reward obtained from executing joint action a ∈ A in state s and transitioning to state s′.
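For readers who prefer code, the following sketch mirrors the tuple from Definition 5 as a plain data structure. It is our own illustration: the class and field names, type aliases, and the toy instance are assumptions, and the observation set Ω and observation function O are filled in according to the usual Dec-POMDP reading of the tuple, since their formal items are not repeated here.

```python
# Illustrative container for the Dec-POMDP tuple <S, A, delta, R, Omega, O>.
# All names are our own; the definition above is mathematical, not code.

from dataclasses import dataclass
from typing import Callable, List, Tuple

State = str
JointAction = Tuple[str, ...]        # one action per agent in AG
JointObservation = Tuple[str, ...]   # one observation per agent in AG


@dataclass
class DecPOMDP:
    states: List[State]                                        # S
    initial_state: State                                       # s_0 in S
    joint_actions: List[JointAction]                           # A = x_{i in AG} A_i
    transition: Callable[[State, JointAction, State], float]   # delta(s, a, s')
    reward: Callable[[State, JointAction, State], float]       # R(s, a, s')
    observations: List[JointObservation]                       # Omega (assumed: joint observations)
    observation_fn: Callable[[JointAction, State, JointObservation], float]  # O (assumed: Pr(o | a, s'))


# Tiny two-agent toy instance; all numbers are placeholders.
toy = DecPOMDP(
    states=["s0", "s1"],
    initial_state="s0",
    joint_actions=[("a", "a"), ("a", "b")],
    transition=lambda s, a, s2: 0.5,
    reward=lambda s, a, s2: 1.0 if s2 == "s1" else 0.0,
    observations=[("o1", "o1")],
    observation_fn=lambda a, s2, o: 1.0,
)
```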