5 Partial Observability
The theoretical results concerning the equivalence of MDPs/SGs and their engineered
reward-based counterparts are not surprising since, by definition, agents are able
to observe the state and the actions of the other agents (see e.g. [19,4]). Accordingly,
no additional information can be added by using engineered rewards. In the area
of planning and learning, so-called partially observable Markov decision processes
(POMDPs) [9], decentralized POMDPs (Dec-POMDPs) [2,17,16], and partially ob-
servable stochastic games (POSGs) [3] are used to model systems in which agents
are unable to observe the full state of the environment, or obtain only noisy, possibly
erroneous information. As in MDPs/SGs, these models also supply agents
with a global reward and some form of state information in the shape of observations.
We now investigate such settings and, as before, concentrate on single-agent
settings by considering POMDPs, which can formally be defined as follows [9]:
Definition 4 (POMDP). A partially observable Markov decision process can
be described as a tuple ⟨S, A, δ, R, Ω, O⟩, where S, A, δ, R are defined as in a
Markov decision process and
- Ω is a finite set of observations the agent can make from the environment.
- O : S × A → Π(Ω) is an observation function that returns a probability
  distribution over possible observations. Let O(s, a, o) denote the probability
  that observation o ∈ Ω is observed after action a was executed in state s.
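Purely as an illustration, the tuple of Definition 4 can be written down as a small data structure. The sketch below is in Python; the class and field names (POMDP, delta, observe, etc.) are hypothetical, and the probability distributions Π(S) and Π(Ω) are represented as dictionaries mapping outcomes to probabilities.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Hashable, Set

State = Hashable
Action = Hashable
Obs = Hashable

@dataclass
class POMDP:
    """Container for the tuple <S, A, delta, R, Omega, O> of Definition 4."""
    states: Set[State]                                     # S: finite state set
    actions: Set[Action]                                   # A: finite action set
    delta: Callable[[State, Action], Dict[State, float]]   # δ: S × A → Π(S)
    reward: Callable[[State, Action], float]               # R(s, a)
    observations: Set[Obs]                                 # Ω: finite observation set
    observe: Callable[[State, Action], Dict[Obs, float]]   # O: S × A → Π(Ω)
```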
Although an agent in this domain can only make observations of the current
environment state, it still aims at maximizing the return based on the rewards
it receives. Figure 2 visualizes a general POMDP. The agent perceives an obser-
vation based on the current state of the environment and estimates the actual
environment state using its own history of observations and actions. This estimate
is called the belief state and is computed in the state estimation (SE) element
shown in the figure. More details on partially observable domains and solution
techniques can be found, e.g., in [2,3,9,16,17].
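The state estimation step can be made concrete with the standard Bayesian belief update used for POMDPs [9]: given the previous belief b, the executed action a, and the received observation o, each state is assigned a probability proportional to its observation likelihood times the probability of reaching it. The sketch below continues the hypothetical POMDP structure above and assumes the common convention that the observation probability is evaluated in the successor state; belief_update is an illustrative name, not a function from the cited literature.

```python
from typing import Dict

def belief_update(pomdp: POMDP, belief: Dict[State, float],
                  action: Action, obs: Obs) -> Dict[State, float]:
    """One step of state estimation (the SE element in Fig. 2):
    b'(s') ∝ O(s', a, o) * Σ_s δ(s, a)(s') * b(s)."""
    new_belief: Dict[State, float] = {}
    for s_next in pomdp.states:
        # Probability of ending up in s_next under the old belief and action a.
        reach = sum(belief.get(s, 0.0) * pomdp.delta(s, action).get(s_next, 0.0)
                    for s in pomdp.states)
        # Weight by the likelihood of the observation actually received.
        new_belief[s_next] = pomdp.observe(s_next, action).get(obs, 0.0) * reach
    norm = sum(new_belief.values())
    # Normalize; if the observation has zero probability the belief is undefined.
    return {s: p / norm for s, p in new_belief.items()} if norm > 0 else new_belief
```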
As can be seen in Fig. 2, the agent obtains a global reward that is calculated
within the environment. Note that in order to formally define a reward function,
states and actions have to be known. For partially observable domains with more
than one agent (e.g. Dec-POMDPs or POSGs), this also implies knowledge about
joint actions. Please note that all of this knowledge therefore has to be available at
the (central) environment level in order to define an ordinary reward function.
Since this information is available and because an engineered reward function
does not require more information, it is formally allowed and, according to
Sect. 3.2, possible to use polynomial-time engineered rewards in state-of-the-
art models. Thus, a reward signal that encodes all information available at the
central environment level can be constructed.
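As a toy illustration of how a reward signal could carry such extra information (this is not the construction of Sect. 3.2, which is not reproduced here), one can pack an index of the true environment state into a fractional digit block of the ordinary reward, assuming integer-valued base rewards; an agent that knows the encoding can then recover the state from the reward value alone. The sketch continues the hypothetical POMDP structure above, and all function names are illustrative.

```python
import math

def engineered_reward(pomdp: POMDP, state_index: Dict[State, int],
                      s: State, a: Action) -> float:
    """Toy encoding: ordinary reward plus the state index hidden in a
    fractional digit block (assumes R(s, a) is integer-valued)."""
    base = pomdp.reward(s, a)
    width = len(str(len(state_index)))        # digits needed for the index
    return base + state_index[s] / (10 ** width)

def decode_state(state_index: Dict[State, int], r: float) -> State:
    """Recover the encoded state from an engineered reward value."""
    width = len(str(len(state_index)))
    idx = round((r - math.floor(r)) * 10 ** width)   # extract fractional block
    inverse = {i: s for s, i in state_index.items()}
    return inverse[idx]
```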
Since a POMDP with engineered rewards can be constructed in polynomial
time, the computational complexity of this extension is equal to that of the ordinary
POMDP model. However, as engineered reward signals can encode arbitrary
additional information, an agent in an engineered reward-based POMDP can
be supplied with information about the environment state that it could not
otherwise observe.