5 Partial Observability
The theoretical results concerning the equivalence of MDPs/SGs and their engineered
reward-based counterparts are not surprising since, by definition, agents are able
to observe the state and the actions of the other agents (see e.g. [19,4]). Accordingly,
no additional information can be added by using engineered rewards. In the area
of planning and learning, so-called partially observable Markov decision processes
(POMDPs) [9], decentralized POMDPs (Dec-POMDPs) [2,17,16], and partially ob-
servable stochastic games (POSGs) [3] are used to model systems in which agents
are unable to observe the full state of the environment, or obtain only noisy, possibly
erroneous information. As in MDPs/SGs, these models also supply agents
with a global reward and some form of state information in the shape of observations.
We now investigate such settings and, as before, concentrate on single-agent
settings by considering POMDPs, which can formally be defined as follows [9]:
Definition 4 (POMDP). A partially observable Markov decision process can
be described as a tuple ⟨S, A, δ, R, Ω, O⟩, where S, A, δ, R are defined as in a
Markov decision process and
- Ω is a finite set of observations the agent can make from the environment.
- O : S × A → Π(Ω) is an observation function that returns a probability
  distribution over possible observations. Let O(s, a, o) denote the probability
  that observation o ∈ Ω is observed after action a was executed in state s.
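Purely as an illustration, the tuple of Definition 4 can be written down as a small data structure. The sketch below is in Python; the class and field names (POMDP, delta, observe, etc.) are hypothetical, and the probability distributions Π(S) and Π(Ω) are represented as dictionaries mapping outcomes to probabilities.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Hashable, Set

State = Hashable
Action = Hashable
Obs = Hashable

@dataclass
class POMDP:
    """Container for the tuple <S, A, delta, R, Omega, O> of Definition 4."""
    states: Set[State]                                     # S: finite state set
    actions: Set[Action]                                   # A: finite action set
    delta: Callable[[State, Action], Dict[State, float]]   # δ: S × A → Π(S)
    reward: Callable[[State, Action], float]               # R(s, a)
    observations: Set[Obs]                                 # Ω: finite observation set
    observe: Callable[[State, Action], Dict[Obs, float]]   # O: S × A → Π(Ω)
```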
Although an agent in this domain can only make observations of the current
environment state, it still aims at maximizing the return based on the rewards
it receives. Figure 2 visualizes a general POMDP. The agent perceives an obser-
vation based on the current state of the environment and estimates the actual
environment state using its own history of observations and actions. This estimate
is called the belief state and is computed in the state estimation (SE) element
shown in the figure. More details on partially observable domains and solution
techniques can be found, e.g., in [2,3,9,16,17].
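The state estimation step can be made concrete with the standard Bayesian belief update used for POMDPs [9]: given the previous belief b, the executed action a, and the received observation o, each state is assigned a probability proportional to its observation likelihood times the probability of reaching it. The sketch below continues the hypothetical POMDP structure above and assumes the common convention that the observation probability is evaluated in the successor state; belief_update is an illustrative name, not a function from the cited literature.

```python
from typing import Dict

def belief_update(pomdp: POMDP, belief: Dict[State, float],
                  action: Action, obs: Obs) -> Dict[State, float]:
    """One step of state estimation (the SE element in Fig. 2):
    b'(s') ∝ O(s', a, o) * Σ_s δ(s, a)(s') * b(s)."""
    new_belief: Dict[State, float] = {}
    for s_next in pomdp.states:
        # Probability of ending up in s_next under the old belief and action a.
        reach = sum(belief.get(s, 0.0) * pomdp.delta(s, action).get(s_next, 0.0)
                    for s in pomdp.states)
        # Weight by the likelihood of the observation actually received.
        new_belief[s_next] = pomdp.observe(s_next, action).get(obs, 0.0) * reach
    norm = sum(new_belief.values())
    # Normalize; if the observation has zero probability the belief is undefined.
    return {s: p / norm for s, p in new_belief.items()} if norm > 0 else new_belief
```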
As can be seen in Fig. 2, the agent obtains a global reward that is calculated
within the environment. Note that in order to formally define a reward function,
states and actions have to be known. For partially observable domains with more
than one agent (e.g. Dec-POMDPs or POSGs), this also implies knowledge about
joint actions. Please note that all of this knowledge therefore has to be available at
the (central) environment level in order to define an ordinary reward function.
Since this information is available and because an engineered reward function
does not require more information, it is formally allowed and, according to
Sect. 3.2, possible to use polynomial-time engineered rewards in state-of-the-
art models. Thus, a reward signal that encodes all information available at the
central environment level can be constructed.
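As a toy illustration of how a reward signal could carry such extra information (this is not the construction of Sect. 3.2, which is not reproduced here), one can pack an index of the true environment state into a fractional digit block of the ordinary reward, assuming integer-valued base rewards; an agent that knows the encoding can then recover the state from the reward value alone. The sketch continues the hypothetical POMDP structure above, and all function names are illustrative.

```python
import math

def engineered_reward(pomdp: POMDP, state_index: Dict[State, int],
                      s: State, a: Action) -> float:
    """Toy encoding: ordinary reward plus the state index hidden in a
    fractional digit block (assumes R(s, a) is integer-valued)."""
    base = pomdp.reward(s, a)
    width = len(str(len(state_index)))        # digits needed for the index
    return base + state_index[s] / (10 ** width)

def decode_state(state_index: Dict[State, int], r: float) -> State:
    """Recover the encoded state from an engineered reward value."""
    width = len(str(len(state_index)))
    idx = round((r - math.floor(r)) * 10 ** width)   # extract fractional block
    inverse = {i: s for s, i in state_index.items()}
    return inverse[idx]
```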
Since a POMDP with engineered rewards can be constructed in polynomial
time, the computational complexity of this extension is equal to that of the ordinary
POMDP model. However, as engineered reward signals can encode arbitrary
additional information, an agent in an engineered reward-based POMDP can
be supplied with information about the environment state that it could not
otherwise observe.