Fig. 2. Visualization of a POMDP (based on Kaelbling et al. [9])
benefit from information that is not observable in an ordinary POMDP. Accordingly, engineered reward-based partially observable Markov decision processes (ERBPOMDP) establish a richer concept than ordinary POMDPs. Despite the intended partial observability, an agent in an ERBPOMDP is able to obtain additional information and thus can easily exploit the framework. Obviously, encoding additional information into an engineered reward is explicitly not desired in a model based on the partial observability assumption.
In order to (formally) prevent exploitation by engineered rewards, we propose to slightly adjust existing models. As stated in the preceding sections, the key requirement that enables agents to work with additional information is that it must be possible to efficiently decode engineered rewards. More formally, it is required that agents are able to decompose an engineered reward into its components in polynomial time. To prevent exploitation of partially observable models by agents that obtain global information from a global engineered reward, we shall demand one additional property of the reward function in existing models:

Property 1. Besides the true reward, reward signals in partially observable domains are not allowed to carry additional polynomial-time decodable information.

Note that this property intentionally does not explicitly define what "additional information" is, as this depends on the considered model. We will come back to this issue later in Sect. 7.
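To make the notion of polynomial-time decodability concrete, the following sketch (our own illustration, not part of the cited models; the function names and the base-1000 packing scheme are hypothetical) shows how a single scalar reward could carry an otherwise unobservable state identifier next to the true reward, and how an agent could recover both parts with trivial arithmetic. Property 1 forbids exactly this kind of reward signal.

```python
# Hypothetical sketch: one way an "engineered" reward could smuggle extra,
# polynomial-time decodable information alongside the true reward -- the
# behaviour that Property 1 rules out.

def encode_engineered_reward(true_reward: int, hidden_state_id: int, base: int = 1000) -> int:
    """Pack the true reward and a hidden state id into one scalar.

    Assumes 0 <= true_reward < base; both arguments and 'base' are illustrative.
    """
    return hidden_state_id * base + true_reward


def decode_engineered_reward(signal: int, base: int = 1000) -> tuple[int, int]:
    """Recover (true_reward, hidden_state_id) from the packed scalar.

    Decoding is constant-time arithmetic, hence trivially polynomial: an agent
    receiving such a signal could read off the hidden state id and thereby
    circumvent the partial observability assumption.
    """
    return signal % base, signal // base


# Example: the agent observes only the scalar 42007, yet can decode both parts.
reward_signal = encode_engineered_reward(true_reward=7, hidden_state_id=42)
assert decode_engineered_reward(reward_signal) == (7, 42)
```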
6 Example
Next, we consider an example application modeled as a decentralized partially observable Markov decision process, which, based on [2], is defined as follows:
Definition 5 (Dec-POMDP). A decentralized partially-observable Markov decision process (Dec-POMDP) is defined by a tuple ⟨S, A, δ, R, Ω, O⟩, where

- S is a finite set of states with distinguished initial state s_0.
- A = ×_{i∈AG} A_i is a joint action set, where A_i denotes the actions of agent i.
- δ is a state transition function, where δ(s, a, s′) denotes the probability of transitioning from state s to state s′ when joint action a ∈ A is executed.
- R is a reward function. R(s, a, s′) is the reward obtained from executing joint action a ∈ A in state s and transitioning to state s′.
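For readers who prefer code, the following sketch mirrors the tuple from Definition 5 as a plain data structure. It is our own illustration: the class and field names, type aliases, and the toy instance are assumptions, and the observation set Ω and observation function O are filled in according to the usual Dec-POMDP reading of the tuple, since their formal items are not repeated here.

```python
# Illustrative container for the Dec-POMDP tuple <S, A, delta, R, Omega, O>.
# All names are our own; the definition above is mathematical, not code.

from dataclasses import dataclass
from typing import Callable, List, Tuple

State = str
JointAction = Tuple[str, ...]        # one action per agent in AG
JointObservation = Tuple[str, ...]   # one observation per agent in AG


@dataclass
class DecPOMDP:
    states: List[State]                                        # S
    initial_state: State                                       # s_0 in S
    joint_actions: List[JointAction]                           # A = x_{i in AG} A_i
    transition: Callable[[State, JointAction, State], float]   # delta(s, a, s')
    reward: Callable[[State, JointAction, State], float]       # R(s, a, s')
    observations: List[JointObservation]                       # Omega (assumed: joint observations)
    observation_fn: Callable[[JointAction, State, JointObservation], float]  # O (assumed: Pr(o | a, s'))


# Tiny two-agent toy instance; all numbers are placeholders.
toy = DecPOMDP(
    states=["s0", "s1"],
    initial_state="s0",
    joint_actions=[("a", "a"), ("a", "b")],
    transition=lambda s, a, s2: 0.5,
    reward=lambda s, a, s2: 1.0 if s2 == "s1" else 0.0,
    observations=[("o1", "o1")],
    observation_fn=lambda a, s2, o: 1.0,
)
```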