On the Power of Global Reward Signals in
Reinforcement Learning
Thomas Kemmerich 1 and Hans Kleine Büning 2
1 International Graduate School Dynamic Intelligent Systems
University of Paderborn,
33095 Paderborn, Germany
2 Department of Computer Science
University of Paderborn,
33095 Paderborn, Germany
{kemmerich,kbcsl}@uni-paderborn.de
Abstract. Reinforcement learning is investigated in various models, involving single and multiagent settings as well as fully or partially observable domains. Although such models differ in several aspects, their basic approach is identical: agents obtain a state observation and a global reward signal from an environment and execute actions which in turn influence the environment state. In this work, we discuss the role of such global reward signals. We present a concept that does not provide a visible environment state but only offers an engineered numerical reward. We prove that this approach has the same computational complexity and expressive power as ordinary fully observable models, but that it allows assumptions of models with partial observability to be infringed. To avoid such infringements, we then argue that rewards should never contain additional polynomial-time decodable information beyond the true reward value.
Keywords: reinforcement learning, global reward, conceptual models,
partial observability.
1 Introduction
Reinforcement learning in single and multiagent systems (MAS) can be realized
based on different formal models. The model choice depends on the assumed
agent abilities or on the requirements of the underlying problem domain. A large body of work deals with Markov decision processes (MDP) or stochastic games (SG), where agents are assumed to observe the entire environment state as well as the actions of other agents (see, e.g., [19] and [4] for introductions). In contrast, in partially observable MDPs [9] or partially observable SGs [3], it is assumed that agents cannot observe everything but only perceive (small) excerpts of the environment. These models are also used for planning and learning under uncertainty.
Despite different assumptions, e.g., on observability, the basic approach is the same in all models: agents make an observation, decide to execute a specific action that changes the state of the environment, and finally obtain a reward.
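To make this interaction loop concrete, the following is a minimal sketch in Python. It is not taken from the paper: the Environment and Agent interfaces, the toy two-state dynamics, the reward definition, and all parameter values are illustrative assumptions chosen only to show the observe-act-reward cycle in the fully observable (MDP-style) case.

```python
import random


class Environment:
    """Toy two-state environment; the observation equals the full state (MDP-style)."""

    def __init__(self):
        self.state = 0

    def observe(self):
        # Fully observable case: the agent sees the complete environment state.
        return self.state

    def step(self, action):
        # The executed action influences the environment state ...
        self.state = (self.state + action) % 2
        # ... and a global reward signal is returned to the agent.
        return 1.0 if self.state == 1 else 0.0


class Agent:
    """Tabular Q-learning agent with an epsilon-greedy policy (illustrative choice)."""

    def __init__(self, actions=(0, 1), alpha=0.1, gamma=0.9, epsilon=0.1):
        self.q = {}
        self.actions = actions
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

    def act(self, obs):
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q.get((obs, a), 0.0))

    def update(self, obs, action, reward, next_obs):
        best_next = max(self.q.get((next_obs, a), 0.0) for a in self.actions)
        old = self.q.get((obs, action), 0.0)
        self.q[(obs, action)] = old + self.alpha * (reward + self.gamma * best_next - old)


env, agent = Environment(), Agent()
for _ in range(1000):
    obs = env.observe()        # 1. the agent makes an observation
    action = agent.act(obs)    # 2. it decides to execute a specific action
    reward = env.step(action)  # 3. the action changes the state and yields a reward
    agent.update(obs, action, reward, env.observe())
```

The same loop structure applies to the partially observable models mentioned above; only the observation returned to the agent would then be an excerpt of the state rather than the full state.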
 