On the Power of Global Reward Signals in
Reinforcement Learning
Thomas Kemmerich 1 and Hans Kleine Büning 2
1 International Graduate School Dynamic Intelligent Systems
University of Paderborn,
33095 Paderborn, Germany
2 Department of Computer Science
University of Paderborn,
33095 Paderborn, Germany
{kemmerich,kbcsl}@uni-paderborn.de
Abstract. Reinforcement learning is investigated in various models, involving single and multiagent settings as well as fully or partially observable domains. Although such models differ in several aspects, their basic approach is identical: agents obtain a state observation and a global reward signal from an environment and execute actions which in turn influence the environment state. In this work, we discuss the role of such global reward signals. We present a concept that does not provide a visible environment state but only offers an engineered numerical reward. We prove that this approach has the same computational complexity and expressive power as ordinary fully observable models, but that it allows assumptions of models with partial observability to be infringed. To avoid such infringements, we then argue that rewards should never contain additional polynomial-time decodable information beyond the true reward value.
Keywords: reinforcement learning, global reward, conceptual models,
partial observability.
1 Introduction
Reinforcement learning in single and multiagent systems (MAS) can be realized
based on different formal models. The model choice depends on the assumed
agent abilities or on the requirements of the underlying problem domain. A large body of work deals with Markov decision processes (MDP) or stochastic games (SG), where agents are assumed to observe the entire environment state as well as the actions of other agents (see, e.g., [19] and [4] for introductions). In contrast, in partially observable MDPs [9] or partially observable SGs [3], it is assumed that agents cannot observe everything but only perceive (small) excerpts of the environment. These models are also used for planning and learning under uncertainty.
Despite different assumptions, e.g., on observability, the basic approach is the same in all models: agents make an observation, decide to execute a specific action that changes the state of the environment, and finally obtain a reward.
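To make this interaction loop concrete, the following is a minimal sketch in Python. It is not taken from the paper: the Environment and Agent interfaces, the toy two-state dynamics, the reward definition, and all parameter values are illustrative assumptions chosen only to show the observe-act-reward cycle in the fully observable (MDP-style) case.

```python
import random


class Environment:
    """Toy two-state environment; the observation equals the full state (MDP-style)."""

    def __init__(self):
        self.state = 0

    def observe(self):
        # Fully observable case: the agent sees the complete environment state.
        return self.state

    def step(self, action):
        # The executed action influences the environment state ...
        self.state = (self.state + action) % 2
        # ... and a global reward signal is returned to the agent.
        return 1.0 if self.state == 1 else 0.0


class Agent:
    """Tabular Q-learning agent with an epsilon-greedy policy (illustrative choice)."""

    def __init__(self, actions=(0, 1), alpha=0.1, gamma=0.9, epsilon=0.1):
        self.q = {}
        self.actions = actions
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

    def act(self, obs):
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q.get((obs, a), 0.0))

    def update(self, obs, action, reward, next_obs):
        best_next = max(self.q.get((next_obs, a), 0.0) for a in self.actions)
        old = self.q.get((obs, action), 0.0)
        self.q[(obs, action)] = old + self.alpha * (reward + self.gamma * best_next - old)


env, agent = Environment(), Agent()
for _ in range(1000):
    obs = env.observe()        # 1. the agent makes an observation
    action = agent.act(obs)    # 2. it decides to execute a specific action
    reward = env.step(action)  # 3. the action changes the state and yields a reward
    agent.update(obs, action, reward, env.observe())
```

The same loop structure applies to the partially observable models mentioned above; only the observation returned to the agent would then be an excerpt of the state rather than the full state.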
 