not depend on the state and action history but only on the current state and
action. Various models for RL that are based on the Markov property have been
investigated in the past; see, e.g., [2,3,4,9,16,17] and the references therein for
introductory descriptions. Notably, in all these models, agents obtain a numerical global
reward signal which they use to learn a policy. Thus, the results of this paper
might also be extended to and investigated in other models.
A technique related to the approach followed in this work is known as
reward shaping. This technique aims to speed up the learning process by supplying
an agent with additional rewards that guide it towards the learning goal. Such
rewards must be designed carefully so as to avoid learning undesired policies and
to avoid including non-observable information. The work of Melo and Ribeiro [13],
for instance, encodes entropy information of a fully observable MDP into the
rewards to solve a robotic navigation task that is modeled as a POMDP. Ng et
al. [15] investigate the requirements under which shaped rewards maintain optimal
policies. While the goal of reward shaping is to guide the learning process,
our concept makes it possible to transmit potentially non-observable information
through the engineered reward. Thus, agents in our approach may use such
additional knowledge but will rely on the original rewards for learning. Recent work also deals
with reward shaping in multiagent systems [7].
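
As shown by Ng et al. [15], shaping terms of the potential-based form F(s, a, s') = γΦ(s') − Φ(s), for some potential function Φ over states, preserve the set of optimal policies. The following Python sketch illustrates how such a term would be added to the environment reward; the grid-world potential, goal cell, and discount factor are our own illustrative assumptions and are not taken from [15].

# Minimal sketch of potential-based reward shaping in the sense of Ng et al. [15].
# The shaping term F(s, a, s') = gamma * Phi(s') - Phi(s) is added to the
# environment reward; shaping terms of this form preserve the optimal policies.

GAMMA = 0.95  # illustrative discount factor

def potential(state, goal=(4, 4)):
    # Illustrative potential: negative Manhattan distance to a hypothetical goal cell.
    return -(abs(state[0] - goal[0]) + abs(state[1] - goal[1]))

def shaped_reward(env_reward, state, next_state, gamma=GAMMA):
    # Reward actually handed to the learner: original reward plus shaping term.
    shaping = gamma * potential(next_state) - potential(state)
    return env_reward + shaping

# Example: a transition that moves closer to the goal receives a small bonus,
# while the underlying environment reward is left untouched.
print(shaped_reward(0.0, state=(0, 0), next_state=(0, 1)))  # approx. 1.35
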
Another aspect of global rewards concerns the question of how well they can
support learning in general. In particular, learning from a common global reward
in large MAS is hard, as it often involves problems such as coordination
and credit assignment. For MAS, Bagnell and Ng [1] prove a worst-case
lower bound indicating that learning from a global reward requires roughly as
many examples as there are agents in the system. They show that learning from
local rewards can be faster, requiring only logarithmically many trajectories.
Chang et al. [5] consider the role of global rewards in MAS. From an individual
agent's point of view, they model the global reward as the sum of the agent's
true reward and a random Markov process that describes the influence of other
factors or agents on the global reward. Using Kalman filtering, they develop a
simple approach to estimate local rewards from a single global reward signal.
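
As a rough illustration of this idea, the sketch below tracks the contribution of the rest of the system with a one-dimensional Kalman filter and uses the filtered residual as a local reward estimate. It is only a simplified scalar caricature of the approach in [5]; the class name, noise variances, and the toy example are our own assumptions.

import numpy as np

# Simplified sketch inspired by Chang et al. [5]: the observed global reward is
# treated as g_t = r_t + b_t, where r_t is the agent's unknown local reward and
# b_t is a slowly drifting contribution of the other agents, modeled as a random
# walk. A scalar Kalman filter tracks b_t; the residual g_t - b_hat serves as a
# local reward estimate. All constants are illustrative, not taken from [5].

class GlobalRewardFilter:
    def __init__(self, drift_var=0.01, obs_var=1.0):
        self.b_hat = 0.0            # current estimate of the background term b_t
        self.p = 1.0                # variance of that estimate
        self.drift_var = drift_var  # variance of the random walk driving b_t
        self.obs_var = obs_var      # variance attributed to the agent's own reward

    def step(self, global_reward):
        # Predict: the background term drifts, so its uncertainty grows.
        self.p += self.drift_var
        # Update: treat the global reward as a noisy measurement of b_t, the
        # "noise" being the agent's own contribution r_t.
        k = self.p / (self.p + self.obs_var)       # Kalman gain
        self.b_hat += k * (global_reward - self.b_hat)
        self.p *= 1.0 - k
        # Estimated local reward: what remains after removing the background.
        return global_reward - self.b_hat

# Toy example: the agent's true reward alternates between 1 and 0 while the
# other agents contribute a drifting offset that the filter gradually absorbs.
rng = np.random.default_rng(0)
filt = GlobalRewardFilter()
background = 5.0
for t in range(5):
    background += rng.normal(0.0, 0.1)
    local = 1.0 if t % 2 == 0 else 0.0
    print(round(filt.step(local + background), 2))
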
3 Concepts
Next, we briefly review the state-of-the-art concept for reinforcement
learning. It is used in the aforementioned models and provides both a state and a
reward signal. Thereafter, we present a novel concept that makes do with a single
engineered reward signal. Since this concept can easily be extended to MAS,
we will concentrate on single-agent settings and refer to the multiagent case where
necessary. Note that throughout this text, we make common assumptions such as
bounded rewards, finite state sets, and finite state representations.
3.1 State of the Art
Reinforcement learning problems are mostly modeled as Markov decision
processes (single-agent systems) or as stochastic games (multiagent systems) [4],