from the environment. To put it simply, such a reward rates the quality of the
performed action with respect to reaching a certain goal.
In this work, we will concentrate on the power of such environment-based rewards, which play the same role in all models. Note that we regard anything that is not controllable by the agent as part of the environment. Basically,
there are two different possibilities for calculating the reward [14]. First, a reward function can be built into the agent, i.e., the agent itself calculates rewards based on the observed state and its action. This approach is usually followed in the literature, for instance in robotic teams [18]. The second option, in which we are particularly interested, is to let the environment calculate the rewards and give them to the agent [2]. In this setting, the environment can be regarded as a central instance, or an entity like a teacher, that gives the rewards. Consider, for example, a telecommunication system involving large numbers of mobile devices with a local view of the system and several radio base stations that are connected through a backbone network. In this scenario, a central computer would be able to measure a global performance indicator, such as a load factor, that should be optimized by the mobile devices. Thus, the computer could act as the entity that calculates the rewards given to the agents (mobile devices). We will show how such a setting could be exploited by engineered reward signals that not only represent a numerical reward but also encode additional information which should not be available locally. It is then proven that engineered rewards do not add expressiveness in fully observable domains. However, they enable agents in partially observable domains, such as the telecommunication setting introduced above, to obtain information which the agents are not supposed to get.
To formally avoid this, we then argue that existing models should demand rewards that do not contain polynomial-time decodable information. Note that this work thus does not claim to present novel engineered-reward-based algorithms that deal with well-known problems of reinforcement learning, e.g., coordination or convergence issues, but rather investigates the power of global rewards.
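As a rough illustration of this idea, the following Python sketch (ours, not taken from the paper; the load-factor reward and the three-digit side channel are hypothetical choices) shows how a central instance could fold extra, locally unavailable information into an otherwise ordinary numerical reward, and how an agent could recover it in polynomial time:

# Hypothetical sketch: the central computer rewards low load, but also hides
# an integer (e.g. the number of active devices, which an agent should not
# be able to observe locally) in the low-order decimal digits of the reward.

def encode_reward(load_factor: float, hidden_count: int) -> float:
    """Base reward is the negated load factor (rounded to 3 decimals);
    the hidden integer is appended as three further decimal digits."""
    base = round(-load_factor, 3)                # the "honest" numerical reward
    return base + (hidden_count % 1000) * 1e-6   # smuggled side channel

def decode_hidden(reward: float) -> int:
    """Recover the hidden integer from the engineered reward in polynomial
    (here constant) time."""
    return int(round((reward - round(reward, 3)) * 1e6)) % 1000

if __name__ == "__main__":
    r = encode_reward(load_factor=0.42, hidden_count=137)
    print(r)                 # approximately -0.419863
    print(decode_hidden(r))  # 137

In a fully observable domain such a channel adds nothing an agent could not compute itself, but in a partially observable one it leaks exactly the kind of information that the restriction to rewards without polynomial-time decodable content is meant to exclude.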
The next section briefly presents related work. Then, in Sect. 3, we review the state-of-the-art concepts used in reinforcement learning and introduce the engineered-reward-based concept. In Sect. 4, the latter is shown to be equivalent to the ordinary fully observable model in terms of complexity and expressiveness. Next, in Sect. 5, we consider partially observable domains in conjunction with engineered rewards and show that an additional assumption on reward signals is necessary to prevent exploitation via encoded information. Then, in Sect. 6, we provide a clarifying example and discuss how engineered rewards can be used beneficially in problems with special structure. Finally, we conclude in Sect. 7.
2 Related Work
First of all, note that we assume that the reader is familiar with reinforcement learning (RL). For an introduction, we refer to [10,14,19] for single-agent RL and to [4,16,17] for multiagent RL.
A key assumption of most models used in reinforcement learning is that the environment is Markovian. This means that transition probabilities do not depend on the history of previously visited states and executed actions, but only on the current state and the chosen action.
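In symbols (our shorthand for the standard Markov property, not a formula taken from this paper), this reads, for all time steps t:

\[
  \Pr\bigl(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots, s_0, a_0\bigr)
  = \Pr\bigl(s_{t+1} \mid s_t, a_t\bigr).
\]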