from the environment. To put it simply, such a reward rates the quality of the
performed action with respect to reaching a certain goal.
In this work, we will concentrate on the power of such environment-based rewards, which play the same role in all models. Note that we regard anything that is not controllable by the agent as part of the environment. Basically,
there are two different possibilities for calculating the reward [14]. First, a reward function can be built into the agent, i.e., the agent itself calculates rewards based on the observed state and its action. This approach is usually followed in the literature, for instance in robotic teams [18]. The second option, in which we are particularly interested, is to let the environment calculate the rewards and give them to the agent [2]. In this setting, the environment can be regarded as a central instance, or an entity like a teacher, that gives the rewards. Consider, for example, a telecommunication system involving large numbers of mobile devices with a local view of the system and several radio base stations that are connected through a backbone network. In this scenario, a central computer would be able to measure a global performance indicator, such as a load factor, that should be optimized by the mobile devices. Thus, the computer could act as the entity that calculates the rewards given to the agents (mobile devices). We will show how such a setting could be exploited by engineered reward signals that not only represent a numerical reward but also encode additional information which should not be available locally. It is then proven that engineered rewards do not add expressiveness in fully observable domains. However, they enable agents in partially observable domains, such as the telecommunication setting introduced above, to obtain information which the agents are not supposed to get.
To formally avoid this, we then argue that existing models should demand rewards that do not contain polynomial-time decodable information. Note that this work thus does not claim to present novel engineered-reward-based algorithms that deal with well-known problems of reinforcement learning, e.g., coordination or convergence issues, but rather investigates the power of global rewards.
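As a rough illustration of this idea, the following Python sketch (ours, not taken from the paper; the load-factor reward and the three-digit side channel are hypothetical choices) shows how a central instance could fold extra, locally unavailable information into an otherwise ordinary numerical reward, and how an agent could recover it in polynomial time:

# Hypothetical sketch: the central computer rewards low load, but also hides
# an integer (e.g. the number of active devices, which an agent should not
# be able to observe locally) in the low-order decimal digits of the reward.

def encode_reward(load_factor: float, hidden_count: int) -> float:
    """Base reward is the negated load factor (rounded to 3 decimals);
    the hidden integer is appended as three further decimal digits."""
    base = round(-load_factor, 3)                # the "honest" numerical reward
    return base + (hidden_count % 1000) * 1e-6   # smuggled side channel

def decode_hidden(reward: float) -> int:
    """Recover the hidden integer from the engineered reward in polynomial
    (here constant) time."""
    return int(round((reward - round(reward, 3)) * 1e6)) % 1000

if __name__ == "__main__":
    r = encode_reward(load_factor=0.42, hidden_count=137)
    print(r)                 # approximately -0.419863
    print(decode_hidden(r))  # 137

In a fully observable domain such a channel adds nothing an agent could not compute itself, but in a partially observable one it leaks exactly the kind of information that the restriction to rewards without polynomial-time decodable content is meant to exclude.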
The next section briefly presents related work. Then, in Sect. 3, we review the state-of-the-art concepts used in reinforcement learning and introduce the engineered-reward-based concept. In Sect. 4, the latter is shown to be equivalent to the ordinary fully observable model in terms of complexity and expressiveness. Next, in Sect. 5, we consider partially observable domains in conjunction with engineered rewards and show that an additional assumption on reward signals is necessary to prevent exploitation via encoded information. Then, in Sect. 6, we provide a clarifying example and discuss how engineered rewards can be used beneficially in problems with special structure. Finally, we conclude in Sect. 7.
2 Related Work
First of all, note that we assume that the reader is familiar with reinforcement learning (RL). For an introduction, we refer to [10,14,19] for single-agent RL and to [4,16,17] for multiagent RL.
A key assumption of most models used in reinforcement learning is that the environment is Markovian. This means that transition probabilities do not depend on the history of previously visited states and executed actions, but only on the current state and the chosen action.
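In symbols (our shorthand for the standard Markov property, not a formula taken from this paper), this reads, for all time steps t:

\[
  \Pr\bigl(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots, s_0, a_0\bigr)
  = \Pr\bigl(s_{t+1} \mid s_t, a_t\bigr).
\]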