not depend on the state and action history but only on the current state and
action. Various models for RL that are based on the Markov property have been
investigated in the past; see, e.g., [2,3,4,9,16,17] and the references therein for
introductory descriptions. Notably, in all these models, agents obtain a numerical global
reward signal which they use to learn a policy. Thus, the results of this paper
might also be extended to and investigated in other models.
A technique related to the approach followed in this work is known as
reward shaping. This technique aims to speed up the learning process by supplying
an agent with additional rewards that guide it towards the learning goal. Such
rewards must be designed carefully so as to avoid learning undesired policies and
to avoid including non-observable information. The work of Melo and Ribeiro [13],
for instance, encodes entropy information of a fully observable MDP into the
rewards to solve a robotic navigation task that is modeled as a POMDP. Ng et
al. [15] investigate the requirements under which shaped rewards maintain optimal
policies. While the goal of reward shaping is to guide the learning process,
our concept makes it possible to transmit potentially non-observable information
through the engineered reward. Thus, agents in our approach may use such
additional knowledge but will rely on the original rewards for learning. Recent work also deals
with reward shaping in multiagent systems [7].
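
As shown by Ng et al. [15], shaping terms of the potential-based form F(s, a, s') = γΦ(s') − Φ(s), for some potential function Φ over states, preserve the set of optimal policies. The following Python sketch illustrates how such a term would be added to the environment reward; the grid-world potential, goal cell, and discount factor are our own illustrative assumptions and are not taken from [15].

# Minimal sketch of potential-based reward shaping in the sense of Ng et al. [15].
# The shaping term F(s, a, s') = gamma * Phi(s') - Phi(s) is added to the
# environment reward; shaping terms of this form preserve the optimal policies.

GAMMA = 0.95  # illustrative discount factor

def potential(state, goal=(4, 4)):
    # Illustrative potential: negative Manhattan distance to a hypothetical goal cell.
    return -(abs(state[0] - goal[0]) + abs(state[1] - goal[1]))

def shaped_reward(env_reward, state, next_state, gamma=GAMMA):
    # Reward actually handed to the learner: original reward plus shaping term.
    shaping = gamma * potential(next_state) - potential(state)
    return env_reward + shaping

# Example: a transition that moves closer to the goal receives a small bonus,
# while the underlying environment reward is left untouched.
print(shaped_reward(0.0, state=(0, 0), next_state=(0, 1)))  # approx. 1.35
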
Another aspect of global rewards concerns the question of how well they can
support learning in general. In particular, learning from a common global reward
in large MAS is hard, as it often involves problems such as coordination
and credit assignment. For MAS, Bagnell and Ng [1] prove a worst-case
lower bound indicating that learning from a global reward requires roughly as
many examples as there are agents in the system. They show that learning from
local rewards can be faster, requiring only logarithmically many trajectories.
Chang et al. [5] consider the role of global rewards in MAS. From an individual
agent's point of view, they model the global reward as the sum of the agent's
true reward and a random Markov process that describes the influence of other
factors or agents on the global reward. Using Kalman filtering, they develop a
simple approach to estimate local rewards from a single global reward signal.
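
As a rough illustration of this idea, the sketch below tracks the contribution of the rest of the system with a one-dimensional Kalman filter and uses the filtered residual as a local reward estimate. It is only a simplified scalar caricature of the approach in [5]; the class name, noise variances, and the toy example are our own assumptions.

import numpy as np

# Simplified sketch inspired by Chang et al. [5]: the observed global reward is
# treated as g_t = r_t + b_t, where r_t is the agent's unknown local reward and
# b_t is a slowly drifting contribution of the other agents, modeled as a random
# walk. A scalar Kalman filter tracks b_t; the residual g_t - b_hat serves as a
# local reward estimate. All constants are illustrative, not taken from [5].

class GlobalRewardFilter:
    def __init__(self, drift_var=0.01, obs_var=1.0):
        self.b_hat = 0.0            # current estimate of the background term b_t
        self.p = 1.0                # variance of that estimate
        self.drift_var = drift_var  # variance of the random walk driving b_t
        self.obs_var = obs_var      # variance attributed to the agent's own reward

    def step(self, global_reward):
        # Predict: the background term drifts, so its uncertainty grows.
        self.p += self.drift_var
        # Update: treat the global reward as a noisy measurement of b_t, the
        # "noise" being the agent's own contribution r_t.
        k = self.p / (self.p + self.obs_var)       # Kalman gain
        self.b_hat += k * (global_reward - self.b_hat)
        self.p *= 1.0 - k
        # Estimated local reward: what remains after removing the background.
        return global_reward - self.b_hat

# Toy example: the agent's true reward alternates between 1 and 0 while the
# other agents contribute a drifting offset that the filter gradually absorbs.
rng = np.random.default_rng(0)
filt = GlobalRewardFilter()
background = 5.0
for t in range(5):
    background += rng.normal(0.0, 0.1)
    local = 1.0 if t % 2 == 0 else 0.0
    print(round(filt.step(local + background), 2))
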
3 Concepts
Next, we briefly review the state-of-the-art concept for reinforcement
learning. It is used in the aforementioned models and provides both a state and a
reward signal. Thereafter, we present a novel concept that makes do with a single
engineered reward signal. Since this concept can easily be extended to MAS,
we will concentrate on single-agent settings and refer to the multiagent case where
necessary. Note that throughout this text, we make common assumptions such as
bounded rewards, finite state sets, and finite state representations.
3.1 State of the Art
Reinforcement learning problems are mostly modeled as Markov decision
processes (single-agent systems) or as stochastic games (multiagent systems) [4],