easily reconstruct the code word code(s)#code(r). Finally, this word is split at the separator (#), and the resulting parts represent the original state and reward, respectively. Obviously, this decoding also runs in polynomial time. By construction it holds that dec(enc(R(s, a), s)) = (R(s, a), s). Clearly, other polynomial time approaches exist and can be used if they obey the requirements stated above.
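To make this concrete, here is a minimal Python sketch of such an encoding and decoding pair. Plain string representations stand in for the paper's code(·) function, and the names enc and dec mirror the text; both choices are illustrative assumptions rather than the authors' implementation.

def enc(reward: float, state: str) -> str:
    # Build the code word code(s)#code(r); the '#' separator is assumed
    # never to occur inside a code word. Runs in linear time.
    assert "#" not in state, "separator must not occur inside code words"
    return f"{state}#{reward}"

def dec(word: str) -> tuple[float, str]:
    # Split at the separator and recover the original reward and state,
    # again in linear (hence polynomial) time.
    state, reward = word.split("#")
    return float(reward), state

# By construction, dec(enc(R(s, a), s)) == (R(s, a), s):
assert dec(enc(4.2, "s17")) == (4.2, "s17")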
4 Complexity Results and Expressiveness
Based on [6] and [8], we first briefly recall some theoretical foundations. Complexity theory classifies computational problems into complexity classes. For convenience, important techniques are often designed for decision problems, i.e., problems whose solution is either yes or no. Reinforcement learning, however, is not a decision problem but an optimization problem, as agents seek to maximize the expected return. Hence, to investigate RL problems using complexity theory, we have to convert them into decision problems. A simple and common technique is to add a bound B to an optimization problem O and to ask whether O can be solved such that the optimization objective is at least (or at most) B. Clearly, such a converted problem is at most as hard as the original problem.
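To illustrate this conversion, the following Python sketch (our example, not the paper's) wraps a shortest-path solver, a minimization problem, into a decision problem that asks whether the objective is at most the bound B.

import heapq

def shortest_path_length(graph, src, dst):
    # Optimization problem O: minimize the path length from src to dst.
    # graph maps each node to a list of (neighbor, edge_weight) pairs.
    dist = {src: 0}
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            return d
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in graph.get(u, []):
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return float("inf")  # dst is unreachable

def shortest_path_decision(graph, src, dst, bound):
    # Decision version of O: can it be solved with objective at most B?
    return shortest_path_length(graph, src, dst) <= bound

graph = {"a": [("b", 2), ("c", 5)], "b": [("c", 1)]}
assert shortest_path_decision(graph, "a", "c", bound=3)      # a-b-c has length 3
assert not shortest_path_decision(graph, "a", "c", bound=2)

The decision version can be answered by solving the optimization problem once and comparing against B, which is why it is at most as hard as the original problem.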
Different complexity classes exist, among which P and NP are fundamental. Decision problems that belong to the class P are solvable in polynomial time, while those belonging to NP can be solved using nondeterministic polynomial time algorithms. It is known that P ⊆ NP and widely believed that P ≠ NP.

To show that two problems fall into the same complexity class, a technique called polynomial time reduction is used. Therefore, one needs a polynomial time reduction function f : O → O′ that i) transforms any instance o ∈ O of a decision problem into an instance f(o) ∈ O′, and which ii) ensures that o evaluates to yes if and only if f(o) evaluates to yes. We write O ≤p O′ to denote that a problem O is polynomial time reducible to a problem O′.
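To make the shape of such a reduction function tangible, the following Python sketch applies it to a classic pair of problems, INDEPENDENT SET and VERTEX COVER; this concrete example is ours and is not taken from the paper. The brute-force solvers are included only so that property ii) can be checked on a toy instance.

from itertools import combinations

def has_independent_set(graph, k):
    # Decision problem O: does the graph contain k pairwise non-adjacent vertices?
    vertices, edges = graph
    for cand in combinations(vertices, k):
        s = set(cand)
        if all(not (u in s and v in s) for (u, v) in edges):
            return True
    return False

def has_vertex_cover(graph, k):
    # Decision problem O': do at most k vertices touch every edge?
    vertices, edges = graph
    for size in range(k + 1):
        for cand in combinations(vertices, size):
            s = set(cand)
            if all(u in s or v in s for (u, v) in edges):
                return True
    return False

def reduce_is_to_vc(graph, k):
    # Reduction f: the IS-instance (graph, k) maps to the VC-instance
    # (graph, |V| - k); computing f is obviously polynomial.
    vertices, _ = graph
    return graph, len(vertices) - k

# Property ii): o evaluates to yes iff f(o) evaluates to yes.
g = ({1, 2, 3, 4}, [(1, 2), (2, 3), (3, 4)])
for k in range(5):
    assert has_independent_set(g, k) == has_vertex_cover(*reduce_is_to_vc(g, k))

Note that only the reduction f itself needs to run in polynomial time; the exponential-time solvers above serve purely to verify the yes/no equivalence on the toy graph.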
In the remainder of this section, we will show that the proposed conceptual model is equivalent to the state-of-the-art model that provides both state and reward signals to the agent. To do so, we first have to formally define our understanding of equivalent models. Therefore, we adopt the same notion and approach as Seuken and Zilberstein [17] and use polynomial time reductions:
Definition 1. Two models M1 and M2 are said to be polynomial time equivalent if and only if M1 ≤p M2 and M2 ≤p M1.
By this definition, two models are polynomial time equivalent if they can express
the same problems and belong to the same complexity class.
Next, we formally define the models that were illustrated in the previous section. Fig. 1(a) essentially depicts a Markov decision process (MDP) which, based on [9], can be defined as follows:
Definition 2. (Markov Decision Process) A Markov decision process (MDP) is defined as a tuple ⟨S, A, δ, R⟩, where S is a finite set of world states and A is a finite set of possible actions. The state-transition function δ : S × A → Π(S) maps each state and action to a probability distribution over world states.
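Although the definition remains formal, the tuple maps naturally onto a small data structure. The following Python sketch is our illustration under the assumption that δ is stored as a table of explicit probability distributions; none of the names come from the paper.

import random

class MDP:
    # Sketch of the tuple <S, A, delta, R> from Definition 2.
    def __init__(self, states, actions, delta, reward):
        self.states = states    # finite set S of world states
        self.actions = actions  # finite set A of possible actions
        self.delta = delta      # delta[(s, a)]: dict mapping s' to Pr(s' | s, a)
        self.reward = reward    # reward[(s, a)]: real-valued reward R(s, a)

    def step(self, s, a):
        # Sample a successor state from the distribution delta(s, a) in Pi(S)
        # and return it together with the immediate reward R(s, a).
        dist = self.delta[(s, a)]
        (s_next,) = random.choices(list(dist), weights=list(dist.values()))
        return self.reward[(s, a)], s_next

mdp = MDP(states={"s0", "s1"}, actions={"a"},
          delta={("s0", "a"): {"s0": 0.1, "s1": 0.9}},
          reward={("s0", "a"): 1.0})
reward, successor = mdp.step("s0", "a")  # e.g. (1.0, "s1") with probability 0.9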