construct an engineered reward which encodes this information. Therefore, the
previously stressed coding approach can be used. Since we deal with a cooperative
game and each agent would then obtain full state information, it could
simply consider the game as an MDP whose actions are joint actions and solve
the problem using Q-Learning (see e.g. [14]). Using the engineered reward-based
concept, agents could then easily exploit the system, which certainly leads to
undesired system properties, for instance privacy issues, as agents would be able
to identify and track a particular other mobile device. As soon as we use rewards
that fulfill Prop. 1, the system can no longer be exploited in this way.
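To make the exploitation risk concrete, the following minimal sketch illustrates the idea under simplified assumptions. The encoding (packing a joint-state index into the low-order decimal digits of the reward) and all names (engineer_reward, decode, N_STATES, N_JOINT_ACTIONS, SCALE) are hypothetical and chosen for illustration only; they are not the construction or the code used in this work.

```python
from collections import defaultdict

# Illustrative sketch: the environment packs the joint-state index into the
# low-order decimal digits of the reward, so a single agent can decode the
# full state and run tabular Q-Learning over joint actions as in an MDP.
N_STATES = 100            # assumed size of the joint-state space
N_JOINT_ACTIONS = 4       # assumed number of joint actions
SCALE = 10 ** 4           # digits reserved for the encoded state index

def engineer_reward(true_reward: int, joint_state: int) -> float:
    # assumes non-negative integer true rewards for simplicity
    return true_reward + joint_state / SCALE

def decode(engineered: float) -> tuple[int, int]:
    # polynomial-time decoding: recover the true reward and the hidden state
    true_reward = int(engineered)
    joint_state = round((engineered - true_reward) * SCALE)
    return true_reward, joint_state

# Standard tabular Q-Learning update over joint actions on the decoded state.
Q = defaultdict(float)
alpha, gamma = 0.1, 0.95

def q_update(state: int, joint_action: int,
             engineered: float, next_engineered: float) -> None:
    reward, _ = decode(engineered)
    _, next_state = decode(next_engineered)
    best_next = max(Q[(next_state, a)] for a in range(N_JOINT_ACTIONS))
    Q[(state, joint_action)] += alpha * (reward + gamma * best_next
                                         - Q[(state, joint_action)])
```

Any agent that knows (or guesses) the coding scheme can thus recover the full joint state from the reward signal alone, which is exactly the knowledge transfer that rewards satisfying Prop. 1 rule out.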
Although in this work the engineered reward-based concept is used to
emphasize the risks that stem from such rewards, the concept also allows games
with a special structure to be solved efficiently. An example can be found in
[11], where a novel game class called sequential stage games and a near-optimal
MARL approach were introduced. That algorithm benefits from engineered
rewards as it can rely solely on the order of magnitude of the reward values to
distinguish between different "states" of the game, which leads to a space-efficient
approach.
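The following toy sketch illustrates the general idea of distinguishing game stages by the order of magnitude of the reward; it is a simplified illustration under the assumption that stage-k rewards are scaled to lie in [10^k, 10^(k+1)), not the actual algorithm or reward design from [11].

```python
import math

def stage_from_reward(reward: float) -> int:
    # Assumed convention: rewards of stage k lie in [10**k, 10**(k+1)),
    # so the stage index can be read off the reward's order of magnitude
    # and no explicit state variable has to be stored.
    return int(math.log10(abs(reward)))

assert stage_from_reward(3.5) == 0     # stage 0 rewards assumed in [1, 10)
assert stage_from_reward(250.0) == 2   # stage 2 rewards assumed in [100, 1000)
```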
7 Discussion and Conclusion
We investigated the expressive power of global rewards in reinforcement
learning. In particular, we focused on rewards that are calculated by the environment,
i.e. by the entity which by definition has full information about states and joint
actions. Using a polynomial-time coding, we showed how engineered rewards
containing full information can be constructed. Though this technique does not add
more expressiveness to models with full observability, it offers a way to provide
agents with full information in partially observable domains. To prevent this
undesired knowledge transfer, we argue that reward signals should only contain
true rewards but never additional polynomial-time decodable information.
Note that the equivalence results for fully observable models hold only if, as
assumed in this work, engineered rewards solely encode state information, which
may include information on (joint) actions. In general, however, the concept
allows encoding arbitrary information, e.g. state transition probabilities, observation
functions, etc., into a single numerical reward. Under such conditions, the
engineered reward-based concept can essentially reduce agent and environment to
a single unit in which all information is accessible. Clearly, the entity that gives
(and calculates) rewards is then very powerful and should be handled with care.
In that context, the contribution of this work is to point out that rewards should
be designed carefully to avoid violating model assumptions.
In the future, it remains to investigate the influence of engineered rewards
in other models. In addition, it should be investigated how engineered rewards
can be used to solve problems with special structures, as in the example
discussed above.