construct an engineered reward which encodes this information. Therefore, the
previously stressed coding approach can be used. Since we deal with a cooperative
game and each agent would then obtain full state information, it could
simply consider the game as an MDP whose actions are joint actions and solve
the problem using Q-Learning (see e.g. [14]). Using the engineered reward-based
concept, agents could then easily exploit the system, which certainly leads to
undesired system properties, for instance privacy issues, as agents would be able
to identify and track a particular other mobile device. As soon as we use rewards
that fulfill Prop. 1, the system can no longer be exploited in this way.
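To make the exploitation risk concrete, the following minimal sketch illustrates the idea under simplified assumptions. The encoding (packing a joint-state index into the low-order decimal digits of the reward) and all names (engineer_reward, decode, N_STATES, N_JOINT_ACTIONS, SCALE) are hypothetical and chosen for illustration only; they are not the construction or the code used in this work.

```python
from collections import defaultdict

# Illustrative sketch: the environment packs the joint-state index into the
# low-order decimal digits of the reward, so a single agent can decode the
# full state and run tabular Q-Learning over joint actions as in an MDP.
N_STATES = 100            # assumed size of the joint-state space
N_JOINT_ACTIONS = 4       # assumed number of joint actions
SCALE = 10 ** 4           # digits reserved for the encoded state index

def engineer_reward(true_reward: int, joint_state: int) -> float:
    # assumes non-negative integer true rewards for simplicity
    return true_reward + joint_state / SCALE

def decode(engineered: float) -> tuple[int, int]:
    # polynomial-time decoding: recover the true reward and the hidden state
    true_reward = int(engineered)
    joint_state = round((engineered - true_reward) * SCALE)
    return true_reward, joint_state

# Standard tabular Q-Learning update over joint actions on the decoded state.
Q = defaultdict(float)
alpha, gamma = 0.1, 0.95

def q_update(state: int, joint_action: int,
             engineered: float, next_engineered: float) -> None:
    reward, _ = decode(engineered)
    _, next_state = decode(next_engineered)
    best_next = max(Q[(next_state, a)] for a in range(N_JOINT_ACTIONS))
    Q[(state, joint_action)] += alpha * (reward + gamma * best_next
                                         - Q[(state, joint_action)])
```

Any agent that knows (or guesses) the coding scheme can thus recover the full joint state from the reward signal alone, which is exactly the knowledge transfer that rewards satisfying Prop. 1 rule out.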
Although in this work the engineered reward-based concept is used to
emphasize the risks that stem from such rewards, the concept also allows games
with a special structure to be solved efficiently. An example can be found in
[11], where a novel game class called sequential stage games and a near-optimal
MARL approach were introduced. That algorithm benefits from engineered
rewards as it can rely solely on the order of magnitude of the reward values to
distinguish between different "states" of the game, which leads to a space-efficient
approach.
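The following toy sketch illustrates the general idea of distinguishing game stages by the order of magnitude of the reward; it is a simplified illustration under the assumption that stage-k rewards are scaled to lie in [10^k, 10^(k+1)), not the actual algorithm or reward design from [11].

```python
import math

def stage_from_reward(reward: float) -> int:
    # Assumed convention: rewards of stage k lie in [10**k, 10**(k+1)),
    # so the stage index can be read off the reward's order of magnitude
    # and no explicit state variable has to be stored.
    return int(math.log10(abs(reward)))

assert stage_from_reward(3.5) == 0     # stage 0 rewards assumed in [1, 10)
assert stage_from_reward(250.0) == 2   # stage 2 rewards assumed in [100, 1000)
```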
7 Discussion and Conclusion
We investigated the expressive power of global rewards in reinforcement
learning. In particular, we focused on rewards that are calculated by the environment,
i.e. by the entity which by definition has full information about states and joint
actions. Using a polynomial-time coding, we showed how engineered rewards
containing full information can be constructed. Though this technique does not add
more expressiveness to models with full observability, it offers a way to provide
agents with full information in partially observable domains. To prevent this
undesired knowledge transfer, we argue that reward signals should only contain
true rewards but never additional polynomial-time decodable information.
Note that the equivalence results for fully observable models hold only if, as
assumed in this work, engineered rewards solely encode state information, which
may include information on (joint) actions. In general, however, the concept
allows encoding arbitrary information, e.g. state transition probabilities, observation
functions, etc., into a single numerical reward. Under such conditions, the
engineered reward-based concept can essentially reduce agent and environment to
a single unit in which all information is accessible. Clearly, the entity that gives
(and calculates) rewards is then very powerful and should be handled with care.
In that context, the contribution of this work is to point out that rewards should
be designed carefully to avoid violating model assumptions.
In the future, it remains to investigate the influence of engineered rewards
in other models. In addition, it should be investigated how engineered rewards
can be used to solve problems with special structures, as in the example
discussed above.